| From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> | 
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | CPU costs of random_zipfian in pgbench | 
| Date: | 2019-02-17 00:30:56 | 
| Message-ID: | b5e172e9-ad22-48a3-86a3-589afa20e8f7@2ndquadrant.com | 
Hi,
I'm trying to use random_zipfian() for benchmarking skewed data sets,
and I ran head-first into an issue with rather excessive CPU costs.
Consider an example like this:
pgbench -i -s 10000 test
pgbench -s 10000 -f zipf.sql -T 30 test
where zipf.sql does this:
  \set id random_zipfian(1, 100000 * :scale, 0.1)
  BEGIN;
  SELECT * FROM pgbench_accounts WHERE aid = :id;
  END;
Unfortunately, this produces something like this:
  transaction type: zipf.sql
  scaling factor: 10000
  query mode: simple
  number of clients: 1
  number of threads: 1
  duration: 30 s
  number of transactions actually processed: 1
  latency average = 43849.143 ms
  tps = 0.022805 (including connections establishing)
  tps = 0.022806 (excluding connections establishing)
which is somewhat ... not great, I guess. This happens because
generalizedHarmonicNumber() does this:
	for (i = n; i > 1; i--)
		ans += pow(i, -s);
where n happens to be 1000000000 (the range passed to random_zipfian), so
the loop takes quite a bit of time. So much, in fact, that we only ever
complete a single transaction, because this work happens in the context
of the first transaction, and so it counts against the 30-second limit.
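Just to quantify that, here's a trivial standalone sketch (not pgbench
code, just the same loop shape) that times the sum for n = 1e9 and
s = 0.1:

  /* time the generalized harmonic sum; mirrors generalizedHarmonicNumber() */
  #include <math.h>
  #include <stdio.h>
  #include <time.h>

  int
  main(void)
  {
      long long   n = 1000000000LL;   /* 100000 * :scale from zipf.sql */
      double      s = 0.1;
      double      ans = 0.0;
      clock_t     start = clock();

      for (long long i = n; i > 1; i--)
          ans += pow((double) i, -s);

      printf("H(%lld, %g) = %f, computed in %.1f s of CPU time\n",
             n, s, ans + 1.0,
             (double) (clock() - start) / CLOCKS_PER_SEC);
      return 0;
  }

A billion pow() calls easily amount to tens of seconds of CPU time, which
accounts for pretty much all of the ~44 second latency above.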
The docs actually do mention performance of this function:
    The function's performance is poor for parameter values close and
    above 1.0 and on a small range.
But that does not seem to cover the example I just posted, because 0.1
is not particularly close to (or above) 1.0, and a range of 1e9 values
hardly counts as "small".
I see this part of random_zipfian comes from "Quickly Generating
Billion-Record Synthetic Databases", which deals with generating data
sets, where wasting a bit of time is not a huge deal. But when running
benchmarks it matters much more. So maybe there's a disconnect here?
Interestingly enough, I don't see this issue with parameter values above
1.0, no matter how close to 1.0 those are. The throughput does seem lower
than with uniform access, though, so this part of random_zipfian is not
exactly cheap either.
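That's presumably because parameters above 1.0 take the other code path,
which uses a rejection method (IIRC based on the algorithm for the zeta
distribution in Devroye's "Non-Uniform Random Variate Generation") rather
than the harmonic sum: there is no O(n) setup, but each draw costs a
couple of pow() calls plus the occasional retry, which would also explain
the lower throughput. A rough standalone sketch of that method (my
paraphrase for illustration, not the actual pgbench code; drand48()
stands in for pgbench's PRNG):

  /* rejection-based zipfian draw for s > 1 (cf. Devroye's zeta rejection
   * algorithm, truncated to [1, n]); illustration only */
  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>

  static long long
  zipf_draw(long long n, double s)
  {
      double  b = pow(2.0, s - 1.0);

      for (;;)
      {
          double  u = drand48();
          double  v = drand48();
          double  x = floor(pow(u, -1.0 / (s - 1.0)));
          double  t = pow(1.0 + 1.0 / x, s - 1.0);

          /* accept if within range and under the acceptance curve */
          if (x <= (double) n && v * x * (t - 1.0) / (b - 1.0) <= t / b)
              return (long long) x;
      }
  }

  int
  main(void)
  {
      long long   counts[10] = {0};

      srand48(42);

      /* a million draws over a 1e9 range, no O(n) setup needed */
      for (int i = 0; i < 1000000; i++)
      {
          long long   x = zipf_draw(1000000000LL, 1.05);

          if (x <= 10)
              counts[x - 1]++;
      }

      for (int i = 0; i < 10; i++)
          printf("value %2d: %lld draws\n", i + 1, counts[i]);
      return 0;
  }

Even without any setup cost, that's several pow() calls (and sometimes
several attempts) per value, so it's never going to match plain uniform
random.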
Now, I don't know what to do about this. Ideally we'd have a faster
algorithm to generate zipfian distributions, though I don't know if such
a thing exists. Or maybe we could precompute the distribution first,
without counting that work towards the benchmark duration.
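For the second idea: the expensive part is a pure function of (n, s), so
it would be enough to compute it once per distinct (n, s) pair during
setup, before the benchmark clock starts, leaving only the cheap per-draw
work in the timed part. A hypothetical sketch of that split (the names
and the cache structure are made up, this is not a patch):

  /* hypothetical "precompute before timing" split; not pgbench code */
  #include <math.h>
  #include <stdio.h>

  typedef struct
  {
      long long   n;          /* range of the distribution */
      double      s;          /* zipfian parameter */
      double      harmonic;   /* precomputed generalized harmonic number */
  } zipf_cache_entry;

  /* the O(n) part, same sum as above */
  static double
  generalized_harmonic(long long n, double s)
  {
      double  ans = 0.0;

      for (long long i = n; i > 1; i--)
          ans += pow((double) i, -s);
      return ans + 1.0;
  }

  /* setup phase: fill the cache for every (n, s) the script uses,
   * before the benchmark clock starts */
  static void
  precompute(zipf_cache_entry *cache, int ncalls)
  {
      for (int i = 0; i < ncalls; i++)
          cache[i].harmonic = generalized_harmonic(cache[i].n, cache[i].s);
  }

  int
  main(void)
  {
      /* the single random_zipfian() call from zipf.sql */
      zipf_cache_entry    cache[] = {{1000000000LL, 0.1, 0.0}};

      precompute(cache, 1);       /* this would not count towards -T */

      /* the timed phase would start here; per-transaction draws only
       * look up cache[i].harmonic instead of recomputing it */
      printf("H(%lld, %g) = %f\n", cache[0].n, cache[0].s, cache[0].harmonic);
      return 0;
  }

The catch is that pgbench would have to know all the (n, s) combinations
up front, which is easy for constants but not for arguments computed at
run time.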
But I guess the very least we can/should do is improve the docs to make
it more obvious which cases are expected to be slow.
regards
-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services