Re: pgbench rate limiting changes transaction latency computation

From: Andres Freund <andres(at)anarazel(dot)de>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pgbench rate limiting changes transaction latency computation
Date: 2019-06-11 07:12:21
Message-ID: 20190611071221.3qc2yyalsvvyilw6@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-06-11 08:36:55 +0200, Fabien COELHO wrote:
> > I noticed that pgbench's -R influences not just the computation of lag,
> > but also of latency. That doesn't look right to me, but maybe I'm just
> > missing something?
> >
> > It's quite easy to demonstrate when running pgbench with/without
> > progress report at a transaction rate that's around the limit of what
> > the server can do:
> >
> > andres(at)alap4:~/src/postgresql$ pgbench -n -M prepared -c 1 -j 1 -T 100000 -P1 -r -S pgbench_10
> > progress: 1.0 s, 37416.3 tps, lat 0.027 ms stddev 0.013
> > progress: 2.0 s, 37345.1 tps, lat 0.027 ms stddev 0.011
> > progress: 3.0 s, 38787.8 tps, lat 0.026 ms stddev 0.009
> > ...
> >
> > andres(at)alap4:~/src/postgresql$ pgbench -n -M prepared -c 1 -j 1 -T 100000 -P1 -r -S -R 37000 pgbench_10
> > progress: 1.0 s, 32792.8 tps, lat 81.795 ms stddev 35.552, lag 81.765 ms
> > progress: 2.0 s, 37770.6 tps, lat 113.194 ms stddev 4.651, lag 113.168 ms
> > progress: 3.0 s, 37006.3 tps, lat 113.905 ms stddev 5.007, lag 113.878 ms
>
> [...]
>
> > Fabien, is this a bug, or am I misunderstanding something?
>
> This behavior under -R is fully voluntary, and the result above just show
> that the database cannot really keep up with the load, which is simply the
> case, so for me it is okay to show bad figures.

I mean, you just turned one named value, into a different one, without
renaming it. And the new meaning under -R, is basically the same as one
that's already there (lag). Also note that it also can actually keep up
in the above example.

> The idea under throttling is to model a client which would want the result
> of a query at a certain point in time, say a query for a web page which is
> being generated, which is the scheduled time. It is the when the client
> knows it wants an answer. If it is not processed immediately, that is bad
> for its client perceived latency.

> Whether this is due to lag (i.e. the server is loaded and cannot start to
> process the answer) or because the server is slow to answer is irrelevant,
> the client is waiting, the web page is not generated, the system is slow. So
> latency under -R is really "client latency", not only query latency, as it
> is documented.

What does that have to do with incorporating the same data into both lag
and latency? I just fail to see what the point is, except to make it
unnecessarily harder to compare postgres' behaviour under both a
throttled and push-it-to-the-breaking point loads.

How long individual transactions take, and how much variance there is in
that, is something *crucial* to optimize for. *Especially* when the
machine/load is provisioned in a way to not overload the machine.

How is e.g.
progress: 1.6 s, 0.0 tps, lat 0.000 ms stddev 0.000, lag 0.000 ms
progress: 2.0 s, 103546.5 tps, lat 1584.161 ms stddev 35.589, lag 1582.043 ms
progress: 3.0 s, 108535.2 tps, lat 1347.619 ms stddev 101.782, lag 1346.170 ms
progress: 4.0 s, 108528.8 tps, lat 996.603 ms stddev 106.052, lag 995.159 ms
progress: 5.0 s, 109468.8 tps, lat 633.464 ms stddev 108.483, lag 632.030 ms
progress: 6.0 s, 110606.7 tps, lat 252.923 ms stddev 110.391, lag 251.505 ms
progress: 7.0 s, 84253.3 tps, lat 6.829 ms stddev 15.067, lag 6.423 ms
progress: 8.0 s, 80470.7 tps, lat 0.142 ms stddev 0.079, lag 0.017 ms
progress: 9.0 s, 80104.2 tps, lat 0.142 ms stddev 0.081, lag 0.017 ms
progress: 10.0 s, 80277.0 tps, lat 0.152 ms stddev 0.150, lag 0.017 ms

the lat column adds basically nothing over the lag column here.

more useful than:
progress: 1.3 s, 0.0 tps, lat 0.000 ms stddev 0.000, lag 0.000 ms
progress: 2.0 s, 116315.6 tps, lat 1.425 ms stddev 1.440, lag 1087.076 ms
progress: 3.0 s, 113526.2 tps, lat 1.390 ms stddev 0.408, lag 709.908 ms
progress: 4.0 s, 111816.4 tps, lat 1.407 ms stddev 0.399, lag 302.866 ms
progress: 5.0 s, 88061.9 tps, lat 0.543 ms stddev 0.652, lag 16.526 ms
progress: 6.0 s, 80045.4 tps, lat 0.128 ms stddev 0.079, lag 0.017 ms
progress: 7.0 s, 79636.3 tps, lat 0.124 ms stddev 0.073, lag 0.016 ms
progress: 8.0 s, 80535.3 tps, lat 0.125 ms stddev 0.073, lag 0.016 ms

where I can see that the transactions are now actually fast enough.
Obviously this is a toy example, but this really make -R close to
useless to me. I often want to switch from a unthrottled to a 90% load,
and improve the outlier beheaviour - but that outlier behaviour is
hidden due to this redefinition of lat (as the issue is now reported
over a much longer period of time, as it includes lag).

I think we should just restore lat to a sane behaviour under -R, and if
you want to have lat + lag as a separate column in -R mode, then let's
do that.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jamison, Kirk 2019-06-11 07:34:35 [PATCH] Speedup truncates of relation forks
Previous Message Amit Khandekar 2019-06-11 06:54:07 Re: Minimal logical decoding on standbys