Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Shaun Thomas <shaun(dot)thomas(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date: 2019-06-25 16:02:35
Message-ID: 20190625160235.6ctryjzvw5je7cet@development
Lists: pgsql-hackers

On Mon, Jun 24, 2019 at 07:34:19PM -0400, James Coleman wrote:
>On Mon, Jun 24, 2019 at 4:16 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:
>> >On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> >>
>> >> On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331(at)gmail(dot)com> wrote:
>> >>>
>> >>> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>...
>> >>> As I see it the two most significant concerning cases right now are:
>> >>> 1. Very large batches (in particular where the batch is effectively
>> >>> all of the matching rows such that we're really just doing a standard
>> >>> sort).
>> >>> 2. Many very small batches.
>> >>
>> >>
>> >> What is the specific use case for this? It sounds like quite a general case.
>> >
>> >They are both general cases in some sense, but the concerns lie mostly
>> >with what happens when they're unexpectedly encountered. For example,
>> >if the expected row count or group size is off by a good bit and we
>> >effectively have to perform a sort of all (or most) possible rows.
>> >
>> >If we can get the performance to a point where that misestimated row
>> >count or group size doesn't much matter, then ISTM including the patch
>> >becomes a much more obvious total win.
>> >
>>
>> Yes, that seems like a reasonable approach. Essentially, we're trying to
>> construct plausible worst case examples, and then minimize the overhead
>> compared to regular sort. If we get sufficiently close, then it's fine
>> to rely on somewhat shaky stats - like group size estimates.
>
>I have a bit of a mystery in my performance testing. I've been setting
>up a table like so:
>
>create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
>insert into foo(owner_fk, created_at)
>select fk_t.i, now() - (time_t.i::text || ' minutes')::interval
>from generate_series(1, 10000) time_t(i)
>cross join generate_series(1, 1000) fk_t(i);
>-- double up on one set to guarantee matching prefixes
>insert into foo (owner_fk, created_at) select owner_fk, created_at
>from foo where owner_fk = 23;
>create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);
>analyze foo;
>
>and then I have the following query:
>
>select *
>from foo
>where owner_fk = 23
>order by created_at desc, pk desc
>limit 20000;
>
>The idea here is to force a bit of a worst case for small groups: we
>have 10,000 batches (i.e., equal prefix groups) of 2 tuples each and
>then query with a limit matching the actual number of rows we know
>will match the query -- so even though there's a limit we're forcing a
>total sort (and also guaranteeing both plans have to touch the same
>number of rows). Note: I know that batches of size 1 are actually the
>worst case, but I chose batches of two because I've also been testing
>a change that would skip the sort entirely for single-tuple batches.
>
>On master (really the commit right before the current revision of the
>patch), I get:
>latency average = 14.271 ms
>tps = 70.075243 (excluding connections establishing)
>
>With the patch (and incremental sort enabled):
>latency average = 11.975 ms
>tps = 83.512090 (excluding connections establishing)
>
>With the patch (but incremental sort disabled):
>latency average = 11.884 ms
>tps = 84.149834 (excluding connections establishing)
>
>All of those are 60-second pgbench runs with a single thread.
>
>So we have a very substantial speedup with patch *even if the new
>feature isn't enabled*. I've confirmed the plan looks the same on
>patched with incremental sort disabled and master. The only changes
>that would seem to really affect execution time would be the changes
>to tuplesort.c, but looking through them I don't see anything I'd
>expect to change things so dramatically.
>
>Any thoughts on this?
>

I can reproduce the same thing, so it's not just you. On my machine, I see
these tps numbers (average of 10 runs, 60 seconds each):

master: 65.177
patched (on): 80.368
patched (off): 80.750

The numbers are very consistent (within 1 tps).
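For reference, runs like these can be driven with a pgbench custom script and averaged with a one-liner; a sketch, where the file paths and the database name "postgres" are assumptions:

```shell
# bench.sql holds the test query from upthread; pgbench replays it on a
# single client/thread (database name "postgres" is an assumption)
cat > /tmp/incsort_bench.sql <<'EOF'
select * from foo where owner_fk = 23 order by created_at desc, pk desc limit 20000;
EOF

# average the "tps = <number>" lines that pgbench prints per run
avg_tps() {
  grep -oE 'tps = [0-9.]+' "$1" | awk '{ sum += $3; n++ } END { printf "%.3f\n", sum / n }'
}

# ten 60-second single-thread runs, then the average; guarded so this
# is a no-op on machines without pgbench installed
if command -v pgbench >/dev/null; then
  for i in $(seq 10); do
    pgbench -n -f /tmp/incsort_bench.sql -c 1 -j 1 -T 60 postgres
  done > /tmp/incsort_runs.log
  avg_tps /tmp/incsort_runs.log
fi
```

(-n skips pgbench's own vacuuming, which is pointless against a custom table; -T is run duration in seconds.)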

I've done a bit of CPU profiling, and on master I see this:

13.84% postgres postgres [.] comparetup_heap
4.83% postgres postgres [.] qsort_tuple
3.87% postgres postgres [.] pg_ltostr_zeropad
3.55% postgres postgres [.] pg_ltoa
3.19% postgres postgres [.] AllocSetAlloc
2.68% postgres libc-2.28.so [.] __GI___strlen_sse2
2.38% postgres postgres [.] LWLockRelease
2.38% postgres postgres [.] AppendSeconds.constprop.9
2.22% postgres libc-2.28.so [.] __memmove_sse2_unaligned_erms
2.17% postgres postgres [.] GetPrivateRefCountEntry
2.03% postgres postgres [.] j2date
...

while on patched versions I see this:

4.60% postgres postgres [.] pg_ltostr_zeropad
4.51% postgres postgres [.] pg_ltoa
3.50% postgres postgres [.] AllocSetAlloc
3.34% postgres libc-2.28.so [.] __GI___strlen_sse2
2.99% postgres postgres [.] LWLockRelease
2.84% postgres postgres [.] AppendSeconds.constprop.9
2.65% postgres postgres [.] GetPrivateRefCountEntry
2.64% postgres postgres [.] j2date
2.60% postgres postgres [.] printtup
2.56% postgres postgres [.] heap_hot_search_buffer
...
1.35% postgres postgres [.] comparetup_heap
...

So either we're calling comparetup_heap less often, or it's cheaper.
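For reference, profiles of this shape can be captured with Linux perf; a sketch, where the helper path, the PID discovery via pgrep, and the ~99 Hz sampling rate are all assumptions:

```shell
# write out a small profiling helper: sample the newest postgres
# process at ~99 Hz with call graphs while the benchmark runs, then
# print the hottest symbols
cat > /tmp/profile_backend.sh <<'EOF'
#!/bin/sh
pid=$(pgrep -n postgres)
perf record -F 99 -g -p "$pid" -- sleep 60
perf report --stdio --sort symbol | head -20
EOF
chmod +x /tmp/profile_backend.sh
```

Comparing a master recording against a patched one symbol-by-symbol is what `perf diff` is for.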

But it seems to be very dependent on the data set. If you do this:

create table foo_2 as select * from foo order by random();
alter table foo_2 add primary key (pk);
create index idx_foo_2_on_owner_and_created_at on foo_2 (owner_fk, created_at);

and then run the test against this table, there's no difference.

So my guess is that this particular data set triggers slightly different
behavior in tuplesort, reducing the cost of comparetup_heap. The speedup
is quite significant (~20% on my system); the question is how widely
applicable it is.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
