Quick Links

Re: Optimize WindowAgg's use of tuplestores

From:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	David Rowley <dgrowley(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Optimize WindowAgg's use of tuplestores
Date:	2024-07-12 06:29:12
Message-ID:	CAExHW5sAak2LJ9Xp+EBqM2Chj5nsfYWbyq6F-7YhK5SiHu5WFA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Jul 11, 2024 at 5:39 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
> On Wed, 10 Jul 2024 at 02:42, Ashutosh Bapat
> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> > Observations
> > 1. The numbers corresponding to 10 and 100 partitions are higher when
> > patched. That might be just noise. I don't see any reason why it would
> > impact negatively when there are a small number of partitions. The
> > lower partition cases also have a higher number of rows per partition,
> > so is the difference between MemoryContextDelete() vs
> > MemoryContextReset() making any difference here. May be worth
> > verifying those cases carefully. Otherwise upto 1000 partitions, it
> > doesn't show any differences.
>
> I think this might just be noise as a result of rearranging code. In
> terms of C code, I don't see any reason for it to be slower. If you
> look at GenerationDelete() (as what is getting called from
> MemoryContextDelete()), it just calls GenerationReset(). So resetting
> is going to always be less work than deleting the context, especially
> given we don't need to create the context again when we reset it.
>
> I wrote the attached script to see if I can also see the slowdown and
> I do see the patched code come out slightly slower (within noise
> levels) in lower partition counts.
>
> To get my compiler to produce code in a more optimal order for the
> common case, I added unlikely() to the "if (winstate->all_first)"
> condition. This is only evaluated on the first time after a rescan,
> so putting that code at the end of the function makes more sense. The
> attached v2 patch has it this way. You can see the numbers look
> slightly better in the attached graph.

The change to all_first seems unrelated to the tuplestore
optimization. But it's bringing the results inline with the master for
lower number of partitions.

Thanks for the script. I have similar results on my laptop.
From master
Testing with 1000000 partitions
latency average = 505.738 ms
latency average = 509.407 ms
latency average = 522.461 ms
Testing with 100000 partitions
latency average = 329.026 ms
latency average = 327.504 ms
latency average = 342.556 ms
Testing with 10000 partitions
latency average = 299.496 ms
latency average = 298.266 ms
latency average = 306.773 ms
Testing with 1000 partitions
latency average = 299.006 ms
latency average = 302.188 ms
latency average = 301.701 ms
Testing with 100 partitions
latency average = 305.411 ms
latency average = 286.935 ms
latency average = 302.432 ms
Testing with 10 partitions
latency average = 288.091 ms
latency average = 294.506 ms
latency average = 305.082 ms
Testing with 1 partitions
latency average = 301.121 ms
latency average = 319.615 ms
latency average = 301.141 ms

Patched
Testing with 1000000 partitions
latency average = 351.683 ms
latency average = 352.516 ms
latency average = 352.086 ms
Testing with 100000 partitions
latency average = 300.626 ms
latency average = 303.584 ms
latency average = 306.959 ms
Testing with 10000 partitions
latency average = 289.560 ms
latency average = 302.248 ms
latency average = 297.423 ms
Testing with 1000 partitions
latency average = 308.600 ms
latency average = 299.215 ms
latency average = 289.681 ms
Testing with 100 partitions
latency average = 301.216 ms
latency average = 286.240 ms
latency average = 291.232 ms
Testing with 10 partitions
latency average = 305.260 ms
latency average = 296.707 ms
latency average = 300.266 ms
Testing with 1 partitions
latency average = 316.199 ms
latency average = 314.043 ms
latency average = 309.425 ms

Now that you are also seeing the slowdown with your earlier patch, I
am wondering whether adding unlikely() by itself is a good
optimization. There might be some other reason behind the perceived
slowdown. How do the numbers look when you just add unlikely() without
any other changes?

--
Best Wishes,
Ashutosh Bapat

In response to

Re: Optimize WindowAgg's use of tuplestores at 2024-07-11 12:09:03 from David Rowley

Responses

Re: Optimize WindowAgg's use of tuplestores at 2024-07-15 06:02:42 from Ashutosh Bapat

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2024-07-12 06:42:21	Re: Flush pgstats file during checkpoints
Previous Message	Tatsuo Ishii	2024-07-12 05:41:08	Re: Optimize WindowAgg's use of tuplestores