Re: BitmapHeapScan streaming read user and prelim refactoring

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2024-03-03 00:15:28
Message-ID: CAAKRu_b4OXkrc01P6a0X6F=YAXSCBrUV79wQL1k6WOCGOyT10A@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 2, 2024 at 6:59 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>
>
> On 3/1/24 17:51, Melanie Plageman wrote:
> > On Fri, Mar 1, 2024 at 9:05 AM Tomas Vondra
> > <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >>
> >> On 3/1/24 02:18, Melanie Plageman wrote:
> >>> On Thu, Feb 29, 2024 at 6:44 PM Tomas Vondra
> >>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >>>>
> >>>> On 2/29/24 23:44, Tomas Vondra wrote:
> >>>> 1) On master there's a clear difference between the eic=0 and eic=1
> >>>> cases, but on the patched build there's literally no difference - for
> >>>> example, the "uniform" distribution is clearly not great for
> >>>> prefetching, but eic=0 regresses to the poor eic=1 behavior.
> >>>
> >>> Yes, so eic=0 and eic=1 are identical with the streaming read API.
> >>> That is, eic 0 does not disable prefetching. Thomas is going to update
> >>> the streaming read API to avoid issuing an fadvise for the last block
> >>> in a range before issuing a read -- which would mean no prefetching
> >>> with eic 0 and eic 1. Not doing prefetching with eic 1 actually seems
> >>> like the right behavior -- which would be different than what master
> >>> is doing, right?
> >>
> >> I don't think we should stop doing prefetching for eic=1, or at least
> >> not based just on these charts. I suspect these "uniform" charts are not
> >> a great example for the prefetching, because it's about distribution of
> >> individual rows, and even a small fraction of rows may match most of the
> >> pages. It's great for finding strange behaviors / corner cases, but
> >> probably not a sufficient reason to change the default.
> >
> > Yes, I would like to see results from a data set where selectivity is
> > more correlated to pages/heap fetches. But, I'm not sure I see how
> > that is related to prefetching when eic = 1.
> >
> >> I think it makes sense to issue a prefetch one page ahead, before
> >> reading/processing the preceding one, and it's a fairly conservative
> >> setting, and I assume the default was chosen for a reason / after
> >> discussion.
> >
> > Yes, I suppose the overhead of an fadvise does not compare to the IO
> > latency of synchronously reading that block. Actually, I bet the
> > regression I saw by accidentally moving BitmapAdjustPrefetchIterator()
> > after table_scan_bitmap_next_block() would be similar to the
> > regression introduced by making eic = 1 not prefetch.
> >
> > When you think about IO concurrency = 1, it doesn't imply prefetching
> > to me. But, I think we want to do the right thing and have parity with
> > master.
> >
> >> My suggestion would be to keep the master behavior unless not practical,
> >> and then maybe discuss changing the details later. The patch is already
> >> complicated enough, better to leave that discussion for later.
> >
> > Agreed. Speaking of which, we need to add back use of tablespace IO
> > concurrency for the streaming read API (which is used by
> > BitmapHeapScan in master).
> >
> >>> With very low selectivity, you are less likely to get readahead
> >>> (right?) and similarly less likely to be able to build up > 8kB IOs --
> >>> which is one of the main value propositions of the streaming read
> >>> code. I imagine that this larger read benefit is part of why the
> >>> performance is better at higher selectivities with the patch. This
> >>> might be a silly experiment, but we could try decreasing
> >>> MAX_BUFFERS_PER_TRANSFER on the patched version and see if the
> >>> performance gains go away.
> >>
> >> Sure, I can do that. Do you have any particular suggestion what value to
> >> use for MAX_BUFFERS_PER_TRANSFER?
> >
> > I think setting it to 1 would be the same as master -- always doing
> > only 8kB reads. The only thing about that is that I imagine the other
> > streaming read code has some overhead which might end up being a
> > regression on balance even with the prefetching if we aren't actually
> > using the ranges/vectored capabilities of the streaming read
> > interface. Maybe if you just run it for one of the very obvious
> > performance improvement cases? I can also try this locally.
> >
>
> Here's some results from a build with
>
> #define MAX_BUFFERS_PER_TRANSFER 1
>
> There are three columns:
>
> - master
> - patched (original patches, with MAX_BUFFERS_PER_TRANSFER=128kB)
> - patched-single (MAX_BUFFERS_PER_TRANSFER=8kB)
>
> The color scales are always branch compared to master.
>
> I think the expectation was that setting the transfer to 1 would make it
> closer to master, reducing some of the regressions. But in practice the
> effect is the opposite.
>
> - In "cached" runs, this eliminates the small improvements (light
> green), but leaves the regressions behind.

For the cached runs, I actually would have expected
MAX_BUFFERS_PER_TRANSFER = 1 to eliminate the regressions. Pinning more
buffers will only hurt us for cached workloads. This is evidence that
we may need to control the number of pinned buffers differently when
there has been a run of fully cached blocks.

> - In "uncached" runs, this exacerbates the regressions, particularly for
> low selectivities (small values of matches).

For the uncached runs, I actually expected it to eliminate the
performance gains that we saw with the patches applied. With
MAX_BUFFERS_PER_TRANSFER=1, we don't get the benefit of larger IOs and
fewer system calls, but we still have the overhead of the streaming
read machinery. I was hoping to prove that the performance
improvements we saw with all the patches applied were due to
MAX_BUFFERS_PER_TRANSFER being > 1 causing fewer, bigger reads.
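
To make "fewer, bigger reads" concrete, here's a toy example (not the
streaming read code itself) that just counts how many IOs a sorted list
of block numbers from the bitmap would turn into for a given cap on
buffers per transfer:

#include <stdio.h>

/*
 * Count the read() calls needed if runs of consecutive blocks can be
 * combined into a single IO of at most max_buffers_per_transfer blocks.
 * With the cap at 1, every block is its own 8kB read, as on master.
 */
static int
count_reads(const unsigned *blocks, int nblocks, int max_buffers_per_transfer)
{
	int		nreads = 0;
	int		i = 0;

	while (i < nblocks)
	{
		int		len = 1;

		/* extend the range while blocks stay consecutive and under the cap */
		while (i + len < nblocks &&
			   blocks[i + len] == blocks[i + len - 1] + 1 &&
			   len < max_buffers_per_transfer)
			len++;

		nreads++;
		i += len;
	}
	return nreads;
}

int
main(void)
{
	unsigned	blocks[] = {10, 11, 12, 13, 40, 41, 90};

	printf("cap=16: %d reads\n", count_reads(blocks, 7, 16));	/* 3 */
	printf("cap=1:  %d reads\n", count_reads(blocks, 7, 1));	/* 7 */
	return 0;
}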

It did eliminate some performance gains, however, primarily for
cyclic-fuzz at lower selectivities. I am a little confused by this
part because with lower selectivities there are likely fewer
consecutive blocks that can be combined into one IO.

And, on average, we still see a lot of performance improvements that
were not eliminated by MAX_BUFFERS_PER_TRANSFER = 1. Hmm.

- Melanie
