Re: BitmapHeapScan streaming read user and prelim refactoring

From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2025-04-15 17:58:50
Message-ID: CAJVSvF66dP-WpiVOnOV2Bj1YGKYXEz7w6Kns++Jv4caGyJ-8+A@mail.gmail.com

Thanks for the comments!

On Tue, Apr 15, 2025 at 3:11 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2025-04-14 09:58:19 -0700, James Hunter wrote:

> > I see two orthogonal problems, in processing Bitmap Heap pages in
> > parallel: (1) we need to prefetch enough pages, far enough in advance,
> > to hide read latency; (2) later, every parallel worker needs to be
> > given a set of pages to process, in a way that minimizes contention.
> >
> > The easiest way to hand out work to parallel workers (and often the
> > best) is to maintain a single, shared, global work queue. Just put
> > whatever pages you prefetch into a FIFO queue, and let each worker
> > pull one piece of "work" off that queue. In this way, there's no
> > "ramp-down" problem.
>
> If you just issue prefetch requests separately you'll get no read combining -
> and it turns out that that is a really rather significant loss, both on the
> storage layer and just due to the syscall overhead. So you do need to perform
> batching when issuing IO. Which in turn requires a bit of rampup logic etc.

Right, so if you need to do batching anyway, contention on a shared
queue will be minimal, because it's amortized over the batch size.
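
To make that concrete, here is a minimal sketch of what I mean (not
actual PG code -- SharedPageQueue, QUEUE_SIZE, and dequeue_batch are
names I'm inventing, and initialization/enqueue are elided): the
spinlock is taken once per batch, not once per page, so contention
scales with the number of batches, not the number of pages:

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/spin.h"

    #define QUEUE_SIZE 1024     /* arbitrary, for illustration */

    typedef struct SharedPageQueue
    {
        slock_t     mutex;      /* protects head and tail */
        int         head;       /* next slot to dequeue */
        int         tail;       /* next slot to enqueue */
        BlockNumber pages[QUEUE_SIZE];
    } SharedPageQueue;

    /*
     * Pull up to batch_size prefetched pages off the shared queue;
     * returns the number actually dequeued (0 once the queue drains).
     */
    static int
    dequeue_batch(SharedPageQueue *q, BlockNumber *out, int batch_size)
    {
        int         n = 0;

        SpinLockAcquire(&q->mutex);
        while (n < batch_size && q->head != q->tail)
        {
            out[n++] = q->pages[q->head];
            q->head = (q->head + 1) % QUEUE_SIZE;
        }
        SpinLockRelease(&q->mutex);

        return n;
    }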

I agree about ramp *up* logic, I just don't see the need for ramp *down* logic.
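
(By ramp-up I mean roughly what I understand read_stream.c to do
already: start with a small look-ahead distance and grow it, so short
scans don't pay for a deep queue. A toy version, with made-up helper
names:

    int         distance = 1;

    while (have_more_blocks())
    {
        issue_prefetch_batch(distance);     /* hypothetical */
        distance = Min(distance * 2, max_distance);
    }

Ramp-*down*, by contrast, only exists because batches get affinitized
to workers; see below.)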

> > This is why a single shared queue is so nice, because it avoids
> > workers being idle. But I am confused by your proposal, which seems to
> > be trying to get the behavior of a single shared queue, but
> > implemented with the added complexity of multiple queues.
> >
> > Why not just use a single queue?
>
> Accessing buffers in a maximally interleaved way, which is what a single queue
> would give you, adds a good bit of overhead when you have a lot of memory,
> because e.g. TLB hit rate is minimized.

Well, that's a trade-off, right? As you point out, you need to do
batching when issuing reads, to allow for read combining. The larger
your batch, the more reads you can combine -- the more efficient your
I/O, etc. But the larger your batch, the less locality you get in
memory.

You always have to choose a batch size large enough to hide I/O
latency, plus allow, I guess, for read combining. I suspect that will
blow out your TLB more than letting 8 parallel workers share the same
queue.
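
To put rough (invented, not measured) numbers on that: by Little's
law, the I/O depth you need is about throughput times latency. Keeping
a device busy at, say, 50,000 IOPS with 200 us (0.0002 s) completion
latency takes 50,000 * 0.0002 = 10 reads in flight; if read combining
merges, say, 8 contiguous pages per read, that's already ~80 pages
(~640 kB of 8 kB pages) queued at any moment -- well past what a
typical first-level TLB covers, whether one worker owns the queue or
eight share it.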

Not to mention the complexity (as Thomas has described very nicely in
this thread) of trying to partition+affinitize async read requests to
individual parallel workers. (Consider "ramp-down" for a moment: the
"problem" here is just that one parallel worker issued a batch of
async reads, near the end of the query; and since that worker is
affinitized to those reads, all other workers pack up and go home,
leaving a single worker to process this last batch. If, instead, we
just used a single queue, then there would be no need for "ramp-down"
logic, because async reads would go into a single queue/pool, and not
be affinitized to a single, "unlucky" worker.)
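
In terms of the sketch above: every worker just runs the same loop
against the shared queue, and end-of-scan falls out for free (again,
MAX_BATCH, queue, and process_page are made-up names):

    BlockNumber batch[MAX_BATCH];

    for (;;)
    {
        int         n = dequeue_batch(queue, batch, MAX_BATCH);

        if (n == 0)
            break;              /* queue drained: all workers finish */

        for (int i = 0; i < n; i++)
            process_page(batch[i]);     /* per-page work, elided */
    }

No worker is left holding an affinitized batch, so there is nothing to
ramp down.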

> > It has never been clear to me why prefetching the exact blocks you'll
> > later consume is seen as a *benefit*, rather than a *cost*. I'm not
> > aware of any prefetch interface, other than PG's "ReadStream," that
> > insists on this. But that's a separate discussion...
>
> ...
>
> As I said above, that's not to say that we'll only ever want to do readahead
> via the read stream interface.

Well, that's my point: since, I believe, we'll ultimately want a
"heuristic" prefetch, which will be incompatible with the new read
stream interface... we'll end up writing and supporting two different
prefetch interfaces.

It has never been clear to me that the advantages of having this
second, read-stream, prefetch interface outweigh the costs of having
to write and maintain two separate interfaces, to do pretty much the
same thing. If we *didn't* need the "heuristic" interface, then I
could be convinced that the "read-stream" interface was a good choice.
But since we'll (eventually) need the "heuristic" interface anyway,
it's not clear to me that the benefits justify the cost of
implementing and maintaining this "read-stream" interface as well.

Thanks,
James
