Re: BitmapHeapScan streaming read user and prelim refactoring

From: Andres Freund <andres(at)anarazel(dot)de>
To: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2025-04-14 17:20:26
Message-ID: rmiw7n4ijedmphzcs6zwm6qoyhnifkj2ivmfob4xbh77iuehyw@rcp3eiafgt7q
Lists: pgsql-hackers

Hi,

On 2025-04-14 09:58:19 -0700, James Hunter wrote:
> Of course, late feedback is a burden, but I think our discussion here
> (and, in the future, if you try to "ReadStream" BTree index pages,
> themselves) illustrates my point.

FWIW, it's quite conceivable that we'll want to use non-readstream prefetching
for some cases where "heuristic" prefetching is easier. I think prefetching of
btree pages is one of those cases.

The read stream based API provides a single point for adjusting the readahead
logic, implementing read combining etc., without the user of the read stream
needing to know about any of that.
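
For illustration, a consumer of the read stream API looks roughly like this
(a minimal sketch; ScanState and scan_next_block are made up for the example,
the read_stream_* calls are the actual API):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/read_stream.h"
#include "utils/rel.h"

typedef struct ScanState
{
    BlockNumber next;       /* next block we want */
    BlockNumber nblocks;    /* end of the range to scan */
} ScanState;

/* The stream pulls block numbers from this callback as it builds IOs. */
static BlockNumber
scan_next_block(ReadStream *stream, void *callback_private_data,
                void *per_buffer_data)
{
    ScanState  *s = (ScanState *) callback_private_data;

    if (s->next >= s->nblocks)
        return InvalidBlockNumber;  /* end of stream */
    return s->next++;
}

static void
scan_relation(Relation rel, BlockNumber nblocks)
{
    ScanState   state = {0, nblocks};
    ReadStream *stream;
    Buffer      buf;

    stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL, NULL,
                                        rel, MAIN_FORKNUM,
                                        scan_next_block, &state, 0);

    /* Combining, readahead distance and AIO all happen inside the stream. */
    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... process the buffer ... */
        ReleaseBuffer(buf);
    }
    read_stream_end(stream);
}

The caller just pulls buffers; how far ahead the stream reads, and in what
size IOs, is entirely the stream's business.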

> I see two orthogonal problems, in processing Bitmap Heap pages in
> parallel: (1) we need to prefetch enough pages, far enough in advance,
> to hide read latency; (2) later, every parallel worker needs to be
> given a set of pages to process, in a way that minimizes contention.
>
> The easiest way to hand out work to parallel workers (and often the
> best) is to maintain a single, shared, global work queue. Just put
> whatever pages you prefetch into a FIFO queue, and let each worker
> pull one piece of "work" off that queue. In this way, there's no
> "ramp-down" problem.

If you just issue prefetch requests separately you'll get no read combining -
and it turns out that that is a rather significant loss, both at the storage
layer and just due to the syscall overhead. So you do need to perform batching
when issuing IO, which in turn requires a bit of ramp-up logic etc. (see the
sketch below).
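
To make that concrete, here is a toy sketch of what batched issuance buys
you (all names are hypothetical; this is not the read stream's actual code):

#include <stdint.h>

typedef uint32_t BlockNumber;

#define MAX_COMBINE   16    /* cf. io_combine_limit */
#define MAX_DISTANCE 256

/* Hypothetical AIO submission: one request covering 'len' blocks. */
extern void start_aio_read(int fd, BlockNumber start, int len);

static int  distance = 1;   /* how far ahead we currently look */

static void
issue_batch(int fd, const BlockNumber *blocks, int n)
{
    int         i = 0;

    while (i < n)
    {
        int         len = 1;

        /* read combining: extend the run while blocks are consecutive */
        while (i + len < n && len < MAX_COMBINE &&
               blocks[i + len] == blocks[i] + len)
            len++;

        start_aio_read(fd, blocks[i], len); /* one IO, not 'len' IOs */
        i += len;
    }

    /*
     * Ramp-up: grow the distance while the stream keeps being drained, so
     * a scan that stops after a few rows never pays for a deep IO queue.
     */
    if (distance < MAX_DISTANCE)
        distance *= 2;
}

Issue the same blocks one posix_fadvise() at a time and you get n syscalls
and n device-level IOs instead of a handful.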

> If you find that contention on this shared queue becomes a bottleneck,
> then you just pull *n* pieces of work, in a batch. Then the
> "ramp-down" problem is limited to "n", instead of just 1. Often, one
> can find a suitable value of n that simultaneously makes contention
> effectively zero, while avoiding "ramp-down" problems; say n = 10.
>
> So much for popping from the shared queue. Pushing to the shared queue
> is also easy, because you have async reads. Anytime a worker pops a
> (batch of) work item(s) off the shared queue, it checks to see if the
> queue is still large enough. If not, it issues the appropriate
> prefetch / "ReadStream" calls.
>
> A single shared queue is easiest, but sometimes there's no way to
> prevent it from becoming a bottleneck. In that case, one typically
> partitions the input at startup, gives each worker its own partition,
> and waits for all workers to complete. In this, second, model, workers
> are entirely independent, so there is no contention: we scale out
> perfectly. The problem, as you've pointed out, is that one worker
> might finish its own work long before another; and then the worker
> that finished its work is idle and therefore wasted.
>
> This is why a single shared queue is so nice, because it avoids
> workers being idle. But I am confused by your proposal, which seems to
> be trying to get the behavior of a single shared queue, but
> implemented with the added complexity of multiple queues.
>
> Why not just use a single queue?

Accessing buffers in a maximally interleaved way, which is what a single queue
would give you, adds a good bit of overhead when you have a lot of memory,
because e.g. TLB hit rate is minimized.
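
One way to recover most of the single queue's benefits is exactly the
batching James describes above: workers claim a chunk of consecutive work
items rather than one item at a time. A sketch (names hypothetical, and the
actual parallel scan machinery is more involved):

#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

#define CHUNK_SIZE 10           /* the "n" above; 1 = maximal interleaving */

typedef struct SharedScan
{
    _Atomic BlockNumber next;   /* shared cursor */
    BlockNumber nblocks;
} SharedScan;

/* Returns the number of blocks claimed; *start is the first of the run. */
static int
claim_chunk(SharedScan *scan, BlockNumber *start)
{
    BlockNumber b = atomic_fetch_add(&scan->next, CHUNK_SIZE);

    if (b >= scan->nblocks)
        return 0;               /* nothing left; worker winds down */
    *start = b;
    if (b + CHUNK_SIZE > scan->nblocks)
        return (int) (scan->nblocks - b);   /* partial final chunk */
    return CHUNK_SIZE;
}

Bigger chunks mean less contention and less interleaving, but a longer idle
tail at the end of the scan - the same trade-off, just tunable.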

> > These are now
> > real IOs running in the background and for the *exact* blocks you will
> > consume; posix_fadvise() was just a stepping stone towards AIO that
> > tolerated sloppy synchronisation including being entirely wrong.
>
> It has never been clear to me why prefetching the exact blocks you'll
> later consume is seen as a *benefit*, rather than a *cost*. I'm not
> aware of any prefetch interface, other than PG's "ReadStream," that
> insists on this. But that's a separate discussion...

It (randomly ordered):

a) reduces wasted IO

b) makes things like IO combining a lot easier

c) makes it a lot easier to adaptively control readahead distance, because you
   have information about
   1) the # of in-flight IOs
   2) the # of completed IOs
   3) the current position
   4) the current readahead distance
   (see the sketch after this list)

d) provides *per-stream* control over readahead. You want to be only as
aggressive about prefetching as you need to be, which very well can differ
between different parts of a query tree.

e) provides one central point to implement readahead logic
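
For (c), the control loop can be as simple as this sketch (the heuristic
shown is illustrative, not read_stream.c's actual policy):

#include <stdbool.h>

typedef struct StreamState
{
    int         ios_in_flight;  /* c.1: issued but not yet completed */
    int         ios_completed;  /* c.2: completed since last adjustment */
    int         position;       /* c.3: consumer's current position */
    int         distance;       /* c.4: current readahead distance */
    int         max_distance;
} StreamState;

static void
adjust_distance(StreamState *s, bool last_read_was_cached)
{
    if (last_read_was_cached)
    {
        /* Everything is in the buffer pool: no point looking far ahead. */
        if (s->distance > 1)
            s->distance--;
    }
    else if (s->distance < s->max_distance)
    {
        /* Real IO is happening: look further ahead to keep it in flight. */
        s->distance *= 2;
    }
}

Without knowing the exact blocks in advance, none of that state exists, and
the distance can only be guessed at.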

As I said above, that's not to say that we'll only ever want to do readahead
via the read stream interface.

Greetings,

Andres Freund
