From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2025-04-11 03:14:50
Message-ID: CA+hUKGKi8WG1HEZAQBC8PJrmfaf+mLug3PN3ytqxKYm5ghEwCA@mail.gmail.com
Lists: pgsql-hackers
On Fri, Apr 11, 2025 at 5:50 AM James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> wrote:
> I am looking at the pre-streaming code, in PG 17, as I am not familiar
> with the PG 18 "streaming" code. Back in PG 17, nodeBitmapHeapscan.c
> maintained two shared TBM iterators, for PQ. One of the iterators was
> the actual, "fetch" iterator; the other was the "prefetch" iterator,
> which kept some distance ahead of the "fetch" iterator (to hide read
> latency).
We're talking at cross-purposes.
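For reference, the PG 17 scheme you're describing boils down to something like this (heavily simplified, with made-up variable names, not the real nodeBitmapHeapscan.c):

    TBMIterateResult *tbmres;
    int               prefetch_pages = 0;
    Buffer            buffer;

    while ((tbmres = tbm_shared_iterate(fetch_iterator)) != NULL)
    {
        /* Keep the second iterator up to prefetch_target blocks ahead. */
        while (prefetch_pages < prefetch_target)
        {
            TBMIterateResult *p = tbm_shared_iterate(prefetch_iterator);

            if (p == NULL)
                break;
            PrefetchBuffer(rel, MAIN_FORKNUM, p->blockno);  /* hint only */
            prefetch_pages++;
        }

        buffer = ReadBuffer(rel, tbmres->blockno);          /* real, synchronous read */
        if (prefetch_pages > 0)
            prefetch_pages--;
        /* ... process the tuples on this page ... */
    }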
The new streaming BHS isn't just issuing probabilistic hints about
future access obtained from a second iterator. It has just one shared
iterator connected up to the workers' ReadStreams. Each worker pulls
a disjoint set of blocks out of its stream, possibly running a bunch
of IOs in the background as required. The stream replaces the old
ReadBuffer() call, and the old PrefetchBuffer() call and a bunch of
dubious iterator synchronisation logic are deleted. These are now
real IOs running in the background, for the *exact* blocks you will
consume; posix_fadvise() was just a stepping stone towards AIO that
tolerated sloppy synchronisation, including being entirely wrong.
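To make that concrete, the shape of the new arrangement is roughly this (a simplified sketch, not the committed code; the SharedScanState type and bhs_next_block name are invented for illustration):

    /* Block number callback: feeds one worker's stream from the single
     * shared iterator, so each block goes to exactly one worker. */
    static BlockNumber
    bhs_next_block(ReadStream *stream, void *callback_private_data,
                   void *per_buffer_data)
    {
        SharedScanState  *scan = (SharedScanState *) callback_private_data;
        TBMIterateResult *tbmres = tbm_shared_iterate(scan->shared_iterator);

        return tbmres ? tbmres->blockno : InvalidBlockNumber;
    }

    /* Per worker, at startup: */
    stream = read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, rel,
                                        MAIN_FORKNUM, bhs_next_block,
                                        scan, 0);

    /* Per worker, in the scan loop (this is what replaces ReadBuffer()): */
    while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... process exactly the blocks this worker pulled ... */
        ReleaseBuffer(buffer);
    }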
If you additionally teach the iterator to work in batches, as my 0001
patch (which I didn't propose for v18) showed, then one worker might
end up processing (say) 10 blocks at end-of-scan while all the other
workers have finished the node, and maybe the whole query. That'd be
unfair. "Ramp-down" ... 8, 4, 2, 1 has been used in one or two other
places in parallel-aware nodes with internal batching as a kind of
fudge to help them finish CPU work around the same time if you're
lucky, and my 0002 patch shows that NOT working here.
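By "ramp-down" I mean arithmetic of roughly this shape, just to show the idea (illustrative only, not what the 0002 patch literally does):

    /* Shrink the batch a worker claims as the scan nears the end:
     * 8, 4, 2, 1, so nobody is left holding a big final batch. */
    static int
    next_batch_size(BlockNumber blocks_remaining, int max_batch)
    {
        int batch = max_batch;          /* e.g. 8 */

        while (batch > 1 && (BlockNumber) (batch * 2) > blocks_remaining)
            batch /= 2;                 /* 8 -> 4 -> 2 -> 1 */

        return batch;
    }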
I suspect the concept itself is defunct: it no longer narrows the CPU
work completion time range across workers at all well, due to the
elastic streams sitting in between. Any naive solution that requires
cooperation/waiting for another worker to hand over final scraps of
work originally allocated to it (and I don't mean the IO completion
part, that all works just fine as you say; a lot of engineering went
into the buffer manager to make that true, for AIO but also in the
preceding decades... what I mean here is: how do you even know which
block to read?) is probably a deadlock risk. Essays have been written
on the topic if you are interested.
All the rest of our conversation makes no sense without that context :-)
> > I admit this all sounds kinda complicated and maybe there is a much
> > simpler way to achieve the twin goals of maximising I/O combining AND
> > parallel query fairness.
>
> I tend to think that the two goals are so much in conflict, that it's
> not worth trying to apply cleverness to get them to agree on things...
I don't give up so easily :-)