From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2025-04-11 03:14:50
Message-ID: CA+hUKGKi8WG1HEZAQBC8PJrmfaf+mLug3PN3ytqxKYm5ghEwCA@mail.gmail.com
Lists: pgsql-hackers
On Fri, Apr 11, 2025 at 5:50 AM James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> wrote:
> I am looking at the pre-streaming code, in PG 17, as I am not familiar
> with the PG 18 "streaming" code. Back in PG 17, nodeBitmapHeapscan.c
> maintained two shared TBM iterators, for PQ. One of the iterators was
> the actual, "fetch" iterator; the other was the "prefetch" iterator,
> which kept some distance ahead of the "fetch" iterator (to hide read
> latency).
We're talking at cross-purposes.
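For reference, the PG 17 scheme you're describing boils down to something like this (heavily simplified, with made-up variable names, not the real nodeBitmapHeapscan.c):

    TBMIterateResult *tbmres;
    int               prefetch_pages = 0;
    Buffer            buffer;

    while ((tbmres = tbm_shared_iterate(fetch_iterator)) != NULL)
    {
        /* Keep the second iterator up to prefetch_target blocks ahead. */
        while (prefetch_pages < prefetch_target)
        {
            TBMIterateResult *p = tbm_shared_iterate(prefetch_iterator);

            if (p == NULL)
                break;
            PrefetchBuffer(rel, MAIN_FORKNUM, p->blockno);  /* hint only */
            prefetch_pages++;
        }

        buffer = ReadBuffer(rel, tbmres->blockno);          /* real, synchronous read */
        if (prefetch_pages > 0)
            prefetch_pages--;
        /* ... process the tuples on this page ... */
    }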
The new streaming BHS isn't just issuing probabilistic hints about
future access obtained from a second iterator. It has just one shared
iterator connected up to the workers' ReadStreams. Each worker pulls
a disjoint set of blocks out of its stream, possibly running a bunch
of IOs in the background as required. The stream replaces the old
ReadBuffer() call, and the old PrefetchBuffer() call and a bunch of
dubious iterator synchronisation logic are deleted. These are now
real IOs running in the background, for the *exact* blocks you will
consume; posix_fadvise() was just a stepping stone towards AIO that
tolerated sloppy synchronisation, including being entirely wrong.
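To make that concrete, the shape of the new arrangement is roughly this (a simplified sketch, not the committed code; the SharedScanState type and bhs_next_block name are invented for illustration):

    /* Block number callback: feeds one worker's stream from the single
     * shared iterator, so each block goes to exactly one worker. */
    static BlockNumber
    bhs_next_block(ReadStream *stream, void *callback_private_data,
                   void *per_buffer_data)
    {
        SharedScanState  *scan = (SharedScanState *) callback_private_data;
        TBMIterateResult *tbmres = tbm_shared_iterate(scan->shared_iterator);

        return tbmres ? tbmres->blockno : InvalidBlockNumber;
    }

    /* Per worker, at startup: */
    stream = read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, rel,
                                        MAIN_FORKNUM, bhs_next_block,
                                        scan, 0);

    /* Per worker, in the scan loop (this is what replaces ReadBuffer()): */
    while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... process exactly the blocks this worker pulled ... */
        ReleaseBuffer(buffer);
    }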
If you additionally teach the iterator to work in batches, as my 0001
patch (which I didn't propose for v18) showed, then one worker might
end up processing (say) 10 blocks at end-of-scan while all the other
workers have finished the node, and maybe the whole query. That'd be
unfair. "Ramp-down" ... 8, 4, 2, 1 has been used in one or two other
places in parallel-aware nodes with internal batching as a kind of
fudge to help them finish CPU work around the same time if you're
lucky, and my 0002 patch shows that NOT working here.
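By "ramp-down" I mean arithmetic of roughly this shape, just to show the idea (illustrative only, not what the 0002 patch literally does):

    /* Shrink the batch a worker claims as the scan nears the end:
     * 8, 4, 2, 1, so nobody is left holding a big final batch. */
    static int
    next_batch_size(BlockNumber blocks_remaining, int max_batch)
    {
        int batch = max_batch;          /* e.g. 8 */

        while (batch > 1 && (BlockNumber) (batch * 2) > blocks_remaining)
            batch /= 2;                 /* 8 -> 4 -> 2 -> 1 */

        return batch;
    }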
I suspect the concept itself is defunct: it no longer narrows the CPU
work completion time range across workers at all well, due to the
elastic streams sitting in between. Any naive solution that requires
cooperation/waiting for another worker to hand over final scraps of
work originally allocated to it (and I don't mean the IO completion
part, that all works just fine as you say; a lot of engineering went
into the buffer manager to make that true, for AIO but also in the
preceding decades... what I mean here is: how do you even know which
block to read?) is probably a deadlock risk. Essays have been written
on the topic if you are interested.
All the rest of our conversation makes no sense without that context :-)
> > I admit this all sounds kinda complicated and maybe there is a much
> > simpler way to achieve the twin goals of maximising I/O combining AND
> > parallel query fairness.
>
> I tend to think that the two goals are so much in conflict, that it's
> not worth trying to apply cleverness to get them to agree on things...
I don't give up so easily :-)