Quick Links

Re: BitmapHeapScan streaming read user and prelim refactoring

From:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc:	Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject:	Re: BitmapHeapScan streaming read user and prelim refactoring
Date:	2024-03-29 21:39:05
Message-ID:	CA+hUKGKxPEh4oUtz_9fij=DdrJRWejbf1r0qWyqpSxUe2tu3VA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sat, Mar 30, 2024 at 4:53 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> Two observations:
>
> * The combine limit seems to have negligible impact. There's no visible
> difference between combine_limit=8kB and 128kB.
>
> * Parallel queries seem to work about the same as master (especially for
> optimal cases, but even for not optimal ones).
>
>
> The optimal plans with kernel readahead (two charts in the first row)
> look fairly good. There are a couple regressed cases, but a bunch of
> faster ones too.

Thanks for doing this!

> The optimal plans without kernel read ahead (two charts in the second
> row) perform pretty poorly - there are massive regressions. But I think
> the obvious reason is that the streaming read API skips prefetches for
> sequential access patterns, relying on kernel to do the readahead. But
> if the kernel readahead is disabled for the device, that obviously can't
> happen ...

Right, it does seem that this whole concept is sensitive on the
'borderline' between sequential and random, and this patch changes
that a bit and we lose some. It's becoming much clearer to me that
master is already exposing weird kinks, and the streaming version is
mostly better, certainly on low IOPS systems. I suspect that there
must be queries in the wild that would run much faster with eic=0 than
eic=1 today due to that, and while the streaming version also loses in
some cases, it seems that it mostly loses because of not triggering
RA, which can at least be improved by increasing the RA window. On
the flip side, master is more prone to running out of IOPS and there
is no way to tune your way out of that.

> I think the question is how much we can (want to) rely on the readahead
> to be done by the kernel. ...

We already rely on it everywhere, for basic things like sequential scan.

> ... Maybe there should be some flag to force
> issuing fadvise even for sequential patterns, perhaps at the tablespace
> level? ...

Yeah, I've wondered about trying harder to "second guess" the Linux
RA. At the moment, read_stream.c detects *exactly* sequential reads
(see seq_blocknum) to suppress advice, but if we knew/guessed the RA
window size, we could (1) detect it with the same window that Linux
will use to detect it, and (2) [new realisation from yesterday's
testing] we could even "tickle" it to wake it up in certain cases
where it otherwise wouldn't, by temporarily using a smaller
io_combine_limit if certain patterns come along. I think that sounds
like madness (I suspect that any place where the latter would help is
a place where you could turn RA up a bit higher for the same effect
without weird kludges), or another way to put it would be to call it
"overfitting" to the pre-existing quirks; but maybe it's a future
research idea...

> I don't recall seeing a system with disabled readahead, but I'm
> sure there are cases where it may not really work - it clearly can't
> work with direct I/O, ...

Right, for direct I/O everything is slow right now including seq scan.
We need to start asynchronous reads in the background (imagine
literally just a bunch of background "I/O workers" running preadv() on
your behalf to get your future buffers ready for you, or equivalently
Linux io_uring). That's the real goal of this project: restructuring
so we have the information we need to do that, ie teach every part of
PostgreSQL to predict the future in a standard and centralised way.
Should work out better than RA heuristics, because we're not just
driving in a straight line, we can turn corners too.

> ... but I've also not been very successful with
> prefetching on ZFS.

posix_favise() did not do anything in OpenZFS before 2.2, maybe you
have an older version?

> I certainly admit the data sets are synthetic and perhaps adversarial.
> My intent was to cover a wide range of data sets, to trigger even less
> common cases. It's certainly up to debate how serious the regressions on
> those data sets are in practice, I'm not suggesting "this strange data
> set makes it slower than master, so we can't commit this".

Right, yeah. Thanks! Your initial results seemed discouraging, but
looking closer I'm starting to feel a lot more positive about
streaming BHS.

In response to

Re: BitmapHeapScan streaming read user and prelim refactoring at 2024-03-29 15:52:58 from Tomas Vondra

Responses

Re: BitmapHeapScan streaming read user and prelim refactoring at 2024-03-29 22:03:20 from Thomas Munro
Re: BitmapHeapScan streaming read user and prelim refactoring at 2024-03-29 23:40:01 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Mitar	2024-03-29 21:50:42	Re: Allowing DESC for a PRIMARY KEY column
Previous Message	Jacob Champion	2024-03-29 21:21:07	Re: WIP Incremental JSON Parser