Re: index prefetching

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-02-14 23:43:02
Message-ID: 20240214234302.hk5e3otfssasxnac@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2024-02-14 16:45:57 -0500, Melanie Plageman wrote:
> > > The LIMIT problem is not very clear to me either. Yes, if we get close
> > > to the end of the leaf page, we may need to visit the next leaf page.
> > > But that's kinda the whole point of prefetching - reading stuff ahead,
> > > and reading too far ahead is an inherent risk. Isn't that a problem we
> > > have even without LIMIT? The prefetch distance ramp up is meant to limit
> > > the impact.
> >
> > Right now, the index AM doesn't know anything about LIMIT at all. That
> > doesn't matter, since the index AM can only read/scan one full leaf
> > page before returning control back to the executor proper. The
> > executor proper can just shut down the whole index scan upon finding
> > that we've already returned N tuples for a LIMIT N.
> >
> > We don't do prefetching right now, but we also don't risk reading a
> > leaf page that'll just never be needed. Those two things are in
> > tension, but I don't think that that's quite the same thing as the
> > usual standard prefetching tension/problem. Here there is uncertainty
> > about whether what we're prefetching will *ever* be required -- not
> > uncertainty about when exactly it'll be required. (Perhaps this
> > distinction doesn't mean much to you. I'm just telling you how I think
> > about it, in case it helps move the discussion forward.)
>
> I don't think that the LIMIT problem is too different for index scans
> than heap scans. We will need some advice to come down from the planner
> to prevent over-eager prefetching in all cases.

I'm not sure that's really true. I think the more common and more
problematic case for partially executing a sub-tree of a query is nested
loops (worse because it happens many times within a query). Particularly for
anti-joins, prefetching too aggressively could lead to significant IO
amplification.

At the same time it's IMO more important to ramp up prefetching distance
fairly aggressively for index scans than it is for sequential scans. For
sequential scans it's quite likely that either the whole scan takes quite a
while (thus slowly ramping doesn't affect overall time that much) or that the
data is cached anyway because the tables are small and frequently used (in
which case we don't need to ramp up). And even if smaller tables aren't
cached, the IOs are cheaper because they're sequential rather than random.
Contrast that with index scans, where it's much more likely that you have
cache misses in queries that do an overall fairly small number of IOs, and
where that IO is largely random.
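To make the ramp-up idea concrete, here's a minimal sketch in C. The
function name, the doubling policy, and the cap are all assumptions for
illustration, not anything from the PostgreSQL tree:

```c
#include <assert.h>

/*
 * Hypothetical ramp-up policy: start with a single speculative IO and
 * double the prefetch distance each time, capped at a maximum. A short
 * scan thus wastes at most a handful of IOs, while a long scan quickly
 * reaches full depth.
 */
static int
ramp_prefetch_distance(int current, int max_distance)
{
	if (current < 1)
		return 1;				/* begin with one readahead IO */
	if (current * 2 > max_distance)
		return max_distance;	/* clamp at the configured maximum */
	return current * 2;
}
```

A LIMIT or anti-join would then only ever pay for the small initial
distances, which is the point of ramping in the first place.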

I think we'll need some awareness at ExecInitNode() time about how the results
of the nodes are used. I see a few "classes":

1) All rows are needed, because the node is below an Agg, Hash, Materialize,
Sort, .... Can be determined purely by the plan shape.

2) All rows are needed, because the node is completely consumed by the
top-level (i.e. no limit, anti-joins or such in between) and the top-level
wants to run the whole query. Unfortunately I don't think we know this at
plan time at the moment (it's just determined by what's passed to
ExecutorRun()).

3) Some rows are needed, but it's hard to know the precise number. E.g. because
of a LIMIT further up.

4) Only a single row is going to be needed, albeit possibly after filtering on
the node level. E.g. the anti-join case.
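The four classes above could be represented roughly like this; the enum,
its names, and the distance-selection policy are all hypothetical, just to
make the classification concrete:

```c
#include <assert.h>

/* Hypothetical classification mirroring cases 1) through 4) above. */
typedef enum NodeConsumption
{
	CONSUME_ALL_PLAN,			/* 1) all rows needed, visible from plan shape */
	CONSUME_ALL_CALLER,			/* 2) all rows needed, known only to the caller */
	CONSUME_SOME,				/* 3) bounded but unknown count, e.g. LIMIT above */
	CONSUME_ONE					/* 4) a single row, e.g. anti-join inner side */
} NodeConsumption;

/* A node could pick its starting prefetch distance from the class. */
static int
initial_prefetch_distance(NodeConsumption c, int max_distance)
{
	switch (c)
	{
		case CONSUME_ALL_PLAN:
		case CONSUME_ALL_CALLER:
			return max_distance;	/* safe to be aggressive immediately */
		case CONSUME_SOME:
			return 1;				/* ramp up from a conservative start */
		case CONSUME_ONE:
			return 0;				/* no speculative readahead at all */
	}
	return 0;
}
```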

There are different times at which we could determine how each node is
consumed:

a) Determine node consumption "class" purely within ExecInit*, via different
eflags.

   Today that couldn't deal with 2), but I think it'd not be too hard to
   modify callers that consume query results completely to tell
   ExecutorStart() that, not just ExecutorRun().
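   A sketch of what that might look like; EXEC_FLAG_WHOLE_RESULT is a
   hypothetical new flag (the value mimics the existing EXEC_FLAG_* bitmask
   convention but is not from executor.h):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: caller promises to consume the entire result set. */
#define EXEC_FLAG_WHOLE_RESULT	0x0400

/*
 * A node's ExecInit* function could consult the flag to decide whether
 * it may prefetch at full depth from the start.
 */
static bool
caller_consumes_whole_result(int eflags)
{
	return (eflags & EXEC_FLAG_WHOLE_RESULT) != 0;
}
```

   A caller that runs the query to completion would then pass
   `eflags | EXEC_FLAG_WHOLE_RESULT` to ExecutorStart().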

A disadvantage would be that this prevents us from taking IO depth into
account during costing. There very well might be plans that are cheaper
than others because the plan shape allows more concurrent IO.

b) Determine node consumption class at plan time.

This also couldn't deal with 2), but fixing that probably would be harder,
because we'll often not know at plan time how the query will be
executed. And in fact the same plan might be executed multiple ways, in
case of prepared statements.

The obvious advantage is of course that we can influence the choice of
paths.
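   As a toy illustration of why IO depth could matter for costing: if a
   path can keep several IOs in flight, their latencies overlap, so the
   effective wait cost shrinks. This is purely a sketch, not how
   costsize.c works today:

```c
#include <assert.h>

/*
 * Hypothetical cost adjustment: npages random reads that can be issued
 * io_depth at a time overlap their latency, dividing the per-IO wait.
 */
static double
effective_io_cost(double random_page_cost, double npages, int io_depth)
{
	if (io_depth < 1)
		io_depth = 1;			/* no concurrency: pay full cost */
	return random_page_cost * npages / io_depth;
}
```

   Two otherwise-equal paths could then be distinguished by the
   concurrency their shape permits.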

I suspect we'd eventually want a mix of both. Plan time to be able to
influence plan shape, ExecInit* to deal with not knowing how the query will be
consumed at plan time. Which suggests that we could start with whichever is
easier and extend later.

Greetings,

Andres Freund
