Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-02-15 01:37:54
Message-ID: CAH2-Wz=8kzZCyJ46X9rA-Un1hvjdHj5gDgmOfZ5cRQBvRarJYg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 14, 2024 at 7:28 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote:
> > This property of index scans is fundamental to how index scans work.
> > Pinning an index page as an interlock against concurrently TID
> > recycling by VACUUM is directly described by the index API docs [1],
> > even (the docs actually use terms like "buffer pin" rather than
> > something more abstract sounding). I don't think that anything
> > affecting that behavior should be considered an implementation detail
> > of the nbtree index AM as such (nor any particular index AM).
>
> Given that the interlock is only needed for non-mvcc scans, that non-mvcc
> scans are rare due to catalog accesses using snapshots these days and that
> most non-mvcc scans do single-tuple lookups, it might be viable to be more
> restrictive about prefetching iff non-mvcc snapshots are in use and to use a
> method of cleanup that allows multiple pages to be cleaned up otherwise.

I agree, but don't think that it matters all that much.

If you have an MVCC snapshot, that doesn't mean that TID-recycling
safety problems automatically go away. It only means that you have one
known and supported alternative approach to dealing with such
problems. You don't get that for free just by using an MVCC snapshot,
though -- it has downsides. Downsides such as the current
_bt_killitems() behavior with a concurrently modified leaf page
(modified while we didn't hold a leaf page pin): it simply gives up on
setting any LP_DEAD bits once it notices that the leaf page's LSN
changed. (Plus there are implementation restrictions that I won't
repeat again now.)
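The give-up rule amounts to a simple check. Here's a minimal sketch of
the idea (illustrative only, not the actual nbtree code -- ScanPos and
safe_to_kill_items are made-up names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Simplified scan-position state: what we remembered about the leaf page */
typedef struct ScanPos
{
    XLogRecPtr  lsn;        /* page LSN observed while we read the page */
    bool        pinned;     /* do we still hold a pin on the leaf page? */
} ScanPos;

/*
 * Decide whether it is safe to set LP_DEAD hint bits on the leaf page.
 * With the pin held throughout, the page cannot have been concurrently
 * modified.  Without the pin, the only cheap check is an LSN comparison,
 * and on any mismatch we give up on setting hint bits entirely -- even
 * when most of the page is actually unchanged.
 */
static bool
safe_to_kill_items(const ScanPos *pos, XLogRecPtr current_page_lsn)
{
    if (pos->pinned)
        return true;
    return pos->lsn == current_page_lsn;
}
```

The coarseness is the point: any modification at all (even one unrelated
to our scan's TIDs) bumps the LSN and forfeits the LP_DEAD setting.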

When I refer to the buffer pin interlock, I'm mostly referring to the
general need for something like it in the context of index scans,
principally in order to make kill_prior_tuple continue to work in
something more or less like its current form.

> However, I don't think we would necessarily have to relax the IAM pinning
> rules, just to be able to do prefetching of more than one index leaf
> page.

To be clear, we already do relax the IAM pinning rules. Or at least
nbtree selectively opts out, as I've gone into already.

> Restricting prefetching to entries within a single leaf page obviously
> has the disadvantage of not being able to benefit from concurrent IO whenever
> crossing a leaf page boundary, but at the same time processing entries from
> just two leaf pages would often allow for a sufficiently aggressive
> prefetching. Pinning a small number of leaf pages instead of a single leaf
> page shouldn't be a problem.

You're probably right. I just don't see any need to solve that problem in v1.

> One argument for loosening the tight coupling between kill_prior_tuples and
> index scan progress is that the lack of kill_prior_tuples for bitmap scans is
> quite problematic. I've seen numerous production issues with bitmap scans
> caused by subsequent scans processing a growing set of dead tuples, where
> plain index scans were substantially slower initially but didn't get much
> slower over time.

I've seen production issues like that too. No doubt it's a problem.

> We might be able to design a system where the bitmap
> contains a certain number of back-references to the index, allowing later
> cleanup if there weren't any page splits or such.

That does seem possible, but do you really want a design for index
prefetching that relies on that massive enhancement (a total redesign
of kill_prior_tuple) happening at some point in the not-too-distant
future? Seems risky, from a project management point of view.

This back-references idea seems rather complicated, especially if it
needs to work with very large bitmap index scans. Since you'll still
have the basic problem of TID-recycling safety to deal with (even with
an MVCC snapshot), you don't just have to revisit the leaf pages. You
also have to revisit the corresponding heap pages (generally they'll
be a lot more numerous than leaf pages). You'll have traded one
problem for another (which is not to say that it's not a good
trade-off).

Right now the executor uses an amgettuple interface, and knows nothing
about index-related costs (e.g., pages accessed in any index, index
qual costs), while the index AM has some limited understanding of heap
access costs. So the index AM kinda knows a little about both types
of costs (possibly not enough, but something). That informs the
language I'm using to describe all this.

To do something like your "back-references to the index" thing well, I
think that you need more dynamic behavior around when you visit the
heap to get heap tuples pointed to by TIDs from index pages (i.e.
dynamic behavior that determines how many leaf pages to read ahead
before going to the heap for the pointed-to tuples). That is basically
what I meant by "put the index AM in control" -- it doesn't *strictly*
require that the index AM actually do that. Just that a single piece
of code has to have access to the full context, in order to make the
right trade-offs around how both index and heap accesses are
scheduled.
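To make "a single piece of code with the full context" concrete, here's
a rough sketch of the scheduling shape I have in mind (purely
illustrative C -- TidBatch, fill_and_prefetch and the callback types
are hypothetical, not anything in the tree): one component pulls TIDs
from however many leaf pages the prefetch distance demands, then issues
heap prefetches for the whole batch before the executor consumes it.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative types -- not PostgreSQL's actual structs */
typedef struct { unsigned block; unsigned offset; } TID;

#define MAX_BATCH 64

typedef struct
{
    TID     tids[MAX_BATCH];
    size_t  ntids;
    size_t  next;           /* next TID to hand to the executor */
} TidBatch;

/* Hypothetical callbacks standing in for index AM and heap access */
typedef size_t (*read_leaf_fn)(TID *out, size_t max);     /* TIDs from leaf pages */
typedef void   (*prefetch_fn)(const TID *tids, size_t n); /* issue heap prefetches */

/*
 * One scheduling step: fill the batch from as many leaf pages as the
 * distance target requires (possibly crossing leaf page boundaries),
 * then kick off heap prefetches for the whole batch up front.  The key
 * point is that this one function sees both the index side and the heap
 * side, so it can trade them off against each other.
 */
static void
fill_and_prefetch(TidBatch *batch, read_leaf_fn read_leaf,
                  prefetch_fn prefetch, size_t distance)
{
    batch->ntids = 0;
    batch->next = 0;
    while (batch->ntids < distance && batch->ntids < MAX_BATCH)
    {
        size_t  got = read_leaf(batch->tids + batch->ntids,
                                MAX_BATCH - batch->ntids);

        if (got == 0)
            break;              /* ran out of index entries */
        batch->ntids += got;
    }
    prefetch(batch->tids, batch->ntids);
}
```

Whether this lives in the index AM or above it matters less than the
fact that the same code decides when to stop reading leaf pages and
when to start touching the heap.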

> > I think that it makes sense to put the index AM in control here --
> > that almost follows from what I said about the index AM API. The index
> > AM already needs to be in control, in about the same way, to deal with
> > kill_prior_tuple (plus it helps with the LIMIT issue I described).
>
> Depending on what "control" means I'm doubtful:
>
> Imo there are decisions influencing prefetching that an index AM shouldn't
> need to know about directly, e.g. how the plan shape influences how many
> tuples are actually going to be consumed. Of course that determination could
> be made in planner/executor and handed to IAMs, for the IAM to then "control"
> the prefetching.

I agree with all this.

--
Peter Geoghegan
