From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: index prefetching
Date: 2024-02-14 16:40:18
Message-ID: CAAKRu_bawDLwtFrMHpVAfBbyw0tFU1kEkPv+R3yQpeS5T+niXQ@mail.gmail.com
Lists: pgsql-hackers
On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 2/7/24 22:48, Melanie Plageman wrote:
> > ...
> > Issues
> > ---
> > - kill prior tuple
> >
> > This optimization doesn't work with index prefetching with the current
> > design. Kill prior tuple relies on alternating between fetching a
> > single index tuple and visiting the heap. After visiting the heap we
> > can potentially kill the immediately preceding index tuple. Once we
> > fetch multiple index tuples, enqueue their TIDs, and later visit the
> > heap, the next index page we visit may not contain all of the index
> > tuples deemed killable by our visit to the heap.
> >
>
> I admit I haven't thought about kill_prior_tuple until you pointed it out.
> Yeah, prefetching separates (de-synchronizes) the two scans (index and
> heap) in a way that prevents this optimization. Or at least makes it
> much more complex :-(
>
> > In our case, we could try and fix this by prefetching only heap blocks
> > referred to by index tuples on the same index page. Or we could try
> > and keep a pool of index pages pinned and go back and kill index
> > tuples on those pages.
> >
>
> I think restricting the prefetching to a single index page would not be
> a huge issue performance-wise - that's what the initial patch version
> (implemented at the index AM level) did, pretty much. The prefetch queue
> would get drained as we approach the end of the index page, but luckily
> index pages tend to have a lot of entries. But it'd put an upper bound
> on the prefetch distance (much lower than the e_i_c maximum 1000, but
> I'd say common values are 10-100 anyway).
>
> But how would we know we're on the same index page? That knowledge is
> not available outside the index AM - the executor or indexam.c does not
> know this, right? Presumably we could expose this, somehow, but it seems
> like a violation of the abstraction ...
The easiest way to do this would be to have the index AMs' amgettuple()
functions set a new member in the IndexScanDescData -- either the index
page identifier or a boolean that indicates we have moved on to the next
page. Then, when filling the queue, we would stop doing so when the page
switches. This wouldn't really work for the first index tuple on each new
page, though, so perhaps we would need the index AMs to implement some
kind of "peek" functionality.
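Roughly what I'm imagining for the queue-filling loop -- xs_page_boundary,
tid_queue_full(), and tid_queue_push() are names I've made up just for
illustration:

    while (!tid_queue_full(queue))
    {
        /* index_getnext_tid() returns NULL when the scan is exhausted */
        if (index_getnext_tid(scan, direction) == NULL)
            break;

        /*
         * Hypothetical flag the index AM's amgettuple() would set when the
         * tuple it just returned came from a new leaf page. By the time we
         * see it, we have already stepped onto that page, which is the
         * "first index tuple on each new page" problem -- hence the
         * possible need for a "peek".
         */
        if (scan->xs_page_boundary)
            break;

        tid_queue_push(queue, &scan->xs_heaptid);
    }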
Or, we could provide the index AM with a max queue size and allow it to
fill up the queue with the TIDs it wants (which it could keep to the same
index page). And, for the index-only scan case, we could have some kind of
flag which indicates whether the caller is putting TIDs+HeapTuples or
TIDs+IndexTuples on the queue, which might reduce the amount of space we
need. I'm not sure who manages the memory here.
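A sketch of what such a queue item might look like (the member names and
layout here are just for illustration):

    typedef struct TIDQueueItem
    {
        ItemPointerData tid;            /* heap TID from the index AM */
        bool            is_index_tuple; /* which union member is valid */
        union
        {
            IndexTuple  itup;   /* copy for index-only scans */
            HeapTuple   htup;   /* copy when the AM fills xs_hitup */
        }               tuple;
    } TIDQueueItem;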
I wasn't quite sure how we could use
index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's
suggestion. But, I'd like to understand.
> > - switching scan directions
> >
> > If the index scan switches directions on a given invocation of
> > IndexNext(), heap blocks may have already been prefetched and read for
> > blocks containing tuples beyond the point at which we want to switch
> > directions.
> >
> > We could fix this by having some kind of streaming read "reset"
> > callback to drop all of the buffers which have been prefetched which
> > are now no longer needed. We'd have to go backwards from the last TID
> > which was yielded to the caller and figure out which buffers in the
> > pgsr buffer ranges are associated with all of the TIDs which were
> > prefetched after that TID. The TIDs are in the per_buffer_data
> > associated with each buffer in pgsr. The issue would be searching
> > through those efficiently.
> >
>
> Yeah, that's roughly what I envisioned in one of my previous messages
> about this issue - walking back the TIDs read from the index and added
> to the prefetch queue.
>
> > The other issue is that the streaming read API does not currently
> > support backwards scans. So, if we switch to a backwards scan from a
> > forwards scan, we would need to fallback to the non streaming read
> > method. We could do this by just setting the TID queue size to 1
> > (which is what I have currently implemented). Or we could add
> > backwards scan support to the streaming read API.
> >
>
> What do you mean by "support for backwards scans" in the streaming read
> API? I imagined it naively as
>
> 1) drop all requests in the streaming read API queue
>
> 2) walk back all "future" requests in the TID queue
>
> 3) start prefetching as if from scratch
>
> Maybe there's a way to optimize this and reuse some of the work more
> efficiently, but my assumption is that the scan direction does not
> change very often, and that we process many items in between.
Yes, the steps you mention for resetting the queues make sense. What I
meant by "backwards scan is not supported by the streaming read API"
is that Thomas/Andres had mentioned that the streaming read API does
not support backwards scans right now. Though, since the callback just
returns a block number, I don't know how it would break.
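To be concrete, the callback I have in mind does little more than this.
The signature and the TIDQueue/tid_queue_next_to_prefetch() names are made
up -- whatever Thomas' streaming read patch actually defines is what would
apply:

    static BlockNumber
    index_prefetch_next_block(void *pgsr_private, void *per_buffer_data)
    {
        TIDQueue     *queue = (TIDQueue *) pgsr_private;
        TIDQueueItem *item = tid_queue_next_to_prefetch(queue);

        if (item == NULL)
            return InvalidBlockNumber;

        /* stash the TID so we can match it up when the buffer comes back */
        memcpy(per_buffer_data, &item->tid, sizeof(ItemPointerData));

        return ItemPointerGetBlockNumber(&item->tid);
    }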
When switching between a forwards and backwards scan, does it go
backwards from the current position or start at the end (or beginning)
of the relation? If it is the former, then the blocks would most
likely be in shared buffers -- which the streaming read API handles.
It is not obvious to me from looking at the code what the gap is, so
perhaps Thomas could weigh in.
As for handling this in index prefetching, if you think a TID queue
size of 1 is a sufficient fallback method, then resetting the pgsr
queue and resizing the TID queue to 1 would work with no issues. If
the fallback method requires the streaming read code path not be used
at all, then that is more work.
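Something like this is what I mean by resetting and falling back when the
direction changes (all of the names here are placeholders):

    if (direction != queue->last_direction)
    {
        /* forget everything already prefetched past the current position */
        tid_queue_reset(queue);

        /* hypothetical "reset" entry point for the streaming read object */
        pg_streaming_read_reset(pgsr);

        /* degrade to a single-entry queue, i.e. non-prefetching behavior */
        queue->max_items = 1;

        queue->last_direction = direction;
    }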
> > - multiple executions
> >
> > For reasons I don't entirely understand yet, multiple executions (not
> > rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas'
> > patch, I have disabled prefetching (and made the TID queue size 1)
> > when execute_once is false.
> >
>
> Don't work in what sense? What is (not) happening?
I got wrong results for this. I'll have to do more investigation, but
I assumed that not resetting the TID queue and pgsr queue was also the
source of this issue.
What I imagined we would do is figure out if there is a viable
solution for the larger design issues and then investigate what seemed
like smaller issues. But, perhaps I should dig into this first to
ensure there isn't a larger issue.
> > - Index Only Scans need to return IndexTuples
> >
> > Because index only scans return either the IndexTuple pointed to by
> > IndexScanDesc->xs_itup or the HeapTuple pointed to by
> > IndexScanDesc->xs_hitup -- both of which are populated by the index
> > AM, we have to save copies of those IndexTupleData and HeapTupleDatas
> > for every TID whose block we prefetch.
> >
> > This might be okay, but it is a bit sad to have to make copies of those tuples.
> >
> > In this patch, I still haven't figured out the memory management part.
> > I copy over the tuples when enqueuing a TID queue item and then copy
> > them back again when the streaming read API returns the
> > per_buffer_data to us. Something is still not quite right here. I
> > suspect this is part of the reason why some of the other tests are
> > failing.
> >
>
> It's not clear to me why you need to copy the tuples back - shouldn't
> it be enough to copy the tuple just once?
When enqueueing, the IndexTuple has to be copied from the scan descriptor
to somewhere in memory, with a TIDQueueItem pointing to it. Once we do
this, the IndexTuple memory should stick around until we free it, so,
yes, I'm not sure why I was seeing the IndexTuple no longer be valid when
I tried to put it in a slot. I'll have to do more investigation.
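For clarity, what I mean by copying once at enqueue time is roughly this
(tid_queue_push_item() is a placeholder name, and the item layout matches
the sketch above):

    TIDQueueItem item;

    item.tid = scan->xs_heaptid;
    if (scan->xs_itup)
    {
        item.is_index_tuple = true;
        item.tuple.itup = CopyIndexTuple(scan->xs_itup);    /* palloc'd copy */
    }
    else if (scan->xs_hitup)
    {
        item.is_index_tuple = false;
        item.tuple.htup = heap_copytuple(scan->xs_hitup);   /* palloc'd copy */
    }
    tid_queue_push_item(queue, &item);

    /*
     * Both copies are palloc'd in the current memory context, so as long as
     * that context lives at least as long as the queue, the tuples should
     * still be valid when the streaming read API hands the buffer back.
     */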
> FWIW if we decide to pin multiple index pages (to make kill_prior_tuple
> work), that would also mean we don't need to copy any tuples, right? We
> could point into the buffers for all of them, right?
Yes, this would be a nice benefit.
> > Other issues/gaps in my implementation:
> >
> > Determining where to allocate the memory for the streaming read object
> > and the TID queue is an outstanding TODO. To implement a fallback
> > method for cases in which streaming read doesn't work, I set the queue
> > size to 1. This is obviously not good.
> >
>
> I think IndexFetchTableData seems like a not entirely terrible place for
> allocating the pgsr, but I wonder what Andres thinks about this. IIRC he
> advocated for doing the prefetching in executor, and I'm not sure
> heapam_handler.c + relscan.h is what he imagined ...
>
> Also, when you say "obviously not good" - why? Are you concerned about
> the extra overhead of shuffling stuff between queues, or something else?
Well, I didn't resize the queue, I just limited how much of it we can
use to a single member (thus wasting the other memory). But resizing a
queue isn't free either. Also, I wondered if a queue size of 1 for
index AMs using the fallback method is too confusing (like it is a
fake queue?). But, I'd really, really rather not maintain both a queue
and non-queue control flow for Index[Only]Next(). The maintenance
overhead seems like it would outweigh the potential downsides.
> > Right now, I allocate the TID queue and streaming read objects in
> > IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in
> > index_beginscan() (and index_beginscan_parallel()) is tricky though
> > because we don't know the scan direction at that point (and the scan
> > direction can change). There are also callers of index_beginscan() who
> > do not call Index[Only]Next() (like systable_getnext() which calls
> > index_getnext_slot() directly).
> >
>
> Yeah, not sure this is the right layering ... the initial patch did
> everything in individual index AMs, then it moved to indexam.c, then to
> executor. And this seems to move it to lower layers again ...
If we do something like make the index AM responsible for the TID
queue (as mentioned above as a potential solution to the kill prior
tuple issue), then we might be able to allocate the TID queue in the
index AMs?
As for the streaming read object, if we were able to solve the issue
where callers of index_beginscan() don't call Index[Only]Next() (and
thus shouldn't allocate a streaming read object), then it seems easy
enough to move the streaming read object allocation into the table
AM-specific begin scan method.
> > Also, my implementation does not yet have the optimization Tomas does
> > to skip prefetching recently prefetched blocks. As he has said, it
> > probably makes sense to add something to do this in a lower layer --
> > such as in the streaming read API or even in bufmgr.c (maybe in
> > PrefetchSharedBuffer()).
> >
>
> I agree this should happen in lower layers. I'd probably do this in the
> streaming read API, because that would define "scope" of the cache
> (pages prefetched for that read). Doing it in PrefetchSharedBuffer seems
> like it would do a single cache (for that particular backend).
Hmm. I wonder if there are any upsides to having the cache be
per-backend. Though, that does sound like a whole other project...
- Melanie