Quick Links

Re: Optimize single tuple fetch from nbtree index

From:	Floris Van Nee <florisvannee(at)Optiver(dot)com>
To:	Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Optimize single tuple fetch from nbtree index
Date:	2019-08-27 07:23:18
Message-ID:	1566890597643.9842@Optiver.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

>> It seems that it contradicts the very idea of your patch, so probably we
>> should look for other ways to optimize this use-case.
>> Maybe this restriction can be relaxed for write only tables, that never
>> have to reread the page because of visibility, or something like that.
>> Also we probably can add to IndexScanDescData info about expected number
>> of tuples, to allow index work more optimal
>> and avoid the overhead for other loads.=

> The idea of the patch is exactly to relax this limitation. I forgot to update that README file though. The current implementation of the patch should be correct like this - that's why I added the look-back code on the page if the tuple couldn't be found anymore on the same location on the page. Similarly, it'll look on the page to the right if it detected a page split. These two measures combined should give a correct implementation of the 'it's possible that a scan stops in the middle of a page' relaxation. However, as Peter and Tom pointed out earlier, they feel that the performance advantage that this approach gives, does not outweigh the extra complexity at this time. I'd be open to other suggestions though.

Although now that I think of it - do you mean the case where the tuple that we returned to the caller after _bt_first actually gets deleted (not moved) from the page? I guess that can theoretically happen if _bt_first returns a non-visible tuple (but not DEAD yet in the index at the time of _bt_first). For my understanding, would a situation like the following lead to this (in my patch)?
1) Backend 1 does an index scan and returns the first tuple on _bt_first - this tuple is actually deleted in the heap already, however it's not marked dead yet in the index.
2) Backend 1 does a heap fetch to check actual visibility and determines the tuple is actually dead
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and _bt_next) backend 2 comes in and manages to somehow do 1) a _bt_killitems on the page to mark tuples dead as well as 2) compact items on the page, thereby actually removing this item from the page.
4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to locate the tuple where it left off, but cannot find it anymore because it got removed completely by backend 2.

If this is indeed possible then it's a bad issue unfortunately, and quite hard to try to reproduce, as a lot of things need to happen concurrently while doing a visiblity check.

As for your patch, I've had some time to take a look at it. For the two TODOs:

+ /* TODO Is it possible that currPage is not valid anymore? */
+ Assert(BTScanPosIsValid(so->currPos))

This Assert exists already a couple of lines earlier at the start of this function.

+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */

The comments in the source mention the following about this:
* We note the buffer's block number so that we can release the pin later.
* This allows us to re-read the buffer if it is needed again for hinting.
*/
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

As we figured out earlier, so->currPos.buf gets set to invalid when we release the pin by the unpin macro. So, if we don't store currPage number somewhere else, we cannot obtain the pin again if we need it during killitems. I think that's the reason that currPage is stored.

Other than the two TODOs in the code, I think the comments really help clarifying what's going on in the code - I'd be happy if this gets added.

-Floris

In response to

Re: Optimize single tuple fetch from nbtree index at 2019-08-26 20:22:35 from Floris Van Nee

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Richard Guo	2019-08-27 07:56:59	Re: A problem about partitionwise join
Previous Message	Michael Paquier	2019-08-27 07:04:48	Re: Fault injection framework