Re: Index Skip Scan (new UniqueKeys)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: Index Skip Scan (new UniqueKeys)
Date: 2020-09-22 00:59:32
Message-ID: CAH2-Wz=z8yXENbUWgP+3PhfS7HfB9bj01XuuSY8bj4jasQj_3A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Aug 15, 2020 at 7:09 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> Here is a new version that hopefully address most of the concerns
> mentioned in this thread so far. As before, first two patches are taken
> from UniqueKeys thread and attached only for the reference. List of
> changes includes:

Some thoughts on this version of the patch series (I'm focussing on
v36-0005-Btree-implementation-of-skipping.patch again):

* I see the following compiler warning:

/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
In function ‘populate_baserel_uniquekeys’:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
warning: ‘expr’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
797 | else if (!list_member(unique_index->rel->reltarget->exprs, expr))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Perhaps the warning is related to this nearby code that I noticed
Valgrind complains about:

==1083468== VALGRINDERROR-BEGIN
==1083468== Invalid read of size 4
==1083468== at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
==1083468== by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
==1083468== by 0x56AEA5: set_plain_rel_size (allpaths.c:586)
==1083468== by 0x56AADB: set_rel_size (allpaths.c:412)
==1083468== by 0x56A8CD: set_base_rel_sizes (allpaths.c:323)
==1083468== by 0x56A5A7: make_one_rel (allpaths.c:185)
==1083468== by 0x5AB426: query_planner (planmain.c:269)
==1083468== by 0x5AF02C: grouping_planner (planner.c:2058)
==1083468== by 0x5AD202: subquery_planner (planner.c:1015)
==1083468== by 0x5ABABF: standard_planner (planner.c:405)
==1083468== by 0x5AB7F8: planner (planner.c:275)
==1083468== by 0x6E6F84: pg_plan_query (postgres.c:875)
==1083468== by 0x6E70C4: pg_plan_queries (postgres.c:966)
==1083468== by 0x6E7497: exec_simple_query (postgres.c:1158)
==1083468== by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468== by 0x624284: BackendRun (postmaster.c:4541)
==1083468== by 0x623995: BackendStartup (postmaster.c:4225)
==1083468== by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468== by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468== by 0x514AF2: main (main.c:209)
==1083468== Address 0x75f13e0 is 4,448 bytes inside a block of size
8,192 alloc'd
==1083468== at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1083468== by 0x8C15C8: AllocSetAlloc (aset.c:919)
==1083468== by 0x8CEA52: palloc (mcxt.c:964)
==1083468== by 0x267F25: systable_beginscan (genam.c:373)
==1083468== by 0x8682CE: SearchCatCacheMiss (catcache.c:1359)
==1083468== by 0x868167: SearchCatCacheInternal (catcache.c:1299)
==1083468== by 0x867E2C: SearchCatCache1 (catcache.c:1167)
==1083468== by 0x8860B2: SearchSysCache1 (syscache.c:1123)
==1083468== by 0x8BD482: check_enable_rls (rls.c:66)
==1083468== by 0x68A113: get_row_security_policies (rowsecurity.c:134)
==1083468== by 0x683C2C: fireRIRrules (rewriteHandler.c:2045)
==1083468== by 0x687340: QueryRewrite (rewriteHandler.c:3962)
==1083468== by 0x6E6EB1: pg_rewrite_query (postgres.c:784)
==1083468== by 0x6E6D23: pg_analyze_and_rewrite (postgres.c:700)
==1083468== by 0x6E7476: exec_simple_query (postgres.c:1155)
==1083468== by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468== by 0x624284: BackendRun (postmaster.c:4541)
==1083468== by 0x623995: BackendStartup (postmaster.c:4225)
==1083468== by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468== by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==
==1083468== VALGRINDERROR-END

(You'll see the same error if you run Postgres Valgrind + "make
installcheck", though I don't think that the queries in question are
tests that you yourself wrote.)

* IndexScanDescData.xs_itup comments could stand to be updated here --
IndexScanDescData.xs_want_itup is no longer just about index-only
scans.

* Do we really need the AM-level boolean flag/argument named
"scanstart"? Why not just follow the example of btgettuple(), which
determines whether or not the scan has been initialized based on the
current scan position?

Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
cannot or should not take the same approach as btgettuple(). And even
if you can't take exactly the same approach, I would still think that
the scan's opaque B-Tree state should remember if it's the first call
to _bt_skip() (rather than some subsequent call) in some other way
(e.g. carrying a "scanstart" bool flag directly).

A part of my objection to "scanstart" is that it seems to require that
much of the code within _bt_skip() get another level of
indentation...which makes it even more difficult to follow.

* I don't understand what _bt_scankey_within_page() comments mean when
they refer to "the page highkey". It looks like this function examines
the highest data item on the page, not the high key.

It is highly confusing to refer to a tuple as the page high key if it
isn't the tuple from the P_HIKEY offset number on a non-rightmost
page, which is a pivot tuple even on the leaf level (as indicated by
BTreeTupleIsPivot()).

* Why does _bt_scankey_within_page() have an unused "ScanDirection
dir" argument?

* Why is it okay to do anything important based on the
_bt_scankey_within_page() return value?

If the page is empty, then how can we know that it's okay to go to the
next value? I'm concerned that there could be subtle bugs in this
area. VACUUM will usually just delete the empty page. But it won't
always do so, for a variety of reasons that aren't worth going into
now. This could mask bugs in this area. I'm concerned about patterns
like this one from _bt_skip():

while (!nextFound)
{
....

if (_bt_scankey_within_page(scan, so->skipScanKey,
so->currPos.buf, dir))
{
...
}
else
/*
* If startItup could be not found within the current page,
* assume we found something new
*/
nextFound = true;
....
}

Why would you assume that "we found something new" here? In general I
just don't understand the design of _bt_skip(). I get the basic idea
of what you're trying to do, but it could really use better comments.

*The "jump one more time if it's the same as at the beginning" thing
seems scary to me. Maybe you should be doing something with the actual
high key here.

* Tip: You might find cases involving "empty but not yet deleted"
pages a bit easier to test by temporarily disabling page deletion. You
can modify nbtree.c to look like this:

index a1ad22f785..db977a0300 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1416,6 +1416,7 @@ backtrack:
Assert(!attempt_pagedel || nhtidslive == 0);
}

+ attempt_pagedel = false;
if (attempt_pagedel)
{
MemoryContext oldcontext;

That's all I have for now.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message tsunakawa.takay@fujitsu.com 2020-09-22 01:17:38 RE: Transactions involving multiple postgres foreign servers, take 2
Previous Message tsunakawa.takay@fujitsu.com 2020-09-22 00:47:52 RE: Global snapshots