Re: Index Skip Scan

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, florisvannee(at)optiver(dot)com, pg(at)bowt(dot)ie, jesper(dot)pedersen(at)redhat(dot)com, david(dot)rowley(at)2ndquadrant(dot)com, alvherre(at)2ndquadrant(dot)com, thomas(dot)munro(at)gmail(dot)com, jtc331(at)gmail(dot)com, rafia(dot)pghackers(at)gmail(dot)com, jeff(dot)janes(at)gmail(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru
Subject: Re: Index Skip Scan
Date: 2020-02-08 16:59:37
Message-ID: 20200208165937.mtnpmawawfzgm4sb@development
Lists: pgsql-hackers

On Sat, Feb 08, 2020 at 04:24:40PM +0100, Dmitry Dolgov wrote:
>> On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote:
>>
>> I've done some testing and benchmarking of the v31 patch, looking for
>> regressions, costing issues etc. Essentially, I've run a bunch of SELECT
>> DISTINCT queries on data sets of various sizes, numbers of distinct values,
>> etc. The results are fairly large, so I've uploaded them to github
>>
>> https://github.com/tvondra/skip-scan-test
>
>Thanks a lot for testing!
>
>> There are a couple of regressions, where the plan with skipscan enabled
>> is ~10x slower. But this seems to happen only in high-cardinality cases
>> where we misestimate the number of groups. Consider a table with two
>> independent columns
>>
>> CREATE TABLE t (a text, b text);
>> INSERT INTO t SELECT
>> md5((10000*random())::int::text),
>> md5((10000*random())::int::text)
>> FROM generate_series(1,1000000) s(i);
>>
>> CREATE INDEX ON t(a,b);
>>
>> ANALYZE;
>>
>> which then behaves like this:
>>
>> test=# select * from (select distinct a,b from t) foo offset 10000000;
>> Time: 3138.222 ms (00:03.138)
>> test=# set enable_indexskipscan = off;
>> Time: 0.312 ms
>> test=# select * from (select distinct a,b from t) foo offset 10000000;
>> Time: 199.749 ms
>>
>> So in this case the skip scan is ~15x slower than the usual plan (index
>> only scan + unique). The reason why this happens is pretty simple - to
>> estimate the number of groups we multiply the ndistinct estimates for
>> the two columns (which both have n_distinct = 10000), but then we cap
>> the estimate to 10% of the table. But when the columns are independent
>> with high cardinalities, that under-estimates the actual value, making
>> the skip scan cost much lower than it should be.
>
>The current implementation checks if we can find the next value on the
>same page, as a shortcut that avoids a full tree traversal and improves
>this kind of situation, but I can easily imagine that it's still not
>enough in some extreme cases.
>

Yeah, I'm not sure there's room for further improvement. The regressed
cases were subject to the 10% cap, and with the actual ndistinct being
more than 10% of the table we can probably find many distinct keys on
each index page - even by the capped estimate the values change every
~10 rows. (In the example above the estimate gets capped at 100,000
groups, while the actual number of distinct (a,b) combinations is close
to 1M, i.e. almost every row starts a new group.)
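
FWIW the extended stats fix mentioned below is a one-liner. A sketch
for the example table above (the statistics object name is arbitrary):

  -- give the planner the actual number of distinct (a,b) combinations,
  -- instead of the capped product of the per-column ndistinct estimates
  CREATE STATISTICS t_a_b_ndistinct (ndistinct) ON a, b FROM t;
  ANALYZE t;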

>> I don't think this is an issue the skipscan patch needs to fix, though.
>> Firstly, the regressed cases are a tiny minority. Secondly, we already
>> have a way to address the root cause - creating extended stats with
>> ndistinct coefficients generally makes the problem go away.
>
>Yes, I agree.
>
>> One interesting observation, however, is that this regression only
>> happened with text columns but not with int or bigint. My assumption is
>> that this is due to text comparisons being much more expensive. Not sure
>> if there is something we could do to deal with this - reduce the number
>> of comparisons or something?
>
>Hm, interesting. I need to check that we do not do any unnecessary
>comparisons.
>
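
For the record, the int/bigint variants I compared against were
essentially the same test with the md5() calls dropped, i.e. roughly
this (a sketch - the table name here is made up):

  CREATE TABLE t_int (a int, b int);
  INSERT INTO t_int SELECT
      (10000*random())::int,
      (10000*random())::int
  FROM generate_series(1,1000000) s(i);

  CREATE INDEX ON t_int(a,b);
  ANALYZE t_int;

Same data distribution, just much cheaper per-key comparisons.
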
>> On Sat, Feb 08, 2020 at 02:11:59PM +0100, Tomas Vondra wrote:
>> > Yes, I've mentioned that already in one of the previous emails :) The
>> > simplest way I see to achieve what we want is to do something like the
>> > attached modified version with a new hasDeclaredCursor field. It's not a
>> > final version, but posted just for discussion, so feel free to
>> > suggest any improvements or alternatives.
>>
>> IMO the proper fix for this case (moving forward, reading backwards) is
>> simply making it work by properly checking deleted tuples etc. Not sure
>> why that would be so much more complex (haven't tried implementing it)?
>
>It's probably not that complex by itself, but requires changing how the
>responsibilities are isolated. At the moment the current implementation
>leaves jumping over the tree fully to _bt_skip, and heap visibility
>checks only to IndexOnlyNext. To check deleted tuples properly we need
>to either verify the visibility of the corresponding heap tuple inside
>_bt_skip (as I've
>mentioned in one of the previous emails, checking if an index tuple is
>dead is not enough), or teach the code in IndexOnlyNext to understand
>that _bt_skip can lead to returning the same tuple while moving forward
>& reading backward. Do you think it still makes sense to go this way?

Not sure. I have to think about this first.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
