Re: DRAFT: Pass sk_attno to consistent function

From: Michał Kłeczek <michal(at)kleczek(dot)org>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: DRAFT: Pass sk_attno to consistent function
Date: 2024-03-23 16:32:35
Message-ID: A2E2B18B-6A5D-4BEC-965B-81D0E5A62AD6@kleczek.org
Lists: pgsql-hackers

> On 22 Mar 2024, at 10:11, Michał Kłeczek <michal(at)kleczek(dot)org> wrote:
>
>>
>> On 22 Mar 2024, at 01:29, Michał Kłeczek <michal(at)kleczek(dot)org> wrote:
>>
>>
>>
>>> On 21 Mar 2024, at 23:42, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> wrote:
>>>
>>>
>>> You seem to already be using your own operator class, so you may want
>>> to look into CREATE FUNCTION's support_function parameter; which would
>>> handle SupportRequestIndexCondition and/or SupportRequestSimplify.
>>> I suspect a support function that emits multiple clauses that each
>>> apply to only a single partition at a time should get you quite far if
>>> combined with per-partition constraints that filter all but one of
>>> those. Also, at plan-time you can get away with much more than at
>>> runtime.
>
> Thinking about it some more - the suggestion goes backwards - ie. there must have been some deep misunderstanding:
>
> If I was able to write my query such that the planner generates optimal plan, I would not implement the custom operator in the first place.

To make it more concrete:

create extension pg_trgm;
create extension btree_gist; -- assumption: needed so the plain text pk column has a default GiST operator class for the first index below
create table t ( pk text not null primary key, data text not null ) partition by hash (pk);
create table t_2_0 partition of t for values with ( modulus 2, remainder 0 );
create table t_2_1 partition of t for values with ( modulus 2, remainder 1 );
create index ti on t using gist ( pk, data gist_trgm_ops);

explain select * from t where pk = any (array['1', '2', '4', '5']) and data % 'what' order by data <-> 'what' limit 5;

 Limit  (cost=41.32..41.33 rows=2 width=21)
   ->  Sort  (cost=41.32..41.33 rows=2 width=21)
         Sort Key: ((t.data <-> 'what'::text))
         ->  Append  (cost=16.63..41.31 rows=2 width=21)
               ->  Bitmap Heap Scan on t_2_0 t_1  (cost=16.63..20.65 rows=1 width=21)
                     Recheck Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND (data % 'what'::text))
                     ->  Bitmap Index Scan on t_2_0_pk_data_idx  (cost=0.00..16.63 rows=1 width=0)
                           Index Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND (data % 'what'::text))
               ->  Bitmap Heap Scan on t_2_1 t_2  (cost=16.63..20.65 rows=1 width=21)
                     Recheck Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND (data % 'what'::text))
                     ->  Bitmap Index Scan on t_2_1_pk_data_idx  (cost=0.00..16.63 rows=1 width=0)
                           Index Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND (data % 'what'::text))

That’s bad as the number of records to sort might be huge. So:

set enable_bitmapscan to off;

 Limit  (cost=0.30..242.85 rows=2 width=21)
   ->  Merge Append  (cost=0.30..242.85 rows=2 width=21)
         Sort Key: ((t.data <-> 'what'::text))
         ->  Index Scan using t_2_0_pk_data_idx on t_2_0 t_1  (cost=0.15..121.40 rows=1 width=21)
               Index Cond: (data % 'what'::text)
               Order By: (data <-> 'what'::text)
               Filter: (pk = ANY ('{1,2,4,5}'::text[]))
         ->  Index Scan using t_2_1_pk_data_idx on t_2_1 t_2  (cost=0.15..121.42 rows=1 width=21)
               Index Cond: (data % 'what'::text)
               Order By: (data <-> 'what'::text)
               Filter: (pk = ANY ('{1,2,4,5}'::text[]))

That’s bad as well, since pk = ANY([…]) is not in the Index Cond.

Let’s use the ||= operator then:

drop index ti;
create index ti on t using gist ( pk gist_extra_text_ops, data gist_trgm_ops);

explain select * from t where pk = any (array['1', '2', '4', '5']) and pk ||= array['1', '2', '4', '5'] and data % 'what' order by data <-> 'what' limit 5;

 Limit  (cost=0.30..153.70 rows=2 width=21)
   ->  Merge Append  (cost=0.30..153.70 rows=2 width=21)
         Sort Key: ((t.data <-> 'what'::text))
         ->  Index Scan using t_2_0_pk_data_idx on t_2_0 t_1  (cost=0.15..76.84 rows=1 width=21)
               Index Cond: ((pk ||= '{1,2,4,5}'::text[]) AND (data % 'what'::text))
               Order By: (data <-> 'what'::text)
               Filter: (pk = ANY ('{1,2,4,5}'::text[]))
         ->  Index Scan using t_2_1_pk_data_idx on t_2_1 t_2  (cost=0.15..76.84 rows=1 width=21)
               Index Cond: ((pk ||= '{1,2,4,5}'::text[]) AND (data % 'what'::text))
               Order By: (data <-> 'what'::text)
               Filter: (pk = ANY ('{1,2,4,5}'::text[]))

That’s much better. There is still a Filter on the SAOP, but it is almost harmless: it is always true and does not require heap access.

A few observations:
1. I am not sure why a SAOP can be an Index Cond in the case of a Bitmap Index Scan but not in the case of an Index Scan - I don’t know what the interplay between the planner and amsearcharray is.
2. In all cases the array passed to the (Bitmap) Index Scan is NOT filtered by partition pruning.

The second point means useless index page reads (amplified by the number of elements in the input array and the width of the index).

It is the _second_ point I would like to address with this patch. For us it makes a big difference, as it means almost an order of magnitude (around 5-10 times, based on preliminary tests) fewer buffers read.
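
To illustrate why filtering the array per partition is possible at all: for hash partitioning the routing information is readily available - the partition constraints themselves are built on the internal helper satisfies_hash_partition(). A quick query (purely illustrative, not part of the patch) shows that only a subset of the array elements can ever match a given partition, e.g. t_2_1 (modulus 2, remainder 1):

select elem,
       satisfies_hash_partition('t'::regclass, 2, 1, elem) as routes_to_t_2_1
from unnest(array['1', '2', '4', '5']) as u(elem);

Elements for which this returns false could be skipped by the scan of that partition’s index without reading any index pages.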

I’ve done an experiment with adding partitioning bounds information as options when creating the index. The table is as above (so only two index columns) but with 128 partitions (modulus 128).

Without filtering array elements we get 103 buffers read:

explain (analyse, buffers) select * from t where pk ||= array['1', '5', '10', '2', '3', '4', '23', '5', '7', '0'] and pk = any (array['1', '5', '10', '2', '3', '4', '23', '5', '7', '0']) and data % 'ever 3' order by data <-> '3' limit 5;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2.82..43.07 rows=5 width=32) (actual time=8.093..9.390 rows=4 loops=1)
   Buffers: shared hit=103
   ->  Merge Append  (cost=2.82..75.28 rows=9 width=32) (actual time=8.091..9.387 rows=4 loops=1)
         Sort Key: ((t.data <-> '3'::text))
         Buffers: shared hit=103
         ->  Index Scan using t_128_9_pk_data_idx on t_128_9 t_1  (cost=0.30..8.33 rows=1 width=32) (actual time=0.741..0.741 rows=0 loops=1)
               Index Cond: ((pk ||= '{1,5,10,2,3,4,23,5,7,0}'::text[]) AND (data % 'ever 3'::text))
               Order By: (data <-> '3'::text)
               Filter: (pk = ANY ('{1,5,10,2,3,4,23,5,7,0}'::text[]))
               Buffers: shared hit=10
         … more partitions

After creating indexes for all partitions, with some metadata used by the consistent function to filter out array elements:

create index ti_128_0 on t_128_0 using gist ( pk gist_extra_text_ops (modulus=128, remainder=0), data gist_trgm_ops (siglen=128) );
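
The statements for the other 127 partitions are analogous. As a convenience (a sketch only, assuming the experimental modulus/remainder opclass options shown above), the DDL can be generated and then executed, e.g. with \gexec in psql:

select format(
         'create index %I on %I using gist '
         '(pk gist_extra_text_ops (modulus=%s, remainder=%s), data gist_trgm_ops (siglen=128));',
         'ti_128_' || r, 't_128_' || r, 128, r)
from generate_series(0, 127) as r;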

The result is only 29 buffers read:

explain (analyse, buffers) select * from t where pk ||= array['1', '5', '10', '2', '3', '4', '23', '5', '7', '0'] and pk = any (array['1', '5', '10', '2', '3', '4', '23', '5', '7', '0']) and data % 'ever 3' order by data <-> '3' limit 5;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2.82..43.07 rows=5 width=32) (actual time=0.852..0.864 rows=4 loops=1)
   Buffers: shared hit=29
   ->  Merge Append  (cost=2.82..75.28 rows=9 width=32) (actual time=0.850..0.862 rows=4 loops=1)
         Sort Key: ((t.data <-> '3'::text))
         Buffers: shared hit=29
         ->  Index Scan using ti_128_9 on t_128_9 t_1  (cost=0.30..8.33 rows=1 width=32) (actual time=0.045..0.045 rows=0 loops=1)
               Index Cond: ((pk ||= '{1,5,10,2,3,4,23,5,7,0}'::text[]) AND (data % 'ever 3'::text))
               Order By: (data <-> '3'::text)
               Filter: (pk = ANY ('{1,5,10,2,3,4,23,5,7,0}'::text[]))
               Buffers: shared hit=1
         … more partitions

That is roughly 3.5x fewer buffers read in this case (103 vs 29).
It makes a huge difference in memory pressure in our case (10 billion rows, 10 TB of data, and a wide index to support various search criteria - text/similarity search among them).

I understand that the extensibility of GiST makes many operators opaque to the planner, and that it is the “consistent” function that can perform such optimisations (or we could come up with some additional mechanism during the planning phase).
Providing more information to the “consistent” function would make it possible to implement optimisations not only for array scan keys but also for ranges and others.

What we can do is provide the index attribute number (redundantly) as an option - that would work but is somewhat ugly, especially since this information is already available when calling the consistent function.


Michal
