Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Юрий Соколов <funny(dot)falcon(at)gmail(dot)com>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2020-02-19 16:14:03
Message-ID: ed6caece-5a64-ad02-2134-1e71fb4b35a1@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 14.02.2020 05:57, Peter Geoghegan wrote:

> Attached is v33, which adds the last piece we need: opclass
> infrastructure that tells nbtree whether or not deduplication can be
> applied safely. This is based on work by Anastasia that was shared
> with me privately.
Thank you for this work. I've looked through the patches and they seem
to be ready for commit.
I haven't yet read recent documentation and readme changes, so maybe
I'll send some more feedback tomorrow.

> New opclass proc
> ================
>
> In general, supporting deduplication is the rule for B-Tree opclasses,
> rather than the exception. Most can use the generic
> btequalimagedatum() routine as their support function 4, which
> unconditionally indicates that deduplication is safe. There is a new
> test that tries to catch opclasses that omitted to do this. Here is
> the opr_sanity.out changes added by the first patch:
>
> -- Almost all Btree opclasses can use the generic btequalimagedatum function
> -- as their equalimage proc (support function 4). Look for opclasses that
> -- don't do so; newly added Btree opclasses will usually be able to support
> -- deduplication with little trouble.
> SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name,
> opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
> FROM pg_am am
> JOIN pg_opclass opc ON opc.opcmethod = am.oid
> JOIN pg_opfamily opf ON opc.opcfamily = opf.oid
> LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND
> amproclefttype = opcintype AND
> amprocnum = 4
> WHERE am.amname = 'btree' AND
> amproc IS DISTINCT FROM 'btequalimagedatum'::regproc
> ORDER BY amproc::regproc::text, opfamily_name, opclass_name;
> proc | opfamily_name | opclass_name | opcintype
> -------------------+------------------+------------------+------------------
> bpchar_equalimage | bpchar_ops | bpchar_ops | character
> btnameequalimage | text_ops | name_ops | name
> bttextequalimage | text_ops | text_ops | text
> bttextequalimage | text_ops | varchar_ops | text
> | array_ops | array_ops | anyarray
> | enum_ops | enum_ops | anyenum
> | float_ops | float4_ops | real
> | float_ops | float8_ops | double precision
> | jsonb_ops | jsonb_ops | jsonb
> | money_ops | money_ops | money
> | numeric_ops | numeric_ops | numeric
> | range_ops | range_ops | anyrange
> | record_image_ops | record_image_ops | record
> | record_ops | record_ops | record
> | tsquery_ops | tsquery_ops | tsquery
> | tsvector_ops | tsvector_ops | tsvector
> (16 rows)
>

Is there any specific reason, why we need separate btnameequalimage,
bpchar_equalimage and bttextequalimage functions?
As far as I see, they have the same implementation.
> Since using deduplication is supposed to pretty much be the norm from
> now on, it seemed like it might make sense to add a NOTICE about it
> during CREATE INDEX -- a notice letting the user know that it isn't
> being used due to a lack of opclass support:
>
> regression=# create table foo(bar numeric);
> CREATE TABLE
> regression=# create index on foo(bar);
> NOTICE: index "foo_bar_idx" cannot use deduplication
> CREATE INDEX
>
> Note that this NOTICE isn't seen with an INCLUDE index, since that's
> expected to not support deduplication.
>
> I have a feeling that not everybody will like this, which is why I'm
> pointing it out.
>
> Thoughts?

I would simply move it to debug level for all cases. Since from user's
perspective it doesn't differ that much from the case where
deduplication is applicable in general, but not very efficient due to
data distribution.
I also noticed that this is not consistent with ALTER index. For
example, alter index idx_n set (deduplicate_items =true); won't show any
message about deduplication.

I've tried several combinations with an index on a numeric column:

1) postgres=# create index idx_nd on tbl (n) with (deduplicate_items =
true);
NOTICE:  index "idx_nd" cannot use deduplication
CREATE INDEX

Here the message seems appropriate. I don't think, we should restrict
creation of the index even when deduplicate_items parameter is set
explicitly, rather we may warn the user that it won't be efficient.

2) postgres=# create index idx_n on tbl (n) with (deduplicate_items =
false);
NOTICE:  index "idx_n" cannot use deduplication
CREATE INDEX

In this case the message seems slightly strange to me.
Why should we show a notice about the fact that deduplication is not
possible if that is exactly what was requested?

3)
postgres=# create index idx on tbl (n);
NOTICE:  index "idx" cannot use deduplication

In my opinion, this message is too specific for default behavior. It
exposes internal details without explanation and may look to user like
something went wrong.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Juan José Santamaría Flecha 2020-02-19 16:16:32 Re: BUG #16108: Colorization to the output of command-line has unproperly behaviors at Windows platform
Previous Message Masahiko Sawada 2020-02-19 15:44:27 Re: Internal key management system