Re: Index tuple deduplication limitations in pg13

From: Matthias van de Meent <matthias(dot)vandemeent(at)cofano(dot)nl>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Index tuple deduplication limitations in pg13
Date: 2020-08-18 18:51:52
Message-ID: CAAs3B9obUeXBhGGcFjoSVn8+C+g4Dd0Aa0t5p+F3BaDVVfZnvw@mail.gmail.com
Lists: pgsql-general

On Tue, 18 Aug 2020 at 18:44, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent
> <matthias(dot)vandemeent(at)cofano(dot)nl> wrote:
> > But, if the ordering of operator-class equal tuples is already
> > system-defined, could the physical ordering of index tuples in a btree
> > (with deduplication enabled for "unsafe" opclasses) be updated from
> > [index_columns, tid] to [index_columns,
> > image_compare(non_datum_equal_columns), tid], giving a stable sorting
> > of opclass-equal and image-equal values and enabling safe consistent
> > deduplication?
>
> The issue isn't the physical ordering. The issue is that we cannot
> allow the implementation to destroy semantic differences among equal
> datums.

Deduplication does not need to destroy semantic differences, though.
'Equal' can (in my book) mean:
- 'opclass-equal': the opclass returns true for an equality check, or
- 'binary equal' or 'datum image equal' (I believe 'datum image' is the
right term): the unprocessed on-disk representations of the compared
values are indistinguishable.

Runs of 'binary equal' datums can be freely deduplicated [0] when found.
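
To make the distinction concrete, a rough sketch; it uses the binary
send function as a stand-in for the on-disk datum image, which is only
an approximation of the comparison the server does internally:

SELECT '5.000'::numeric = '5'::numeric AS opclass_equal,  -- true
       numeric_send('5.000'::numeric)
         = numeric_send('5'::numeric)  AS binary_equal;   -- false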

> We avoid deduplication with cases where two equal datums are
> visibly different. For example, it would not be okay if we forgot that
> your numeric datum was originally input as '5.000', and output '5'
> later on.

I agree on that point. But isn't this display scale also stored in the
'datum image' [1], so that two opclass-equal values could still be
distinguished while deduplicating, putting only binary-equal values in
a posting list? E.g. all values generated through '0.000'::numeric can
be compressed into one posting tuple without information loss, as the
datum image is effectively the same for all of them (unless I have
missed something), and distinct from the datum image of '0'::numeric.
With enough equal values, you would eventually have a posting list for
each distinct datum image among the opclass-equal values.
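
As a rough illustration of the posting lists that would result,
grouping by the datum image (again approximated with the binary send
format):

SELECT numeric_send(num) AS image, count(*) AS members
FROM (VALUES ('0'::numeric), ('0.0'), ('0.00'),
             ('0'), ('0.0'), ('0')) AS v(num)
GROUP BY 1;
-- three groups (of 3, 2 and 1 members), each a candidate posting list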

Given that the above could work, the current btree tuple ordering is
not optimal for values that are opclass-equal but datum image-distinct:
the ordering of opclass-equal values is currently determined only by
tid, so the stored order might be e.g. ['0.0', '0', '0.00', '0', '0.0',
'0']. Deduplication would work better if that were stored as e.g.
['0', '0', '0', '0.0', '0.0', '0.00'], which is why I suggested adding
an ordering by the datum image before the tid ordering (see the sketch
below). Additionally, this extra ordering prevents the problem of [0]:
a datum with a different image is never inserted into a posting list of
otherwise equal values, because it sorts either before or after that
posting list, never inside it.
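
A minimal sketch of that ordering, again using the binary send format
as a stand-in for the datum image comparison (the tid would remain the
final tiebreaker):

SELECT num
FROM (VALUES ('0.0'::numeric), ('0'), ('0.00'),
             ('0'), ('0.0'), ('0')) AS v(num)
ORDER BY num, numeric_send(num);
-- returns 0, 0, 0, 0.0, 0.0, 0.00: equal datum images end up adjacent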

> If we wanted to fix this for numeric, we'd have to invent a new
> numeric datatype (called numeric2, say). That probably isn't as hard
> as it sounds, since it could be part of the same B-Tree operator
> family as numeric. It could also be implicitly cast to numeric.
>
> --
> Peter Geoghegan

[0]
Inserting a row with TID ntid into a deduplicated index can encounter a
posting list of opclass-equal but not datum image-equal tuples where
the lowest TID of the posting list is less than ntid and ntid is less
than the highest TID of the posting list. This would require a posting
list split to accommodate the new tuple's index entry in order to not
lose data.

[1]
-- A table with 1000 distinct values, each duplicated 700 times, but
-- split into 7x100 'datum-equal' value groups that should be
-- deduplicatable.
SELECT (n % 1000 || '.' || repeat('0', n % 7))::numeric AS num
INTO numbers
FROM generate_series(1, 700000) series(n);

CREATE INDEX numeric_index ON numbers (num);
ANALYZE numbers;

-- This is an index-only scan, and the results show the numeric scale,
-- so that information must be stored in the index.
SELECT num FROM numbers ORDER BY num LIMIT 700;
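
If you want to see the resulting posting lists directly, something like
the following should work (assuming the pageinspect extension is
available; block 1 is usually a leaf page, but the exact block to look
at may differ):

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT itemoffset, ctid, itemlen, htid, tids
FROM bt_page_items('numeric_index', 1);
-- in pg13, deduplicated entries list multiple heap TIDs in 'tids'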
