Buffering in tuplesort.c for in-sort deduplication; nbtree edition

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andy Fan <zhihuifan1213(at)163(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(at)vondra(dot)me>
Subject: Buffering in tuplesort.c for in-sort deduplication; nbtree edition
Date: 2024-12-12 00:21:52
Message-ID: CAEze2WhRFzd=nvh9YevwiLjrS1j1fP85vjNCXAab=iybZ2rNKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In the GIN parallel build thread [0] I proposed adding a 'flush'
callback to the tuplesort API [1], to be called by the tuplesort
machinery after writing each run, so that users of tuplesort can
buffer write calls within each sorted run and use that buffer state to
merge/deduplicate sorttuples before writing them to disk (assuming
some invariants are held).
By implementing deduplication the user can thus reduce the number of
sortable tuples and the total working set size significantly, thus
later reducing time spent reading and processing those runs and
speeding up the build time for tuplesorts with many duplicate sort
keys, but with mergable output.

As the GIN parallel index patch was already large and complex enough,
it was suggested to split it into a separate patch. I'd said that I'd
try to build one, so here is a patch that adds the required API to the
tuplesort internals, and implements the 'buffer, deduplicate, flush'
idea for nbtree index builds that can benefit from deduplication.

If deduplication is enabled for the btree index, we now merge tuples
up to the tuplesort buffer size while we're still extracting tuples
from the heap, thereby reducing the temporary storage requirement of
the index build.

One complication that this patch needs to deal with is that parallel
scans (and syncscan wraparounds) can cause some higher TIDs to appear
before lower tids in two posting lists (e.g. lists [1, 4] and [2, 3]).
Therefore, we must take some special care while loading the tree to
make sure we only write out TIDs which are smaller than the latest
tuple's TID; which is implemented with a growing TID buffer.

As part of these changes, there are some changes to the btree build
behaviour. Where previously we wouldn't ever create posting lists
larger than the target posting list size (812 bytes), we can now emit
some posting tuples with up to 10 [^3] TIDs and up to BTMaxItemSize
large when the non-deduplicated base tuple would otherwise be large
enough to consume that 812-byte posting list size limit.

Local testing shows that index builds can see 60%+ reduced temporary
disk usage when indexing values with high duplication factors -storage
comparable to the resulting index- and I've seen 50% index reduced
build times for some text-based deduplication workloads.

Note: this isn't yet very polished, but it works. I'm reasonably happy
with happy-path performance so far, but I've seen bad cases (unique
dataset without UNIQUE specifier) regress by as much as 30%, so this
is definitely not (yet?) a panacea.

It should be noted that UNIQUE btree builds still make use of the old
non-duplicating tuplesort code, meaning they can be used to detect
regressions in this patch (vs a non-UNIQUE index build with otherwise
the same definition).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://postgr.es/m/flat/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com
[1] https://postgr.es/m/CAEze2WjB1vpxtvKuWVEThSaB-v4%2B8H0EXsOB%3DyLAv8pLcrQuKw%40mail.gmail.com
[2] https://postgr.es/m/CAEze2WiTAeZe4t5wAeRN834xFBqROPmjeK2XTstNko6bbVPX%3DA%40mail.gmail.com

[^3] Chosen by fair dice roll

Attachment Content-Type Size
v0-0001-Allow-tuplesort-implementations-to-buffer-writes.patch application/x-patch 6.0 KB
v0-0002-Add-tuplesort-level-deduplication-to-nbtree-build.patch application/octet-stream 46.8 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2024-12-12 00:34:29 Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.
Previous Message Michael Paquier 2024-12-12 00:20:15 Re: Fix comments related to pending statistics