Re: Lock-free compaction. Why not?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Ahmed Yarub Hani Al Nuaimi <ahmedyarubhani(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Lock-free compaction. Why not?
Date: 2024-07-22 12:39:23
Message-ID: CA+TgmoZ=5PELaaQ1g-Xj5PFQYkgqHLDqk0hnWQvHC7sB1M14kQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jul 20, 2024 at 10:13 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The actually tricky part about that is that you have to ensure that
> any concurrent scan will see one of the two copies --- not both,
> and not neither. This is fairly hard when the concurrent query
> might be using any of several scan methods, and might or might not
> have visited the original tuple before you commenced the move.
> You can solve it by treating the move more like an UPDATE, that
> is, the new tuple gets a new XID, but that has its own downsides;
> notably that it must block/be blocked by concurrent real UPDATEs.

Yeah, this is always the part that has stumped me. I think there's
probably some way to find bit space to indicate whether an update is a
"real" update or a "move-the-tuple" update, although it might take a
bunch of engineering to get the job done, or end up being awkward in
some other way. But then what? You're still left with writing a bunch
of new index entries for the moved tuple, which will bloat the indexes
as you try to reorganize the heap. So I imagine you'd have to move
relatively small batches of tuples, vacuum, and then repeat, which
seems like it would take forever on a big table.

If you imagine a hypothetical world in which the block number in the
index entry is a logical block number rather than a physical block
number, then you could imagine shuffling the physical position of the
blocks around to put the non-empty logical blocks at the beginning of
the relation and then throw away the empty ones. You might still want
to migrate some rows from partially-filled blocks to empty ones, but
that seems like it would often require reindexing vastly fewer rows.
Imagine, for example, a relation with ten half-empty blocks at the
beginning followed by a million completely full blocks: only the rows
in those ten blocks would ever need new index entries. But now you've
also turned sequential scans into random I/O. It probably works better
if the logical->physical mapping works in multi-megabyte chunks rather
than block by block, but that seems like an awful lot of engineering
to do as foundational work, especially given that even after you do
all of that and build some kind of tuple relocator on top of it, you
still need a way to move around individual tuples when relocating
chunks isn't good enough.
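
To sketch that chunked indirection (every name and number below is
invented; nothing like this exists in core): with 8kB blocks and
1024-block chunks, each chunk covers 8MB, so even a 1TB relation
needs only about 128k map entries.

#include <stdint.h>

typedef uint32_t BlockNumber;

#define BLOCKS_PER_CHUNK    1024    /* 8MB per chunk at 8kB blocks */

/* Logical chunk number -> physical chunk number. Index entries keep
 * storing logical block numbers; only this map changes when chunks
 * are shuffled toward the front of the file so the tail can be
 * truncated. */
typedef struct
{
    uint32_t    nchunks;
    uint32_t   *phys_chunk;
} ChunkMap;

static BlockNumber
logical_to_physical(const ChunkMap *map, BlockNumber logical)
{
    uint32_t    chunk = logical / BLOCKS_PER_CHUNK;
    uint32_t    offset = logical % BLOCKS_PER_CHUNK;

    return map->phys_chunk[chunk] * BLOCKS_PER_CHUNK + offset;
}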

What the extensions that are out there seem to do, as I understand
it, is an online table rewrite with concurrent change capture; the
captured changes are then applied to the output table afterward. That
approach has the problem that if changes happen faster than you can
apply them, the operation never terminates. But enough people seem to
be happy with this kind of solution that we should perhaps look
harder at doing something along these lines in core.
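
For what it's worth, the shape of that catch-up phase, as I
understand it from pg_repack, is roughly the loop below. Every
function here is a made-up stub; the extension really does this with
triggers writing to a log table and SQL applying it:

#include <stdio.h>

static long pending = 1000000;  /* simulated change-log backlog */

static long
apply_captured_changes(void)
{
    long    applied = pending;

    pending /= 4;               /* new writes partially refill the log */
    return applied - pending;
}

static void lock_table_exclusively(void)
{
    puts("LOCK TABLE ... IN ACCESS EXCLUSIVE MODE");
}

static void swap_table_files(void)
{
    puts("swap relfilenodes, fix catalog entries");
}

/* Catch-up loop: drain the change log until the backlog is small,
 * then take the lock, drain the remainder, and swap the new file in.
 * If writers refill the log faster than we drain it, the backlog
 * never drops below the threshold -- the non-termination problem
 * mentioned above. */
int
main(void)
{
    while (pending > 100)
        (void) apply_captured_changes();

    lock_table_exclusively();
    (void) apply_captured_changes();    /* final drain under lock */
    swap_table_files();
    return 0;
}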

--
Robert Haas
EDB: http://www.enterprisedb.com
