Re: 64 bit transaction id

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Павел Ерёмин <shnoor111gmail(at)yandex(dot)ru>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 64 bit transaction id
Date: 2019-11-03 19:15:22
Message-ID: 20191103191522.jkrv6eik73r3ulzo@development
Lists: pgsql-hackers

On Sun, Nov 03, 2019 at 02:17:15PM +0300, Павел Ерёмин wrote:
> I completely agree with all of the above. Therefore, the proposed
> mechanism may entail larger improvements (and not only VACUUM).

I think the best thing you can do is try implementing this ...

I'm afraid the "improvements" essentially mean making various important
parts of the system much more complicated and expensive. There's a
trade-off between saving 8B per row and the additional overhead (during
vacuum etc.), and it does not seem like a winning strategy. What started
as "we can simply look at the next row version" is clearly way more
complicated and expensive.

The trouble here is that it adds a dependency between pages in the data
file. That means, for example, that cleaning up one page may require
modifying another page that would otherwise have stayed read-only in
that checkpoint interval. That's essentially write amplification, and it
may significantly increase the amount of WAL by generating a full-page
write (FPW) for the other page.
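
To make the write amplification concrete, here's a minimal sketch (plain
C, with entirely hypothetical names - this is nothing like the actual
PostgreSQL code) of how touching a second, otherwise-clean page after a
checkpoint doubles the full-page-write volume:

#include <stdbool.h>
#include <stdio.h>

#define FPW_SIZE 8192               /* assume 8kB full-page images */

typedef struct Page
{
    bool dirty;                     /* modified since last checkpoint? */
} Page;

static long wal_bytes = 0;

static void mark_dirty(Page *p)
{
    /* first modification after a checkpoint emits a full-page image */
    if (!p->dirty)
    {
        p->dirty = true;
        wal_bytes += FPW_SIZE;
    }
}

static void prune_tuple(Page *own, Page *successor_page)
{
    mark_dirty(own);                /* unavoidable: the tuple lives here */
    if (successor_page)
        mark_dirty(successor_page); /* the extra write the dependency forces */
}

int main(void)
{
    Page a = {false}, b = {false};

    prune_tuple(&a, &b);            /* cross-page link: two FPWs, not one */
    printf("WAL written: %ld bytes\n", wal_bytes);
    return 0;
}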

> I can offer the following solution.
> For VACUUM, create a hash table.
> When VACUUM, scanning the table, sees that a version (tuple1) has
> t_ctid filled in and pointing at the address of tuple2, it creates a
> structure recording tuple1's address, tuple1.xid, tuple1's length (and
> whatever other information is needed), and puts that structure into
> the hash table keyed by tuple2's address.
> When VACUUM reaches tuple2, it checks tuple2's address against the
> hash table - if it finds a match, it evaluates the link between the
> two versions and decides whether to clean up.
>
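
For the sake of argument, here's roughly what I understand the proposal
to be, as a C sketch - all names and structures are made up for
illustration, and the real VACUUM code looks nothing like this:

#include <stdint.h>
#include <stdlib.h>

typedef struct TupleAddr
{
    uint32_t block;
    uint16_t offset;
} TupleAddr;

typedef struct PendingEntry
{
    TupleAddr successor;    /* key: where tuple1's t_ctid points (tuple2) */
    TupleAddr predecessor;  /* tuple1's own address */
    uint32_t  xid;          /* tuple1's xid */
    uint16_t  len;          /* tuple1's length */
    struct PendingEntry *next;
} PendingEntry;

#define NBUCKETS 1024
static PendingEntry *buckets[NBUCKETS];

static unsigned hash_addr(TupleAddr a)
{
    return (a.block * 31u + a.offset) % NBUCKETS;
}

/* pass 1: tuple1 has t_ctid pointing at tuple2 - remember the link */
static void remember_link(TupleAddr t1, TupleAddr t2,
                          uint32_t xid, uint16_t len)
{
    PendingEntry *e = malloc(sizeof(*e));
    unsigned h = hash_addr(t2);

    e->successor = t2;
    e->predecessor = t1;
    e->xid = xid;
    e->len = len;
    e->next = buckets[h];
    buckets[h] = e;
}

/* pass 2: VACUUM reaches tuple2 and asks who pointed at it */
static PendingEntry *lookup_predecessor(TupleAddr t2)
{
    PendingEntry *e;

    for (e = buckets[hash_addr(t2)]; e != NULL; e = e->next)
        if (e->successor.block == t2.block &&
            e->successor.offset == t2.offset)
            return e;
    return NULL;
}

int main(void)
{
    TupleAddr t1 = {7, 3}, t2 = {9, 1};

    remember_link(t1, t2, 1234, 40);
    return lookup_predecessor(t2) ? 0 : 1;
}

Note the hash table has to cover every forward link seen during the
scan, so its size is proportional to the number of updated rows - which
is exactly the memory problem we already fight with dead-TID tracking.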

We know VACUUM is already pretty expensive, so making it even more
expensive seems pretty awful - and the proposed solution seems damn
expensive. We already do something similar for indexes - we track
pointers to removed rows so that we can remove the matching entries from
indexes. And it's damn expensive, because we don't know where in the
index those entries are - so we have to scan the whole index.

This would mean doing the same thing for the table, because we don't
know where in the table the older versions of those rows are. That seems
mighty expensive.
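
To illustrate the analogy, this is a rough sketch (made-up names, not
the actual index vacuum code) of why the index case is expensive: we
know which heap TIDs died, but not where they sit in the index, so every
index entry has to be visited:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct IndexEntry
{
    uint64_t heap_tid;      /* pointer into the table */
} IndexEntry;

/* dead TIDs collected by the heap scan, kept sorted for binary search */
static bool tid_is_dead(const uint64_t *dead, size_t ndead, uint64_t tid)
{
    size_t lo = 0, hi = ndead;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (dead[mid] < tid)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < ndead && dead[lo] == tid;
}

/*
 * We can't jump to the entries of interest, so the whole index gets
 * scanned.  The proposal implies the same kind of full pass over the
 * table to find predecessor versions.
 */
static size_t remove_dead_entries(IndexEntry *ix, size_t n,
                                  const uint64_t *dead, size_t ndead)
{
    size_t keep = 0, i;

    for (i = 0; i < n; i++)
        if (!tid_is_dead(dead, ndead, ix[i].heap_tid))
            ix[keep++] = ix[i];
    return keep;
}

int main(void)
{
    IndexEntry ix[] = {{10}, {20}, {30}};
    uint64_t dead[] = {20};

    return remove_dead_entries(ix, 3, dead, 1) == 2 ? 0 : 1;
}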

Not to mention that this does nothing for page-level vacuum, which we
do when trying to fit another row onto a page (e.g. for HOT). That has
to be absolutely cheap - we are certainly not going to look up other
pages or hunt for older versions of the row there, and so on.

Being able to make visibility decisions based on the tuple alone (or
possibly page-level + tuple information) has a lot of value, and I don't
think we want to make this any more complicated.
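
For comparison, here's a deliberately oversimplified sketch (made-up
names and types, ignoring wraparound, hint bits, and everything else the
real visibility code has to deal with) of why tuple-only visibility is
cheap - everything the decision needs fits in the tuple header:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId 0

typedef struct TupleHeader
{
    TransactionId xmin;     /* inserting transaction */
    TransactionId xmax;     /* deleting transaction, or invalid */
} TupleHeader;

/*
 * Dead-to-everyone test using nothing but the tuple header and the
 * oldest xid that might still be running.  No other page is touched,
 * which is what makes page-level pruning cheap.
 */
static bool tuple_is_dead(const TupleHeader *tup, TransactionId oldest_xmin)
{
    return tup->xmax != InvalidTransactionId && tup->xmax < oldest_xmin;
}

int main(void)
{
    TupleHeader tup = {100, 150};

    return tuple_is_dead(&tup, 200) ? 0 : 1;
}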

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
