Re: [PATCHES] Non-transactional pg_class, try 2

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-26 18:21:07
Message-ID: 20060626182107.GE8999@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches

[Resending: apparently the previous one to the list was eaten by spam
filters or something. Changing SMTP relay again ... ]

Ok, let's step back to discuss this again. Sorry for the length -- this
is a description of the problem I'm trying to solve, the issues I found,
and how I tried to solve them.

The relminxid Patch
===================

What I'm after is not freezing for read-only media, nor archive, nor
read-only tables. What I'm after is removing the requirement that all
databases must be vacuumed wholly every 2 billion transactions.

Now, why do we need to vacuum whole databases at a time?

The Transaction Id Counter
==========================

We know that the Xid counter is weird; it cycles, for starters, and also
there are special values at the "start" of the cycle that are lesser
than all other values (BootstrapXid, FrozenXid). The idea here is to
allow the counter to wrap around and old tuples not be affected, i.e.,
appear like they were committed in some distant past.

So we use the special Xid values to mark special stuff, like tuples
created by the bootstrap processing (which are always known to be good)
or tuples in template databases that are not connectable ("frozen"
databases). We also use FrozenXid to mark tuples that are very old,
i.e. were committed a long time ago and never deleted. Any such tuple
is unaffected by the status of the Xid counter.

It should be clear that we must ensure that after a suitable amount of
"time" (measured in advancement of the Xid counter) has passed, we
should change the old Xids in tuples to the special FrozenXid value.
The requirement for whole-database vacuuming is there because we
need to ensure that this is done in all the tables in the database.

We keep track of a "minimum Xid", call it minxid. The Xid generator
refuses to assign a new Xid counter if this minxid is too far in the
past, because we'd risk causing Xid-wraparound data loss if we did; the
Xid comparison semantics would start behaving funny, and some tuples
that appeared to be alive not many transactions ago now suddenly appear
dead. Clearly, it's important that before we advance this minxid we
ensure that all tables in the database have been under the process of
changing all regular Xids into FrozenXid.

Currently the only way to ensure that all tables have gone through this
process is processing them in a single VACUUM pass. Skip even one
table, and you can forget about advancing the minxid. Even if the
skipped table was vacuumed in the transaction just before this one.
Even if the table is fully frozen, i.e., all tables on it are marked
with FrozenXid. Even if the table is empty.

Tracking minxid Per Table
=========================

So, my idea is to track this minxid per table. To do this, I added a
column to pg_class called relminxid. The minimum of it across a
database is used to determine each database's minimum, datminxid. The
minimum of all databases is used to advance the global minimum Xid
counter.

So, if a table has 3 tuples whose Xmins are 42, 512 and FrozenXid, the
relminxid is 42. If we keep track of all these religiously during
vacuum, we know exactly what is the minxid we should apply to this
particular table.

It is obvious that vacuuming one table can set the minimum for that
table. So when the vacuuming is done, we can recalculate the database
minimum; and using the minima of all databases, we can advance the
global minimum Xid counter and truncate pg_clog. We can do this on each
single-table vacuum -- so, no more need for database-wide vacuuming.

If a table is empty, or all tuples on it are frozen, then we must mark
the table with relminxid = RecentXmin. This is because there could be
an open transaction that writes a new tuple to the table after the
vacuum is finished. A newly created table must also be created with
relminxid = RecentXid. Because of this, we never mark a table with
relminxid = FrozenXid.

Template Databases
==================

Up to this point everything is relatively simple. Here is where the
strange problems appear. The main issue is template databases.

Why are template databases special? Because they are never vacuumed.
More generally, we assume that every database that is marked as
"datallowconn = false" is fully frozen, i.e. all tables on it are
frozen. Autovacuum skips them. VACUUM ignores them. The minxid
calculations ignore them. They are fully frozen so they don't matter
and they don't harm anybody.

That's fine and dandy until you realize what happens when you freeze a
database, let a couple billion transactions pass, and then create a
database using that as a template (or just "reallow connections" to a
database). Because all the tables were frozen 2 billion transaction
ago, they are marked with an old relminxid, so as soon as you vacuum any
table, the minxid computations went to hell, and we have a DoS
condition.

So, we have to do something to cope with frozen databases. I see two
ways:

1. Remove the special case, i.e., process frozen databases in VACUUM
like every other database.
This is the easiest, because no extra logic is needed. Just make
sure they are vacuumed in time. The only problem would be that we'd
need to uselessly vacuum tables that we know are frozen, from time to
time. But then, those tables are probably small, so what's the
problem with that?

2. Mark frozen databases specially somehow.
To mark databases frozen, we need a way to mark tables as frozen.
How do we do that? As I explain below, this allows some nice
optimizations, but it's a very tiny can full of a huge amount of
worms.

Marking a Table Frozen
======================

Marking a table frozen is simple as setting relminxid = FrozenXid for a
table. As explained above, this cannot be done in a regular postmaster
environment, because a concurrent transaction could be doing nasty stuff
to a table. So we can do it only in a standalone backend.

On the other hand, a "frozen" table must be marked with relminxid =
a-regular-Xid as soon as a transaction writes some tuples on it. Note
that this "unfreezing" must take place even if the offending transaction
is aborted, because the Xid is written in the table nevertheless and
thus it would be incorrect to lose the unfreezing.

This is how pg_class_nt came into existence -- it would be a place where
information about a table would be stored and not subject to the rolling
back of the transaction that wrote it. So if you find that a table is
frozen, you write an unfreezing into its pg_class_nt tuple, and that's
it.

Nice optimization: if we detect that a table is fully frozen, then
VACUUM is a no-op (not VACUUM FULL), because by definition there are no
tuples to remove.

Another optimization: if we are sure that unfreezing works, we can even
mark a table as frozen in a postmaster environment, as long as we take
an ExclusiveLock on the table. Thus we know that the vacuum is the sole
transaction concurrently accessing the table; and if another transaction
comes about and writes something after we're finished, it'll correctly
unfreeze the table and all is well.

Where are the problems in this approach?

1. Performance. We'll need to keep a cache of pg_class_nt tuples. This
cache must be independent of the current relcache, because the relcache
is properly transactional while the pg_class_nt cache must not be.

2. The current implementation puts the unfreezing in LockRelation. This
is a problem, because any user can cause a LockRelation on any table,
even if the user does not have access to that table.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2006-06-26 18:25:45 Re: ERROR: invalid page header in block
Previous Message Simon Riggs 2006-06-26 18:17:35 Re: "Truncated" tuples for tuple hash tables

Browse pgsql-patches by date

  From Date Subject
Next Message Bruce Momjian 2006-06-26 18:52:50 Re: Overhead for stats_command_string et al, take 2
Previous Message Tom Lane 2006-06-26 18:16:23 Re: [HACKERS] Overhead for stats_command_string et al, take 2