From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | pg_serial bloat |
Date: | 2023-12-14 20:53:36 |
Message-ID: | CA+hUKG+HQhPqZMOYpKJ18BD0ERqO7XDovqFzu293fB1ePQ3tzA@mail.gmail.com |
Lists: | pgsql-hackers |
Hi,
Our pg_serial truncation logic is a bit broken, as described by the
comments in CheckPointPredicate() (a sort of race between xid cycles
and checkpointing). We've seen a system with ~30GB of files in there
(note: full/untruncated would be 2³² xids × sizeof(uint64_t) =
32GB). It's not just a gradual disk space leak: according to disk
space monitoring, this system suddenly wrote ~half of that data, which
I think must be the while loop in SerialAdd() zeroing out pages.
Ouch.
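As a sanity check on the 32GB figure, here is a minimal sketch of the arithmetic. The page size (8192 bytes) and pages per segment file (32) are assumptions based on a stock build and are not stated in this mail; the function names are made up for illustration and do not exist in predicate.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults of a stock build; not taken from this mail. */
#define BLCKSZ 8192                  /* bytes per SLRU page */
#define PAGES_PER_SEGMENT 32         /* pages per SLRU segment file */
#define ENTRY_SIZE sizeof(uint64_t)  /* one 8-byte entry per xid */

/* Total bytes if every one of the 2^32 xids has an entry on disk. */
static uint64_t
serial_full_bytes(void)
{
    return (UINT64_C(1) << 32) * ENTRY_SIZE;
}

/* Number of segment files needed to hold that many bytes. */
static uint64_t
serial_full_segments(void)
{
    return serial_full_bytes() / ((uint64_t) BLCKSZ * PAGES_PER_SEGMENT);
}

/* Segment number that would hold the entry for a given xid. */
static uint32_t
serial_segment_for_xid(uint32_t xid)
{
    uint32_t entries_per_page = (uint32_t) (BLCKSZ / ENTRY_SIZE);
    uint32_t segno = xid / entries_per_page / PAGES_PER_SEGMENT;

    /* With the assumed constants the highest segment is 0x1FFFF. */
    assert(segno <= 0x1FFFF);
    return segno;
}
```

Under those assumptions a completely full directory holds 2^17 segment files of 256kB each, named 0000 through 1FFFF.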
I see a few questions:
1. How should we fix this fundamentally in future releases? One
answer is to key SSI's xid lookup with FullTransactionId (conceptually
cleaner IMHO but I'm not sure how far fxids need to 'spread' through
the system to do it right). Another already mentioned in comments is
to move some logic into vacuum so it can stay in sync with the xid
cycle (maybe harder to think about and prove correct).
2. Could there be worse consequences than wasted disk and I/O?
3. Once a system reaches a bloated state like this, what can an
administrator do?
I looked into question 3. I convinced myself that it must be safe to
unlink all the files under pg_serial while the cluster is down,
because:
* we don't need the data across restarts, it's just for spilling
* we don't need the 'head' file because slru.c opens with O_CREAT
* open(O_CREAT) followed by pwrite(..., offset) will create a harmless hole
* we never read files outside the tailXid/headXid range we've written
* we zero out pages as we add them in SerialAdd(), without reading
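The O_CREAT and pwrite() points above can be checked outside PostgreSQL with a small standalone POSIX sketch. This is not slru.c's code; the helper names and the 8192-byte page size are assumptions for illustration only.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_PAGE_SZ 8192   /* assumed page size for the demo */

/*
 * Create "path" if it is missing and write one page of 0xAB bytes at page
 * number "pageno".  Returns the resulting file size, or -1 on error.
 */
static off_t
write_page_at(const char *path, int pageno)
{
    char        page[DEMO_PAGE_SZ];
    struct stat st;
    off_t       size = -1;
    int         fd;

    assert(pageno >= 0);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    memset(page, 0xAB, sizeof(page));
    if (pwrite(fd, page, sizeof(page), (off_t) pageno * DEMO_PAGE_SZ) ==
        (ssize_t) sizeof(page) && fstat(fd, &st) == 0)
        size = st.st_size;
    close(fd);
    return size;
}

/*
 * Returns 1 if page "pageno" reads back as all zeroes, 0 if it contains
 * any non-zero byte, or -1 on error or short read.
 */
static int
page_is_zero(const char *path, int pageno)
{
    char        page[DEMO_PAGE_SZ];
    ssize_t     n;
    size_t      i;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = pread(fd, page, sizeof(page), (off_t) pageno * DEMO_PAGE_SZ);
    close(fd);
    if (n != (ssize_t) sizeof(page))
        return -1;
    for (i = 0; i < sizeof(page); i++)
        if (page[i] != 0)
            return 0;
    return 1;
}
```

Writing page 5 of a file that does not exist yet produces a six-page file whose first five pages read back as zeroes, which is the harmless hole described in the list.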
If I have that right, perhaps we should not merely advise that it is
safe to do that manually, but proactively do it in SerialInit(). That
is where we establish in shared memory that we don't expect there to
be any files on disk, so it must be a good spot to make that true if
it is not:
    if (!found)
    {
        /*
         * Set control information to reflect empty SLRU.
         */
        serialControl->headPage = -1;
        serialControl->headXid = InvalidTransactionId;
        serialControl->tailXid = InvalidTransactionId;
+
+       /* Also delete any files on disk. */
+       SlruScanDirectory(SerialSlruCtl, SlruScanDirCbDeleteAll, NULL);
    }
In common cases that would just readdir() an empty directory.
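For readers who do not have slru.c in their heads, the effect of such a scan-and-delete pass can be approximated with a standalone POSIX sketch. It is only an analogue, not the SlruScanDirectory() implementation: the function names are hypothetical, and the file-name pattern (4 to 6 upper-case hex digits) is an assumption about how segment files are named.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if "name" looks like a segment file (assumed pattern). */
static int
looks_like_segment(const char *name)
{
    size_t      len = strlen(name);

    return len >= 4 && len <= 6 &&
        strspn(name, "0123456789ABCDEF") == len;
}

/*
 * Unlink every regular file in "dir" whose name looks like a segment.
 * Returns the number of files removed, or -1 if "dir" cannot be opened.
 */
static int
delete_all_segments(const char *dir)
{
    DIR        *d = opendir(dir);
    struct dirent *de;
    struct stat st;
    char        path[1024];
    int         removed = 0;

    if (d == NULL)
        return -1;
    while ((de = readdir(d)) != NULL)
    {
        int         n;

        if (!looks_like_segment(de->d_name))
            continue;
        n = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        assert(n > 0 && (size_t) n < sizeof(path));
        /* Skip anything that is not a plain file. */
        if (stat(path, &st) != 0 || !S_ISREG(st.st_mode))
            continue;
        if (unlink(path) == 0)
            removed++;
    }
    closedir(d);
    return removed;
}
```

On an already empty directory the loop body never runs and the call returns 0, which corresponds to the common case of just reading an empty directory.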
For testing, it is quite hard to convince predicate.c to write any
files there: normally you have to overflow its transaction tracking,
which requires more than (max backends + max prepared xacts) × 10
SERIALIZABLE transactions in just the right sort of overlapping
pattern, so that the committed ones need to be spilled to disk. I
might try to write a test for that, but it gets easier if you define
TEST_SUMMARIZE_SERIAL. Then you don't need many transactions -- but
you still need a slightly finicky schedule. Start with a couple of
overlapping SSI transactions, then commit them, to get a non-empty
FinishedSerializableTransaction list. Then create some more SSI
transactions, which will call SerialAdd() due to the TEST_ macro.
Then run a checkpoint, and you should see eg "0000" being created on
demand during SLRU writeback, demonstrating that starting from an
empty pg_serial directory is always OK. I wanted to try that to
remind myself of how it all works, but I suppose it should be obvious
that it's OK: initdb's initial state is an empty directory.
To create a bunch of junk files that are really just hard links for
the above change to unlink, or to test the truncate code when it sees
a 'full' directory, you can do:
    cd pg_serial
    dd if=/dev/zero of=0000 bs=256k count=1
    awk 'BEGIN { for (i = 1; i <= 131071; i++) { printf("%04X\n", i); } }' \
      | xargs -r -I {} ln 0000 {}