Re: POC: Cleaning up orphaned files using undo logs

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-24 12:21:30
Message-ID: CA+TgmoYhDt=nQeshk_pOJvhvsgfXdxLD3+g2+__FVtU_Yb4x7Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 23, 2019 at 2:04 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> 2. Strict self-update-only: We could update it as part of
> transaction cleanup. That is, when you commit or abort, probably at
> some time when your transaction is still advertised as running, you go
> and update your own transaction header with the size. If you
> never reach that stage, I think you can fix it during crash recovery,
> during the startup scan that feeds the rollback request queues. That
> is, if you encounter a transaction header with length == 0, it must be
> the final one and its length is therefore known and can be updated,
> before you allow new transactions to begin. There are some
> complications like backends that exit without crashing, which I
> haven't thought about. As Amit just pointed out to me, that means
> that the update is not folded into the same buffer access as the next
> transaction, but perhaps you can mitigate that by not updating your
> header if the next header will be on the same page -- the next
> transaction can do it safely then (this page with the insert pointer
> on it can't be discarded). As Dilip just pointed out to me, it means
> that you always do an update that you might never need to do if
> the transaction is discarded, to which I have no answer. Bleugh.
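
For concreteness, the startup-scan fixup described above might look
roughly like this. This is only a sketch of the idea: UndoLogControl,
UndoXactHeader, ReadXactHeader, and WriteXactHeader are names I'm
making up for illustration, not anything in the patch.

static void
FixupLastXactHeader(UndoLogControl *log)
{
    UndoXactHeader *hdr = ReadXactHeader(log, log->last_xact_start);

    /*
     * length == 0 means the owning transaction never performed its
     * commit/abort-time self-update, so it must be the final
     * transaction in this log: everything from its header up to the
     * insert pointer belongs to it.
     */
    if (hdr->xact_size == 0)
    {
        hdr->xact_size = log->insert - log->last_xact_start;
        WriteXactHeader(log, log->last_xact_start, hdr);
    }
}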

Andres and I have spent a lot of time on the phone over the last
couple of days and I think we both kind of like this option. I don't
think that the costs are likely to be very significant: you're talking
about pinning, locking, dirtying, unlocking, and unpinning one buffer
at commit time, or maybe two if your transaction touched both logged
and unlogged tables. If the transaction is short enough for that
overhead to matter, that buffer is probably already in shared_buffers,
is probably already dirty, and is probably already in your CPU's
cache. So I think the overhead will turn out to be low.
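
In rough code, the commit-time step would be something like the sketch
below. ReadBuffer, LockBuffer, MarkBufferDirty, and UnlockReleaseBuffer
are the ordinary bufmgr calls; undo_relation, UndoRecPtr, the accessor
macros, and the header layout are invented here for illustration, and
WAL-logging is omitted.

static void
UpdateOwnUndoXactHeader(UndoRecPtr hdr_ptr, uint32 xact_size)
{
    Buffer      buf;
    Page        page;
    UndoXactHeader *hdr;

    /* The single pin/lock/dirty/unlock cycle discussed above. */
    buf = ReadBuffer(undo_relation, UndoRecPtrGetBlockNumber(hdr_ptr));
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    page = BufferGetPage(buf);
    hdr = (UndoXactHeader *) (page + UndoRecPtrGetPageOffset(hdr_ptr));

    Assert(hdr->xact_size == 0);    /* we only ever fill this in once */
    hdr->xact_size = xact_size;

    MarkBufferDirty(buf);
    /* WAL-logging of this change is omitted from the sketch. */
    UnlockReleaseBuffer(buf);
}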

Moreover, I doubt that we want to discard every transaction separately
anyway. If you have very lightweight transactions, you don't want to
add an extra WAL record per transaction: increasing the number of
separate WAL records per transaction from, say, 5 to 6 would be a
significant additional cost. You probably want to perform a discard,
say, every 5 seconds, or sooner if you can discard at least 64kB of
undo, or something of that sort. So we're not going to save the
overhead of updating the previous transaction header often enough to
make much difference unless we're discarding so aggressively that we
incur a much larger overhead elsewhere. I think.
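
As a sketch of what I mean by a batched discard policy (the thresholds
here are illustrative, not a proposal for the actual numbers), the
discard worker's decision could be as simple as:

#define DISCARD_NAP_SECONDS 5
#define DISCARD_MIN_BYTES   (64 * 1024)

static bool
ShouldDiscardNow(TimestampTz last_discard, uint64 discardable_bytes)
{
    /* Discard early if enough undo has accumulated ... */
    if (discardable_bytes >= DISCARD_MIN_BYTES)
        return true;

    /* ... and otherwise at most once per nap interval. */
    return TimestampDifferenceExceeds(last_discard,
                                      GetCurrentTimestamp(),
                                      DISCARD_NAP_SECONDS * 1000);
}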

I am a little concerned about the backends that exit without crashing.
Andres seems to want to treat that case as a bug to be fixed, but I
doubt whether that's going to be practical. We're really only
talking about extreme corner cases here, because
before_shmem_exit(ShutdownPostgres, 0) means we'll call
AbortOutOfAnyTransaction(), which should RecordTransactionAbort(). Only
if we fail in AbortTransaction() prior to reaching
RecordTransactionAbort() will we manage to reach the later cleanup
stages without having written an abort record. I haven't scrutinized
that code lately to see exactly how things can go wrong there, but
there shouldn't be a whole lot. However, there are probably a few
things, like maybe a really poorly-timed malloc() failure.

A zero-order solution would be to install a deadman switch. At
on_shmem_exit time, you must detach from any undo log to which you are
connected, so that somebody else can attach to it later. We can stick
in a cross-check there that you haven't written any undo bytes to that
log and PANIC if you have. Then the system must be water-tight.
Perhaps it's possible to do better: if we could identify the cases in
which such logic gets reached, we could try to guarantee that WAL is
written and the undo log safely detached before we get there. But at
the very least we can promote ERROR/FATAL to PANIC in the relevant
case.
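
The cross-check itself could be as simple as the following sketch, in
an on_shmem_exit callback. Everything undo-specific here (MyUndoLogSlot
and its fields, DetachUndoLog) is invented for illustration.

static void
UndoLogDetachOnExit(int code, Datum arg)
{
    if (MyUndoLogSlot == NULL)
        return;                     /* never attached; nothing to do */

    /*
     * Undo written by a transaction whose commit/abort was never
     * recorded means the log is in an unknown state; PANIC rather
     * than let another backend attach to it later.
     */
    if (MyUndoLogSlot->xact_undo_bytes != 0)
        elog(PANIC, "backend exiting with undo bytes written to log %u",
             MyUndoLogSlot->logno);

    DetachUndoLog(MyUndoLogSlot);
    MyUndoLogSlot = NULL;
}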

A better solution would be to detect the problem and make sure we
recover from it before reusing the undo log. Suppose each undo log
has three states: (1) nobody's attached, (2) somebody's attached, and
(3) nobody's attached but the last record might need a fixup. When we
start up, all undo logs are in state 3, and the discard worker runs
around and puts them into state 1. Subsequently, they alternate
between states 1 and 2 for as long as the system remains up. But if,
as an exceptional case, we reach on_shmem_exit without having detached
the undo log because of cascading failures, then we put the undo log
in state 3. The discard worker already knows how to move undo logs
from state 3 to state 1, and it can do the same thing here. Until it
does, nobody else can reuse that undo log.
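
In code, the lifecycle might be no more than an enum plus the
transitions below; the names are illustrative only.

typedef enum UndoLogAttachState
{
    UNDOLOG_FREE,           /* (1) nobody attached; safe to attach */
    UNDOLOG_ATTACHED,       /* (2) a backend is attached and writing */
    UNDOLOG_NEEDS_FIXUP     /* (3) nobody attached, but the final
                             * transaction header may still have
                             * length == 0 */
} UndoLogAttachState;

/*
 * Transitions:
 *   startup:         all logs -> UNDOLOG_NEEDS_FIXUP
 *   discard worker:  UNDOLOG_NEEDS_FIXUP -> UNDOLOG_FREE
 *                    (after fixing up the last transaction header)
 *   backend attach:  UNDOLOG_FREE -> UNDOLOG_ATTACHED
 *   clean detach:    UNDOLOG_ATTACHED -> UNDOLOG_FREE
 *   failed detach:   UNDOLOG_ATTACHED -> UNDOLOG_NEEDS_FIXUP
 */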

I might be missing something, but I think that would nail this down
pretty tightly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
