Re: POC: Cleaning up orphaned files using undo logs

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-17 16:05:21
Message-ID: CA+TgmoahQqvWC7Vyp-MHHU7Z_qg1LqtB2Q=E0nd0Q60pXktnZg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > Again, I think it's not ok to just assume you can lock an essentially
> > > unbounded number of buffers. This seems almost guaranteed to result in
> > > deadlocks. And there are limits on how many lwlocks one can hold, etc.
> >
> > I think for controlling that we need to put a limit on max prepared
> > undo? I am not sure of any other way of limiting the number of
> > buffers, because we must lock all the buffers in which we are going
> > to insert the undo record under one WAL-logged operation.
>
> I heard that a number of times. But I still don't know why that'd
> actually be true. Why would it not be sufficient to just lock the buffer
> currently being written to, rather than all buffers? It'd require a bit
> of care updating the official current "logical end" of a log, but
> otherwise ought to not be particularly hard? Only one backend can extend
> the log after all, and until the log is externally visibly extended,
> nobody can read or write those buffers, no?

Well, I don't understand why you're on about this. We've discussed it
a number of times but I'm still confused. I'll repeat my previous
arguments on-list:

1. It's absolutely fine to just put a limit on this, because the
higher-level facilities that use this shouldn't be doing a single
WAL-logged operation that touches a zillion buffers. We have been
careful to avoid having WAL-logged operations touch an unbounded
number of buffers in plenty of other places, like the btree code, and
we are going to have to be similarly careful here for multiple
reasons, deadlock avoidance being one. So, saying, "hey, you're going
to lock an unlimited number of buffers" is a straw man. We aren't.
We can't.
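
To make that concrete, here's a rough sketch (hypothetical names, not
code from the actual patch) of how the undo-prepare path could enforce
a small, fixed cap on the number of buffers touched by one WAL-logged
operation:

    /* Hypothetical cap; the real patch would choose an appropriate value. */
    #define MAX_UNDO_BUFFERS_PER_XLOG   4

    typedef struct UndoPrepareContext
    {
        Buffer  bufs[MAX_UNDO_BUFFERS_PER_XLOG];
        int     nbufs;
    } UndoPrepareContext;

    static void
    undo_remember_buffer(UndoPrepareContext *ctx, Buffer buf)
    {
        /* Refuse to prepare an undo insertion that needs too many buffers. */
        if (ctx->nbufs >= MAX_UNDO_BUFFERS_PER_XLOG)
            elog(ERROR, "undo insertion would touch too many buffers");
        ctx->bufs[ctx->nbufs++] = buf;
    }

The point is just that the limit is a hard, known constant, not
something unbounded that the WAL machinery has to cope with.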

2. The write-ahead logging protocol says that you're supposed to lock
all the buffers at once. See src/backend/access/transam/README. If
you want to go patch that file, then this patch can follow whatever
the locking rules in the patched version are. But until then, the
patch should follow *the actual rules*, not some other protocol based
on a hand-wavy explanation in an email someplace. Otherwise, you've
got the same sort of undocumented disaster-waiting-to-happen that you
keep complaining about in other parts of this patch. We need fewer of
those, not more!
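
For reference, the pattern that README describes looks roughly like
this (a sketch only; the rmgr and info constants are made up, and the
undo-specific page manipulation is elided):

    XLogRecPtr  recptr;

    /* Buffers are already read and pinned; exclusive-lock all of them. */
    LockBuffer(buf1, BUFFER_LOCK_EXCLUSIVE);
    LockBuffer(buf2, BUFFER_LOCK_EXCLUSIVE);

    START_CRIT_SECTION();

    /* ... apply the changes to both pages ... */
    MarkBufferDirty(buf1);
    MarkBufferDirty(buf2);

    XLogBeginInsert();
    XLogRegisterBuffer(0, buf1, REGBUF_STANDARD);
    XLogRegisterBuffer(1, buf2, REGBUF_STANDARD);
    recptr = XLogInsert(RM_UNDO_ID, XLOG_UNDO_INSERT);   /* hypothetical IDs */
    PageSetLSN(BufferGetPage(buf1), recptr);
    PageSetLSN(BufferGetPage(buf2), recptr);

    END_CRIT_SECTION();

    UnlockReleaseBuffer(buf1);
    UnlockReleaseBuffer(buf2);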

3. There is no reason to care about all of the buffers being locked at
once, because they are not unlimited in number (see point #1) and
nobody else is looking at them anyway (see the final sentence of what
I quoted above).

I think we are, or ought to be, talking about locking 2 (or maybe in
rare cases 3 or 4) undo buffers in connection with a single WAL
record. If we're talking about more than that, then I think the
higher-level code needs to be changed. If we're talking about that
many, then we don't need to be clever. We can just do the standard
thing that the rest of the system does, and it will be fine just like
it is everywhere else.

> > Suppose you insert one undo record for a transaction, and it is
> > split across blocks 1 and 2. Now, before those blocks actually go
> > to disk, the transaction commits and becomes all-visible, and the
> > undo logs are discarded. It's possible that block 1 is completely
> > discarded but block 2 is not, because it might hold undo for the
> > next transaction. Now, during recovery (with FPW off), block 1 is
> > missing but block 2 is there, so we need to skip inserting undo
> > into block 1 as it does not exist.
>
> Hm. I'm quite doubtful this is a good idea. How will this not force us
> to emit a lot more expensive durable operations while writing undo?
> And doesn't this reduce error detection quite remarkably?
>
> Thomas, Robert?

I think you're going to need to spell out your assumptions in order
for me to be able to comment intelligently. This is another thing
that seems pretty normal to me. Generally, WAL replay might need to
recreate objects whose creation is not separately WAL-logged, and it
might need to skip operations on objects that have been dropped later
in the WAL stream and thus don't exist any more. This seems like an
instance of the latter pattern. There's no reason to try to put valid
data into pages that we know have been discarded, and both inserting
and discarding undo data need to be logged anyway.
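
In concrete terms, I'd expect the undo redo routine to look like every
other redo routine that has to cope with blocks that no longer exist,
something along these lines (a sketch, not the actual patch code):

    static void
    undo_xlog_insert(XLogReaderState *record)
    {
        Buffer      buf;

        if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
        {
            /* ... reapply the undo data and set the page LSN ... */
            MarkBufferDirty(buf);
        }

        /*
         * BLK_NOTFOUND means the block was discarded later in the WAL
         * stream; there is nothing to replay into, so we just skip it.
         */
        if (BufferIsValid(buf))
            UnlockReleaseBuffer(buf);
    }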

As a general point, I think the hope is that undo generated by
short-running transactions that commit and become all-visible quickly
will be cheap. We should be able to dirty shared buffers but then
discard the data without ever writing it out to disk if we've logged a
discard of that data. Obviously, if you've got long-running
transactions that are either generating undo or holding old snapshots,
you're actually going to have to flush the data, but we want to avoid
that when we can. And the same is true on the standby: even if we
write the dirty data into shared buffers instead of skipping the write
altogether, we hope to be able to forget about those buffers when we
encounter a discard record before the next checkpoint.
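
I'm hand-waving about the mechanism here, but the idea is roughly this
(pure sketch; forget_undo_buffer is an assumed helper, not something
that exists today):

    /*
     * Sketch: once a discard record is durably logged, any buffers covering
     * the discarded range can simply be forgotten rather than flushed,
     * because their contents can never be looked at again.
     */
    static void
    undo_apply_discard(UndoLogNumber logno,
                       BlockNumber old_discard_blk,
                       BlockNumber new_discard_blk)
    {
        BlockNumber blk;

        for (blk = old_discard_blk; blk < new_discard_blk; blk++)
            forget_undo_buffer(logno, blk);  /* drop from shared_buffers, no write */
    }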

One idea we could consider, if it makes the code sufficiently simpler
and doesn't cost too much performance, is to remove the facility for
skipping over bytes to be written and instead write any bytes that we
don't really want to write to an entirely-fake buffer (e.g. a
backend-private page in a static variable). That seems a little silly
to me; I suspect there's a better way.
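
Just to spell out what that fake-buffer idea would look like (again, a
sketch with made-up names):

    /* Backend-private scratch page: the "entirely-fake buffer". */
    static char scratch_undo_page[BLCKSZ];

    /*
     * Hypothetical helper: bytes we don't actually want to persist are
     * written into the scratch page and simply forgotten.
     */
    static void
    undo_write_bytes(char *dest, const char *src, Size len)
    {
        Assert(len <= BLCKSZ);
        if (dest == NULL)
            dest = scratch_undo_page;
        memcpy(dest, src, len);
    }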

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
