Re: POC: Cleaning up orphaned files using undo logs

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-17 17:28:53
Message-ID: 20190817172853.n2ujr7dcshyidl4f@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-08-17 12:05:21 -0400, Robert Haas wrote:
> On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > > Again, I think it's not ok to just assume you can lock an essentially
> > > > unbounded number of buffers. This seems almost guaranteed to result in
> > > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> > >
> > > I think for controlling that we need to put a limit on max prepared
> > > undo? I am not sure any other way of limiting the number of buffers
> > > because we must lock all the buffer in which we are going to insert
> > > the undo record under one WAL logged operation.
> >
> > I heard that a number of times. But I still don't know why that'd
> > actually be true. Why would it not be sufficient to just lock the buffer
> > currently being written to, rather than all buffers? It'd require a bit
> > of care updating the official current "logical end" of a log, but
> > otherwise ought to not be particularly hard? Only one backend can extend
> > the log after all, and until the log is externally visibily extended,
> > nobody can read or write those buffers, no?
>
> Well, I don't understand why you're on about this. We've discussed it
> a number of times but I'm still confused.

There's two reasons here:

The primary one in the context here is that if we do *not* have to lock
the buffers all ahead of time, we can simplify the interface. We
certainly can't lock the buffers over IO (due to buffer reclaim) as
we're doing right now, so we'd need another phase, called by the "user"
during undo insertion. But if we do not need to lock the buffers before
the insertion over all starts, the inserting location doesn't have to
care.

Secondarily, all the reasoning for needing to lock all buffers ahead of
time was imo fairly unconvincing. Following the "recipe" for WAL
insertions is a good idea when writing a new run-of-the-mill WAL
inserting location - but when writing a new fundamental facility, that
already needs to modify how WAL works, then I find that much less
convincing.

> 1. It's absolutely fine to just put a limit on this, because the
> higher-level facilities that use this shouldn't be doing a single
> WAL-logged operation that touches a zillion buffers. We have been
> careful to avoid having WAL-logged operations touch an unbounded
> number of buffers in plenty of other places, like the btree code, and
> we are going to have to be similarly careful here for multiple
> reasons, deadlock avoidance being one. So, saying, "hey, you're going
> to lock an unlimited number of buffers" is a straw man. We aren't.
> We can't.

Well, in the version of code that I was reviewing here, I don't there is
such a limit (there is a limit for buffers per undo record, but no limit
on the number of records inserted together). I think Dilip added a limit
since. And we have the issue of a lot of IO happening while holding
content locks on several pages. So I don't think it's a straw man at
all.

> 2. The write-ahead logging protocol says that you're supposed to lock
> all the buffers at once. See src/backend/access/transam/README. If
> you want to go patch that file, then this patch can follow whatever
> the locking rules in the patched version are. But until then, the
> patch should follow *the actual rules* not some other protocol based
> on a hand-wavy explanation in an email someplace. Otherwise, you've
> got the same sort of undocumented disaster-waiting-to-happen that you
> keep complaining about in other parts of this patch. We need fewer of
> those, not more!

But that's not what I'm asking for? I don't even know where you take
from that I don't want this to be documented. I'm mainly asking for a
comment explaining why the current behaviour is what it is. Because I
don't think an *implicit* "normal WAL logging rules" is sufficient
explanation, because all the locking here happens one or two layers away
from the WAL logging site - so it's absolutely *NOT* obvious that that's
the explanation. And I don't think any of the locking sites actually has
comments explaining why the locks are acquired at that time (in fact,
IIRC until the review some even only mentioned pinning, not locking).

> > > Suppose you insert one record for the transaction which split in
> > > block1 and 2. Now, before this block is actually going to the disk
> > > the transaction committed and become all visible the undo logs are
> > > discarded. It's possible that block 1 is completely discarded but
> > > block 2 is not because it might have undo for the next transaction.
> > > Now, during recovery (FPW is off) if block 1 is missing but block 2 is
> > > their so we need to skip inserting undo for block 1 as it does not
> > > exist.
> >
> > Hm. I'm quite doubtful this is a good idea. How will this not force us
> > to a emit a lot more expensive durable operations while writing undo?
> > And doesn't this reduce error detection quite remarkably?
> >
> > Thomas, Robert?
>
> I think you're going to need to spell out your assumptions in order
> for me to be able to comment intelligently. This is another thing
> that seems pretty normal to me. Generally, WAL replay might need to
> recreate objects whose creation is not separately WAL-logged, and it
> might need to skip operations on objects that have been dropped later
> in the WAL stream and thus don't exist any more. This seems like an
> instance of the latter pattern. There's no reason to try to put valid
> data into pages that we know have been discarded, and both inserting
> and discarding undo data need to be logged anyway.

Yea, I was "intentionally" vague here. I didn't have a concrete scenario
that I was concerned about, but it somehow didn't quite seem right, and
I didn't encounter an explanation why it's guaranteed to be safe. So
more eyes seemed like a good idea. I'm not at all sure that there is an
actual problem here - I'm mostly trying to understand this code, from
the perspective of somebody reading it for the first time.

I think what primarily makes me concerned is that it's not clear to me
what guarantees that discard is the only reason for the block to
potentially be missing. I contrast to most other similar cases where WAL
replay simply re-creates the objects when trying to replay an action
affecting such an object, here we simply skip over the WAL logged
operation. So if e.g. the entire underlying UNDO file got lost, we
neither re-create it with valid content, nor error out. Which means we
got to be absolutely sure that all undo files are created in a
persistent manner, at their full size. And that there's no way that data
could get lost, without forcing us to perform REDO up to at least the
relevant point again.

While it appears that we always WAL log the undo extension, I am not
convinced the recovery interlock is strong enough. For one
UndoLogDiscard() unlinks segments before WAL logging their removal -
which means if we crash after unlink() and before the
XLogInsert(XLOG_UNDOLOG_DISCARD) we'd theoretically be in trouble (in
practice we might be fine, because there ought to be nobody still
referencing that UNDO - but I don't think that's actually guaranteed as
is). Nor do I see where we're updating minRecoveryLocation when
replaying a XLOG_UNDOLOG_DISCARD, which means that a restart during
recovery could be stopped before the discard has been replayed, leaving
us with wrong UNDO, but allowing write acess. Seems we'd at least need a
few more XLogFlush() calls.

> One idea we could consider, if it makes the code sufficiently simpler
> and doesn't cost too much performance, is to remove the facility for
> skipping over bytes to be written and instead write any bytes that we
> don't really want to write to an entirely-fake buffer (e.g. a
> backend-private page in a static variable). That seems a little silly
> to me; I suspect there's a better way.

I suspect so too.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-08-17 17:41:18 max_parallel_workers can't actually be set?
Previous Message Ibrar Ahmed 2019-08-17 17:10:45 Re: Patch: New GUC prepared_statement_limit to limit memory used by prepared statements