Re: POC: Cleaning up orphaned files using undo logs

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-08 13:31:06
Message-ID: 20190808133106.mirjmhls465kecda@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-08-07 14:50:17 +0530, Amit Kapila wrote:
> On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > > +/*
> > > + * Binary heap comparison function to compare the time at which an error
> > > + * occurred for transactions.
> > > + *
> > > + * The error queue is sorted by next_retry_at and err_occurred_at. Currently,
> > > + * the next_retry_at has some constant delay time (see PushErrorQueueElem), so
> > > + * it doesn't make much sense to sort by both values. However, in future, if
> > > + * we have some different algorithm for next_retry_at, then it will work
> > > + * seamlessly.
> > > + */
> >
> > Why is it useful to have error_occurred_at be part of the comparison at
> > all? If we need a tiebraker, err_occurred_at isn't that (if we can get
> > conflicts for next_retry_at, then we can also get conflicts in
> > err_occurred_at). Seems better to use something actually guaranteed to
> > be unique for a tiebreaker.
> >
>
> This was to distinguish the cases where the request is failing
> multiple times with the request failed the first time. I agree that
> we need a better unique identifier like FullTransactionid though. Do
> let me know if you have any other suggestion?

Sure, I get why you have the field. Even if it were just for debugging
or such. Was just commenting upon it being used as part of the
comparison. I'd just go for (next_retry_at, fxid).

> > > + * backends. This will ensure that it won't get filled.
> > > + */
> >
> > How does this ensure anything?
> >
>
> Because based on this we will have a hard limit on the number of undo
> requests after which we won't allow more requests. See some more
> detailed explanation for the same later in this email. I think the
> comment needs to be updated.

Well, as your code stands, I don't think there is an actual hard limit
on the number of transactions needing to be undone due to the way errors
are handled. There's no consideration of prepared transactions.

> > > + START_CRIT_SECTION();
> > > +
> > > + /* Update the progress in the transaction header. */
> > > + UndoRecordUpdateTransInfo(&context, 0);
> > > +
> > > + /* WAL log the undo apply progress. */
> > > + {
> > > + XLogRecPtr lsn;
> > > + xl_undoapply_progress xlrec;
> > > +
> > > + xlrec.urec_ptr = progress_urec_ptr;
> > > + xlrec.progress = block_num;
> > > +
> > > + XLogBeginInsert();
> > > + XLogRegisterData((char *) &xlrec, sizeof(xlrec));
> > > +
> > > + RegisterUndoLogBuffers(&context, 1);
> > > + lsn = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS);
> > > + UndoLogBuffersSetLSN(&context, lsn);
> > > + }
> > > +
> > > + END_CRIT_SECTION();
> > > +
> > > + /* Release undo buffers. */
> > > + FinishUndoRecordInsert(&context);
> > > +}
> >
> > This whole prepare/execute split for updating apply pregress, and next
> > undo pointers makes no sense to me.
> >
>
> Can you explain what is your concern here? Basically, in the prepare
> phase, we do read and lock the buffer and in the actual update phase
> (which is under critical section), we update the contents in the
> shared buffer. This is the same idea as we use in many places in
> code.

I'll comment on the concerns with the whole API separately.

> > > typedef struct TwoPhaseFileHeader
> > > {
> > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader
> > > uint16 gidlen; /* length of the GID - GID follows the header */
> > > XLogRecPtr origin_lsn; /* lsn of this record at origin node */
> > > TimestampTz origin_timestamp; /* time of prepare at origin node */
> > > +
> > > + /*
> > > + * We need the locations of the start and end undo record pointers when
> > > + * rollbacks are to be performed for prepared transactions using undo-based
> > > + * relations. We need to store this information in the file as the user
> > > + * might rollback the prepared transaction after recovery and for that we
> > > + * need its start and end undo locations.
> > > + */
> > > + UndoRecPtr start_urec_ptr[UndoLogCategories];
> > > + UndoRecPtr end_urec_ptr[UndoLogCategories];
> > > } TwoPhaseFileHeader;
> >
> > Why do we not need that knowledge for undo processing of a non-prepared
> > transaction?

> The non-prepared transaction also needs to be aware of that. It is
> stored in TransactionStateData. I am not sure if I understand your
> question here.

My concern is that I think it's fairly ugly to store data like this in
the 2pc state file. And it's not an insubstantial amount of additional
data either, compared to the current size, even when no undo is in
use. There's a difference between an unused feature increasing backend
local memory and increasing the size of WAL logged data. Obviously it's
not by a huge amount, but still. It also just feels wrong to me.

We don't need the UndoRecPtr's when recovering from a crash/restart to
process undo. Now we obviously don't want to unnecessarily search for
data that is expensive to gather, which is a good reason for keeping
track of this data. But I do wonder if this is the right approach.

I know that Robert is working on a patch that revises the undo request
layer somewhat, it's possible that this is best discussed afterwards.

> > > + case TBLOCK_UNDO:
> > > + /*
> > > + * We reach here when we got error while applying undo
> > > + * actions, so we don't want to again start applying it. Undo
> > > + * workers can take care of it.
> > > + *
> > > + * AbortTransaction is already done, still need to release
> > > + * locks and perform cleanup.
> > > + */
> > > + ResetUndoActionsInfo();
> > > + ResourceOwnerRelease(s->curTransactionOwner,
> > > + RESOURCE_RELEASE_LOCKS,
> > > + false,
> > > + true);
> > > + s->state = TRANS_ABORT;
> > > CleanupTransaction();
> >
> > Hm. Why is it ok that we only perform that cleanup action? Either the
> > rest of potentially held resources will get cleaned up somehow as well,
> > in which case this ResourceOwnerRelease() ought to be redundant, or
> > we're potentially leaking important resources like buffer pins, relcache
> > references and whatnot here?
> >
>
> I had initially used AbortTransaction() here for such things, but I
> was not sure whether that is the right thing when we reach here in
> this state. Because AbortTransaction is already done once we reach
> here. The similar thing happens for the TBLOCK_SUBUNDO state few
> lines below where I had used AbortSubTransaction. Now, one problem I
> faced when AbortSubTransaction got invoked in this code path was it
> internally invokes RecordTransactionAbort->XidCacheRemoveRunningXids
> which result in the error "did not find subXID %u in MyProc". The
> reason is obvious which is that we had already removed it when
> AbortSubTransaction was invoked before applying undo actions. The
> releasing of locks was the thing which we have delayed to allow undo
> actions to be applied which is done here. The other idea here I had
> was to call AbortTransaction/AbortSubTransaction but somehow avoid
> calling RecordTransactionAbort when in this state. Do you have any
> suggestion to deal with this?

Well, what I'm asking is how this possibly could be correct. Perhaps I'm
just missing something, in which case I don't yet want to make
suggestions for how this should look.

My concern is that you seem to have added a state where process quite a
lot of code - the undo actions, which use buffer pins, lwlocks,
sometimes heavyweight locks, potentially even relache, much more - but
we don't actually clean up any of those in case of error, *except* for
*some* resowner managed things. I just don't understand how that could
possibly be correct.

I'm also fairly certain that we had discussed that we can't actually
execute undo outside of a somewhat valid transaction environment - and
as far as I can tell, there's nothing of that here.

Even in the path without an error during UNDO, I see code like:
+ else
+ {
+ AtCleanup_Portals(); /* now safe to release portal memory */
+ AtEOXact_Snapshot(false, true); /* and release the transaction's
+ * snapshots */
+ s->fullTransactionId = InvalidFullTransactionId;
+ s->subTransactionId = TopSubTransactionId;
+ s->blockState = TBLOCK_UNDO;
+ }

without any comments why exactly these two cleanup callbacks need to be
called, and no others. See also below.

And then when UNDO errors out, I see:

+ for (i = 0; i < UndoLogCategories; i++)
+ {

+ PG_CATCH();
+ {
...
+ /* We should never reach here when we are in a semi-critical-section. */
+ Assert(SemiCritSectionCount == 0);
+ }
+ PG_END_TRY();

meaning that we'll just move on to undo the next persistency category
after an error. But there's absolutely no resource cleanup here. Which,
to me, means we'll very easily self-deadlock and things like
that. Consider an error thrown during undo, while holding an lwlock. If
the next persistence category acquires that lock again, we'll
self-deadlock. There's a lot of other similar issues.

So I just don't understand the current model of the xact.c
integration. That might be because I just don't understand the current
design, or because the current design is pretty broken.

> > > +{
> > > + TransactionState s = CurrentTransactionState;
> > > + bool result;
> > > + int i;
> > > +
> > > + /*
> > > + * We don't want to apply the undo actions when we are already cleaning up
> > > + * for FATAL error. See ReleaseResourcesAndProcessUndo.
> > > + */
> > > + if (SemiCritSectionCount > 0)
> > > + {
> > > + ResetUndoActionsInfo();
> > > + return;
> > > + }
> >
> > Wait what? Semi critical sections?
> >
>
> Robert up thread suggested this idea [1] (See paragraph starting with
> "I am not a fan of applying_subxact_undo....") to deal with cases
> where we get an error while applying undo actions and we need to
> promote the error to FATAL.

Well, my problem with this starts with the fact that I don't know see a
reason why we would want to promote subtransaction failures to FATAL. Or
why that would be OK - loosing reliability when using savepoints seems
pretty dubious to me. And sometimes we're can expect to get errors when
savepoints are in use, e.g. out-of-memory errors. And they're often
going to happen again during undo processing. So this isn't a "oh, it
never realistically happens" scenario imo.

There's two comments about this:

+We promote the error to FATAL error if it occurred while applying undo for a
+subtransaction. The reason we can't proceed without applying subtransaction's
+undo is that the modifications made in that case must not be visible even if
+the main transaction commits.

+ * (a) Subtransactions. We can't proceed without applying
+ * subtransaction's undo as the modifications made in that case must not
+ * be visible even if the main transaction commits. The reason why that
+ * can happen is because for undo-based AM's we don't need to have a
+ * separate transaction id for subtransactions and once the main
+ * transaction commits the tuples modified by subtransactions will become
+ * visible.

But that only means we can't allow such errors to be caught - there
should be a lot less harsh ways to achieve that than throwing a FATAL
error. We could e.g. just walk up the transaction stack and mark the
transaction levels as failed or something. So if somebody catches the
error, any database access done will just cause a failure again.

There's also:

+ * (b) Temp tables. We don't expect background workers to process undo of
+ * temporary tables as the same won't be accessible.

But I fail to see why that requires FATALing either. Isn't the worst
outcome here that we'll have some unnecessary undo around?

> > Give code like this I have a hard time seing what the point of having
> > separate queue entries for the different persistency levels is.

> It is not for this case, rather, it is for the case of discard worker
> (background worker) where we process the transactions at log level.
> The permanent and unlogged transactions will be in a separate log and
> can be encountered at different times, so this leads to having
> separate entries for them.

Given a hashtable over fxid, that doesn't seems like a
counter-argument. We can just do an fxid lookup, and if there's already
an entry, update it to reference the additional persistence level.

One question of understanding: Why do we ever want to register undo
requests for transactions that did not start in the log the discard
worker is currently looking at? It seems to me that there's some
complexity involved due to wanting to do that? We might have already
processed the portion of the transaction in the later log, but I don't
see why that'd be a problem?

>
> >
> > > +void
> > > +ReleaseResourcesAndProcessUndo(void)
> > > +{
> > > + TransactionState s = CurrentTransactionState;
> > > +
> > > + /*
> > > + * We don't want to apply the undo actions when we are already cleaning up
> > > + * for FATAL error. One of the main reasons is that we might be already
> > > + * processing undo actions for a (sub)transaction when we reach here
> > > + * (for ex. error happens while processing undo actions for a
> > > + * subtransaction).
> > > + */
> > > + if (SemiCritSectionCount > 0)
> > > + {
> > > + ResetUndoActionsInfo();
> > > + return;
> > > + }
> > > +
> > > + if (!NeedToPerformUndoActions())
> > > + return;
> > > +
> > > + /*
> > > + * State should still be TRANS_ABORT from AbortTransaction().
> > > + */
> > > + if (s->state != TRANS_ABORT)
> > > + elog(FATAL, "ReleaseResourcesAndProcessUndo: unexpected state %s",
> > > + TransStateAsString(s->state));
> > > +
> > > + /*
> > > + * Do abort cleanup processing before applying the undo actions. We must
> > > + * do this before applying the undo actions to remove the effects of
> > > + * failed transaction.
> > > + */
> > > + if (IsSubTransaction())
> > > + {
> > > + AtSubCleanup_Portals(s->subTransactionId);
> > > + s->blockState = TBLOCK_SUBUNDO;
> > > + }
> > > + else
> > > + {
> > > + AtCleanup_Portals(); /* now safe to release portal memory */
> > > + AtEOXact_Snapshot(false, true); /* and release the transaction's
> > > + * snapshots */
> >
> > Why do precisely these actions need to be performed here?
> >
>
> This is to get a transaction into a clean state. Before calling this
> function AbortTransaction has been performed and there were few more
> things we need to do for cleanup.

That doesn't answer my question. Why is it specifically these ones that
need to be called "manually"? Why no others? Where is that explained? I
assume you just copied them from CleanupTransaction() - but there's no
reference to that fact on either side, which means nobody would know to
keep them in sync.

I'll also note that the way it's currently set up, we don't delete the
transaction context before processing undo, at least as far as I can
see. Which seems that some OOM cases won't be able to roll back, even if
there'd be plenty memory except for the memory used by the
transaction. The portal cleanup will allow for some, but not all of
that, I think.

> > > +bool
> > > +ProcessUndoRequestForEachLogCat(FullTransactionId fxid, Oid dbid,
> > > + UndoRecPtr *end_urec_ptr, UndoRecPtr *start_urec_ptr,
> > > + bool *undoRequestResgistered, bool isSubTrans)
> > > +{
> > > + UndoRequestInfo urinfo;
> > > + int i;
> > > + uint32 save_holdoff;
> > > + bool success = true;
> > > +
> > > + for (i = 0; i < UndoLogCategories; i++)
> > > + {
> > > + if (end_urec_ptr[i] && !undoRequestResgistered[i])
> > > + {
> > > + save_holdoff = InterruptHoldoffCount;
> > > +
> > > + PG_TRY();
> > > + {
> > > + /* for subtransactions, we do partial rollback. */
> > > + execute_undo_actions(fxid,
> > > + end_urec_ptr[i],
> > > + start_urec_ptr[i],
> > > + !isSubTrans);
> > > + }
> > > + PG_CATCH();
> > > + {
> > > + /*
> > > + * Add the request into an error queue so that it can be
> > > + * processed in a timely fashion.
> > > + *
> > > + * If we fail to add the request in an error queue, then mark
> > > + * the entry status as invalid and continue to process the
> > > + * remaining undo requests if any. This request will be later
> > > + * added back to the queue by discard worker.
> > > + */
> > > + ResetUndoRequestInfo(&urinfo);
> > > + urinfo.dbid = dbid;
> > > + urinfo.full_xid = fxid;
> > > + urinfo.start_urec_ptr = start_urec_ptr[i];
> > > + if (!InsertRequestIntoErrorUndoQueue(&urinfo))
> > > + RollbackHTMarkEntryInvalid(urinfo.full_xid,
> > > + urinfo.start_urec_ptr);
> > > + /*
> > > + * Errors can reset holdoff count, so restore back. This is
> > > + * required because this function can be called after holding
> > > + * interrupts.
> > > + */
> > > + InterruptHoldoffCount = save_holdoff;
> > > +
> > > + /* Send the error only to server log. */
> > > + err_out_to_client(false);
> > > + EmitErrorReport();
> > > +
> > > + success = false;
> > > +
> > > + /* We should never reach here when we are in a semi-critical-section. */
> > > + Assert(SemiCritSectionCount == 0);
> >
> > This seems entirely and completely broken. You can't just catch an
> > exception and continue. What if somebody held an lwlock when the error
> > was thrown? A buffer pin?
> >
>
> The caller deals with that. For example, when this is called from
> FinishPreparedTransaction, we do AbortOutOfAnyTransaction and when
> called from ReleaseResourcesAndProcessUndo, we just release locks.

I don't see the caller being able to do anything here - the danger is
that a previous category of undo processing might have acquired
resources, and they're not cleaned up on failure, as you've set things up.

> Earlier here also, I had AbortTransaction but was not sure whether
> that is the right thing to do especially because it will lead to
> RecordTransactionAbort called twice, once when we do AbortTransaction
> before applying undo actions and once when we do it after catching the
> exception. Like as I said earlier maybe the right way is to just
> avoid calling RecordTransactionAbort again.

I think that "just" means that you've not divorced the state in which
undo processing is happening well enough from the "original"
transaction. I stand by my suggestion that what needs to happen is
roughly
1) re-assign locks from failed (sub-)transaction to a special "undo"
resource owner
2) completely abort (sub-)transaction
3) start a new (sub-)transaction
4) process undo
5) commit/abort that (sub-)transaction
6) release locks from "undo" resource owner

> > Nor why that's not a problem for:
> >
> > > +We have the hard limit (proportional to the size of the rollback hash table)
> > > +for the number of transactions that can have pending undo. This can help us
> > > +in computing the value of oldestXidHavingUnappliedUndo and allowing us not to
> > > +accumulate pending undo for a long time which will eventually block the
> > > +discard of undo.
> >
>
> The reason why it is not a problem is that we don't remove the entry
> from the hash table rather just mark it such that later discard worker
> can add it to the queues. I am not sure if I understood your question
> completely, but let me try to explain this idea in a bit more detail.
>
> The basic idea is that Rollback Hash Table has space equivalent to all
> the three queues plus (2 * MaxBackends). Now, we will stop allowing
> the new transactions that want to write undo once the hash table has
> entries equivalent to all three queues and we have 2 * Max_Backends
> already attached to undo logs that are not committed. Assume we have
> each queue size as 5 and Max_Backends =10, then ideally we can 35
> entries (3 * 5 + 2 * 10) in the hash table. The way all this is
> related to the error queue being full is like this:
>
> Say, we have a number of hash table entries equal to 15 which
> indicates all queues are full and now 10 backends connected to two
> different logs (permanent and unlogged). Next one of the transaction
> errors out and try to rollback, at this stage, it will add an entry in
> the hash table and try to execute the actions. While executing
> actions, it got an error and couldn't add to error queue because it
> was full, so at this stage, it just marks the hash table entry as
> invalid and proceeds (consider this happens for both logged and
> unlogged categories). So, at this stage, we will have 17 entries in
> the hash table and the other 9 backends attached to 18 logs which
> makes space for 35 xacts if the system crashes at this stage. The
> backend which errored out again tries to perform an operation for
> which it needs to perform undo. Now, we won't allow this backend to
> perform that action because if it crashed after performing the
> operation and before committing, the hash table will overflow.

What I don't understand is why there's any need for these "in hash
table, but not in any queue, and not being processed" type entries. All
that avoiding that seems to need is to make the error queue a bit
bigger?

> >
> > > + /* There might not be any undo log and hibernation might be needed. */
> > > + *hibernate = true;
> > > +
> > > + StartTransactionCommand();
> >
> > Why do we need this? I assume it's so we can have a resource owner?
> >
>
> Yes, and another reason is we are using dbid_exists in this function.

I think it'd be good to avoid needing any database access in both
discard worker, and undo launcher. They really shouldn't need catalog
access architecturally, and in the case of the discard worker we'd add
another process that'd potentially hold the xmin horizon down for a
while in some situations. We could of course add exceptions like we have
for vacuum, but I think we really shouldn't need that.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rafia Sabih 2019-08-08 13:42:33 Re: Resume vacuum and autovacuum from interruption and cancellation
Previous Message Yonatan Misgan 2019-08-08 13:29:08 RE: Locale support