Re: POC: Cleaning up orphaned files using undo logs

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-13 08:23:59
Message-ID: CAFiTN-uqBQwKggJn=JtgFH8P2WQqLse6YysGpDAJv3PsbGi+Ew@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 30, 2019 at 1:32 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> Amit, short note: The patches aren't attached in patch order. Obviously
> a miniscule thing, but still nicer if that's not the case.
>
> Dilip, this also contains the start of a review for the undo record
> interface further down.

> > Subject: [PATCH 07/14] Provide interfaces to store and fetch undo records.
> >
> > Add the capability to form undo records and store them in undo logs. We
> > also provide the capability to fetch the undo records. This layer will use
> > undo-log-storage to reserve the space for the undo records and buffer
> > management routines to write and read the undo records.
> >
>
> > Undo records are stored in sequential order in the undo log.
>
> Maybe "In each undo log undo records are stored in sequential order."?
Done
>
>
>
> > +++ b/src/backend/access/undo/README.undointerface
> > @@ -0,0 +1,29 @@
> > +Undo record interface layer
> > +---------------------------
> > +This is the next layer which sits on top of the undo log storage, which will
> > +provide an interface for prepare, insert, or fetch the undo records. This
> > +layer will use undo-log-storage to reserve the space for the undo records
> > +and buffer management routine to write and read the undo records.
>
> The reference to "undo log storage" kinda seems like a reference into
> nothingness...
Changed
>
>
> > +Writing an undo record
> > +----------------------
> > +To prepare an undo record, first, it will allocate required space using
> > +undo log storage module. Next, it will pin and lock the required buffers and
> > +return an undo record pointer where it will insert the record. Finally, it
> > +calls the Insert routine for final insertion of prepared record. Additionally,
> > +there is a mechanism for multi-insert, wherein multiple records are prepared
> > +and inserted at a time.
>
> I'm not sure what this is telling me. Who is "it"?
>
> To me the filename ("interface"), and the title of this section,
> suggests this provides documentation on how to write code to insert undo
> records. But I don't think this does.
I have improved it
>
>
> > +Fetching and undo record
> > +------------------------
> > +To fetch an undo record, a caller must provide a valid undo record pointer.
> > +Optionally, the caller can provide a callback function with the information of
> > +the block and offset, which will help in faster retrieval of undo record,
> > +otherwise, it has to traverse the undo-chain.
>
> > +There is also an interface to bulk fetch the undo records. Where the caller
> > +can provide a TO and FROM undo record pointer and the memory limit for storing
> > +the undo records. This API will return all the undo record between FROM and TO
> > +undo record pointers if they can fit into provided memory limit otherwise, it
> > +return whatever can fit into the memory limit. And, the caller can call it
> > +repeatedly until it fetches all the records.
>
> There's a lot of terminology in this file that's not been introduced. I
> think this needs to be greatly expanded and restructured to allow people
> unfamiliar with the code to benefit.
I have improved it, but I think I still need to work on it to
introduce the terminology used.
>
>
> > +/*-------------------------------------------------------------------------
> > + *
> > + * undoaccess.c
> > + * entry points for inserting/fetching undo records
>
> > + * NOTES:
> > + * Undo record layout:
> > + *
> > + * Undo records are stored in sequential order in the undo log. Each undo
> > + * record consists of a variable length header, tuple data, and payload
> > + * information.
>
> Is that actually true? There are records without tuples, no?
Right, changed this
>
> > The first undo record of each transaction contains a
> > + * transaction header that points to the next transaction's start
> > header.
>
> Seems like this needs to reference different persistence levels,
> otherwise it seems misleading, given there can be multiple first records
> in multiple undo logs?
I have changed it.
>
>
> > + * This allows us to discard the entire transaction's log at one-shot
> > rather
>
> s/at/in/
Fixed
>
> > + * than record-by-record. The callers are not aware of transaction header,
>
> s/of/of the/
Fixed
>
> > + * this is entirely maintained and used by undo record layer. See
>
> s/this/it/
Fixed
>
> > + * undorecord.h for detailed information about undo record header.
>
> s/undo record/the undo record/
Fixed
>
>
> I think at the very least there's explanations missing for:
> - what is the locking protocol for multiple buffers
> - what are the contexts for insertion
> - what phases an undo insertion happens in
> - updating previous records in general
> - what "packing" actually is
>
>
> > +
> > +/* Prototypes for static functions. */
>
>
> Don't think we commonly include that...
Changed, removed all unwanted prototypes
>
> > +static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec,
> > + UndoRecPtr urp, RelFileNode rnode,
> > + UndoPersistence persistence,
> > + Buffer *prevbuf);
> > +static int UndoRecordPrepareTransInfo(UndoRecordInsertContext *context,
> > + UndoRecPtr xact_urp, int size, int offset);
> > +static void UndoRecordUpdateTransInfo(UndoRecordInsertContext *context,
> > + int idx);
> > +static void UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context,
> > + UndoRecPtr urecptr, UndoRecPtr xact_urp);
> > +static int UndoGetBufferSlot(UndoRecordInsertContext *context,
> > + RelFileNode rnode, BlockNumber blk,
> > + ReadBufferMode rbm);
> > +static uint16 UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer,
> > + UndoPersistence upersistence);
> > +
> > +/*
> > + * Structure to hold the prepared undo information.
> > + */
> > +struct PreparedUndoSpace
> > +{
> > + UndoRecPtr urp; /* undo record pointer */
> > + UnpackedUndoRecord *urec; /* undo record */
> > + uint16 size; /* undo record size */
> > + int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array
> > + * index */
> > +};
> > +
> > +/*
> > + * This holds undo buffers information required for PreparedUndoSpace during
> > + * prepare undo time. Basically, during prepare time which is called outside
> > + * the critical section we will acquire all necessary undo buffers pin and lock.
> > + * Later, during insert phase we will write actual records into thse buffers.
> > + */
> > +struct PreparedUndoBuffer
> > +{
> > + UndoLogNumber logno; /* Undo log number */
> > + BlockNumber blk; /* block number */
> > + Buffer buf; /* buffer allocated for the block */
> > + bool zero; /* new block full of zeroes */
> > +};
>
> Most files define datatypes before function prototypes, because
> functions may reference the datatypes.
done
>
>
> > +/*
> > + * Prepare to update the transaction header
> > + *
> > + * It's a helper function for PrepareUpdateNext and
> > + * PrepareUpdateUndoActionProgress
>
> This doesn't really explain much. PrepareUpdateUndoActionProgress
> doesn't exist. I assume it's UndoRecordPrepareApplyProgress from 0012?
Enhanced the comments
>
>
> > + * xact_urp - undo record pointer to be updated.
> > + * size - number of bytes to be updated.
> > + * offset - offset in undo record where to start update.
> > + */
>
> These comments seem redundant with the parameter names.
fixed
>
>
> > +static int
> > +UndoRecordPrepareTransInfo(UndoRecordInsertContext *context,
> > + UndoRecPtr xact_urp, int size, int offset)
> > +{
> > + BlockNumber cur_blk;
> > + RelFileNode rnode;
> > + int starting_byte;
> > + int bufidx;
> > + int index = 0;
> > + int remaining_bytes;
> > + XactUndoRecordInfo *xact_info;
> > +
> > + xact_info = &context->xact_urec_info[context->nxact_urec_info];
> > +
> > + UndoRecPtrAssignRelFileNode(rnode, xact_urp);
> > + cur_blk = UndoRecPtrGetBlockNum(xact_urp);
> > + starting_byte = UndoRecPtrGetPageOffset(xact_urp);
> > +
> > + /* Remaining bytes on the current block. */
> > + remaining_bytes = BLCKSZ - starting_byte;
> > +
> > + /*
> > + * Is there some byte of the urec_next on the current block, if not then
> > + * start from the next block.
> > + */
>
> This comment needs rephrasing.
Done
>
>
> > + /* Loop until we have fetched all the buffers in which we need to write. */
> > + while (size > 0)
> > + {
> > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, RBM_NORMAL);
> > + xact_info->idx_undo_buffers[index++] = bufidx;
> > + size -= (BLCKSZ - starting_byte);
> > + starting_byte = UndoLogBlockHeaderSize;
> > + cur_blk++;
> > + }
>
> So, this locks a very large number of undo buffers at the same time, do
> I see that correctly? What guarantees that there are no deadlocks due
> to multiple buffers locked at the same time (I guess the order inside
> the log)? What guarantees that this is a small enough number that we can
> even lock all of them at the same time?

I think we are locking them in block order, and that should avoid
the deadlock. I have explained this in the comments.
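To illustrate the argument (this is a hedged standalone sketch, not patch code): if every backend acquires the buffer locks for an undo write in strictly ascending block order, no two backends can ever hold locks in conflicting order, so the classic lock-order deadlock cannot arise. The helper name below is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true if buffer locks would be taken in strictly ascending
 * block order -- the invariant that rules out lock-order deadlocks
 * between two backends writing undo at the same time.
 */
static bool
lock_order_is_deadlock_free(const int *blocks, int nblocks)
{
	for (int i = 1; i < nblocks; i++)
	{
		if (blocks[i] <= blocks[i - 1])
			return false;	/* out-of-order (or repeated) acquisition */
	}
	return true;
}
```

The loop in UndoRecordPrepareTransInfo satisfies this because it walks cur_blk upward one block at a time.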

>
> Why do we need to lock all of them at the same time? That's not clear to
> me.

Because this is called outside the critical section, we keep locks on
all the buffers that we want to update inside the critical section for
a single WAL record.
>
> Also, why do we need code to lock an unbounded number here? It seems
> hard to imagine we'd ever want to update more than something around 8
> bytes? Shouldn't that at the most require two buffers?
Right, it should lock at most 2 buffers, and I have now added an assert
for that. Basically, it can lock either 1 or 2 buffers, so I am not
sure what the best condition to break the loop is. Since our target is
to write 8 bytes, the terminating condition should be the number of
bytes written. I agree that we should never go beyond two buffers, and
for that we can add an assert. Do you have another opinion on this?
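For what it's worth, the page-count reasoning can be checked with a small standalone function mirroring the loop quoted above (pages_touched is a hypothetical helper, and the header size is an illustrative value, not the patch's actual constant): an 8-byte update starting anywhere on a page spills onto at most one following page.

```c
#include <assert.h>

#define BLCKSZ 8192					/* standard PostgreSQL block size */
#define UndoLogBlockHeaderSize 24	/* illustrative header size */

/*
 * Count how many undo pages an update of `size` bytes touches when it
 * starts at byte `starting_byte` of the first page (mirrors the loop
 * in UndoRecordPrepareTransInfo).
 */
static int
pages_touched(int starting_byte, int size)
{
	int			npages = 0;

	while (size > 0)
	{
		npages++;
		size -= (BLCKSZ - starting_byte);
		starting_byte = UndoLogBlockHeaderSize;
	}
	return npages;
}
```

So an 8-byte urec_next update needs one buffer in the common case and two only when it straddles a page boundary, which is exactly what the proposed assert would verify.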

>
>
> > +/*
> > + * Prepare to update the previous transaction's next undo pointer.
> > + *
> > + * We want to update the next transaction pointer in the previous transaction's
> > + * header (first undo record of the transaction). In prepare phase we will
> > + * unpack that record and lock the necessary buffers which we are going to
> > + * overwrite and store the unpacked undo record in the context. Later,
> > + * UndoRecordUpdateTransInfo will overwrite the undo record.
> > + *
> > + * xact_urp - undo record pointer of the previous transaction's header
> > + * urecptr - current transaction's undo record pointer which need to be set in
> > + * the previous transaction's header.
> > + */
> > +static void
> > +UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context,
> > + UndoRecPtr urecptr, UndoRecPtr xact_urp)
>
> That name imo is confusing - it's not clear that it's not actually about
> the next record or such.
I agree; I will think about what to name it. I am planning
to unify the two functions UndoRecordPrepareUpdateNext and
PrepareUpdateUndoActionProgress, and then we can directly name it
PrepareUndoRecordUpdate. But for that, I need to get the progress
update code into my patch.

>
>
> > +{
> > + UndoLogSlot *slot;
> > + int index = 0;
> > + int offset;
> > +
> > + /*
> > + * The absence of previous transaction's undo indicate that this backend
>
> *indicates
>
Done
>
> > + /*
> > + * Acquire the discard lock before reading the undo record so that discard
> > + * worker doesn't remove the record while we are in process of reading it.
> > + */
>
> *the discard worker
Done
>
>
> > + LWLockAcquire(&slot->discard_update_lock, LW_SHARED);
> > + /* Check if it is already discarded. */
> > + if (UndoLogIsDiscarded(xact_urp))
> > + {
> > + /* Release lock and return. */
> > + LWLockRelease(&slot->discard_update_lock);
> > + return;
> > + }
>
> Ho, hum. I don't quite remember what we decided in the discussion about
> not having to use the discard lock for this purpose.
I think we haven't settled on an alternative solution for this and
planned to keep it as is for now. Please correct me if anyone else
has a different opinion.

>
>
> > + /* Compute the offset of the uur_next in the undo record. */
> > + offset = SizeOfUndoRecordHeader +
> > + offsetof(UndoRecordTransaction, urec_next);
> > +
> > + index = UndoRecordPrepareTransInfo(context, xact_urp,
> > + sizeof(UndoRecPtr), offset);
> > + /*
> > + * Set the next pointer in xact_urec_info, this will be overwritten in
> > + * actual undo record during update phase.
> > + */
> > + context->xact_urec_info[index].next = urecptr;
>
> What does "this will be overwritten" mean? It sounds like "context->xact_urec_info[index].next"
> would be overwritten, but that can't be true.
>
>
> > + /* We can now release the discard lock as we have read the undo record. */
> > + LWLockRelease(&slot->discard_update_lock);
> > +}
>
> Hm. Because you expect it to be blocked behind the content lwlocks for
> the buffers?
Yes, I added comments.
>
>
> > +/*
> > + * Overwrite the first undo record of the previous transaction to update its
> > + * next pointer.
> > + *
> > + * This will insert the already prepared record by UndoRecordPrepareTransInfo.
>
> It doesn't actually appear to insert any records. At least not a record
> in the way the rest of the file uses that term?

I think this was an old comment. Fixed it.
>
>
> > + * This must be called under the critical section.
>
> s/under the/in a/
I think I missed this in my last patch. Will fix it in the next version.
>
> Think that should be asserted.
Added the assert.
>
>
> > + /*
> > + * Start writing directly from the write offset calculated during prepare
> > + * phase. And, loop until we write required bytes.
> > + */
>
> Why do we do offset calculations multiple times? Seems like all the
> offsets, and the split, should be computed in exactly one place.
Sorry, I did not understand this; we are calculating the offset in
the prepare phase. Are you pointing out something else?

>
>
> > +/*
> > + * Find the block number in undo buffer array
> > + *
> > + * If it is present then just return its index otherwise search the buffer and
> > + * insert an entry and lock the buffer in exclusive mode.
> > + *
> > + * Undo log insertions are append-only. If the caller is writing new data
> > + * that begins exactly at the beginning of a page, then there cannot be any
> > + * useful data after that point. In that case RBM_ZERO can be passed in as
> > + * rbm so that we can skip a useless read of a disk block. In all other
> > + * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
> > + * happen to be already in the buffer pool.
> > + */
> > +static int
> > +UndoGetBufferSlot(UndoRecordInsertContext *context,
> > + RelFileNode rnode,
> > + BlockNumber blk,
> > + ReadBufferMode rbm)
> > +{
> > + int i;
> > + Buffer buffer;
> > + XLogRedoAction action = BLK_NEEDS_REDO;
> > + PreparedUndoBuffer *prepared_buffer;
> > + UndoPersistence persistence = context->alloc_context.persistence;
> > +
> > + /* Don't do anything, if we already have a buffer pinned for the block. */
>
> As the code stands, it's locked, not just pinned.
Changed

>
>
> > + for (i = 0; i < context->nprepared_undo_buffer; i++)
> > + {
>
> How large do we expect this to get at most?
>
In BeginUndoRecordInsert we are computing it:

+ /* Compute number of buffers. */
+ nbuffers = (nprepared + MAX_UNDO_UPDATE_INFO) * MAX_BUFFER_PER_UNDO;

>
> > + /*
> > + * We did not find the block so allocate the buffer and insert into the
> > + * undo buffer array.
> > + */
> > + if (InRecovery)
> > + action = XLogReadBufferForRedoBlock(context->alloc_context.xlog_record,
> > + SMGR_UNDO,
> > + rnode,
> > + UndoLogForkNum,
> > + blk,
> > + rbm,
> > + false,
> > + &buffer);
>
> Why is not locking the buffer correct here? Can't there be concurrent
> reads during hot standby?
Because XLogReadBufferForRedoBlock locks it internally. I have
added a comment about this in the new patch.
>
>
> > +/*
> > + * This function must be called before all the undo records which are going to
> > + * get inserted under a single WAL record.
>
> How can a function be called "before all the undo records"?

"Before all the undo records which are getting inserted under a single
WAL record", because it will set the prepare limit and allocate
appropriate memory for that.
So I am not sure what your point is here; why can't we call it before
all the undo records we are inserting?
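To make the intended calling pattern concrete, here is a hedged pseudocode sketch of the lifecycle as this thread describes it (function names other than BeginUndoRecordInsert, PrepareUndoInsert, and InsertPreparedUndo are guesses, not confirmed patch API, and the argument lists are illustrative):

```
BeginUndoRecordInsert(&context, persistence, nprepared, NULL);

/* prepare phase: outside any critical section; pins and locks buffers */
for each undo record to write:
    urp = PrepareUndoInsert(&context, &urec, fxid);

START_CRIT_SECTION();
InsertPreparedUndo(&context);   /* write the prepared records */
/* ... emit the single WAL record covering them ... */
END_CRIT_SECTION();

/* release pins/locks and free context resources (name hypothetical) */
FinishUndoRecordInsert(&context);
```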

>
> > + * nprepared - This defines the max number of undo records that can be
> > + * prepared before inserting them.
> > + */
> > +void
> > +BeginUndoRecordInsert(UndoRecordInsertContext *context,
> > + UndoPersistence persistence,
> > + int nprepared,
> > + XLogReaderState *xlog_record)
>
> There definitely needs to be explanation about xlog_record. But also
> about memory management etc. Looks like one e.g. can't call this from a
> short lived memory context.
I have added comments for this.
>
>
> > +/*
> > + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
> > + * intended to insert. Upon return, the necessary undo buffers are pinned and
> > + * locked.
>
> Again, how is deadlocking / max number of buffers handled, and why do
> they all need to be locked at the same time?
>
>
> > + /*
> > + * We don't yet know if this record needs a transaction header (ie is the
> > + * first undo record for a given transaction in a given undo log), because
> > + * you can only find out by allocating. We'll resolve this circularity by
> > + * allocating enough space for a transaction header. We'll only advance
> > + * by as many bytes as we turn out to need.
> > + */
>
> Why can we only find this out by allocating? This seems like an API
> deficiency of the storage layer to me. The information is in the undo log
> slot's metadata, no?

I agree with this. If Thomas agrees, we can provide an API in the
undo log layer that gives us this information before we do the actual
allocation.

>
>
> > + urec->uur_next = InvalidUndoRecPtr;
> > + UndoRecordSetInfo(urec);
> > + urec->uur_info |= UREC_INFO_TRANSACTION;
> > + urec->uur_info |= UREC_INFO_LOGSWITCH;
> > + size = UndoRecordExpectedSize(urec);
> > +
> > + /* Allocate space for the record. */
> > + if (InRecovery)
> > + {
> > + /*
> > + * We'll figure out where the space needs to be allocated by
> > + * inspecting the xlog_record.
> > + */
> > + Assert(context->alloc_context.persistence == UNDO_PERMANENT);
> > + urecptr = UndoLogAllocateInRecovery(&context->alloc_context,
> > + XidFromFullTransactionId(txid),
> > + size,
> > + &need_xact_header,
> > + &last_xact_start,
> > + &prevlog_xact_start,
> > + &prevlogurp);
> > + }
> > + else
> > + {
> > + /* Allocate space for writing the undo record. */
>
> That's basically the same comment as before the if.
Removed
>
>
> > + urecptr = UndoLogAllocate(&context->alloc_context,
> > + size,
> > + &need_xact_header, &last_xact_start,
> > + &prevlog_xact_start, &prevlog_insert_urp);
> > +
> > + /*
> > + * If prevlog_xact_start is a valid undo record pointer that means
> > + * this transaction's undo records are split across undo logs.
> > + */
> > + if (UndoRecPtrIsValid(prevlog_xact_start))
> > + {
> > + uint16 prevlen;
> > +
> > + /*
> > + * If undo log is switch during transaction then we must get a
>
> "is switch" isn't right.

This code is removed now.
>
> > +/*
> > + * Insert a previously-prepared undo records.
>
> s/a//
Fixed
>
>
> More tomorrow.
>

Please refer to the latest patch at
https://www.postgresql.org/message-id/CAFiTN-uf4Bh0FHwec%2BJGbiLq%2Bj00V92W162SLd_JVvwW-jwREg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
