Re: POC: Cleaning up orphaned files using undo logs

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-16 04:42:34
Message-ID: CA+hUKG+xeQzYHTryPFSkgQPQ=SNCcbiF0-4gWmv7atamJXW9uQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Amit,

I've combined three of your messages into one below, and responded
inline. New patch set to follow shortly, with the fixes listed below
(and others from other reviewers).

On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> 0003-Add-undo-log-manager.patch
> 1.
> allocate_empty_undo_segment()
...
> + /* create two parents up if not exist */
> + parentdir = pstrdup(undo_path);
> + get_parent_directory(parentdir);
> + get_parent_directory(parentdir);
> + /* Can't create parent and it doesn't already exist? */
> + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
>
>
> All of this code is almost same as we have code in
> TablespaceCreateDbspace still we have small differences like here you
> are using mkdir instead of MakePGDirectory which as far as I can see
> use similar permissions for creating directory. Also, it checks
> whether the directory exists before trying to create it. Is there a
> reason why we need to do a few things differently here if not, they
> can both the places use one common function?

Right, MakePGDirectory() arrived in commit da9b580d8990, and should
probably be used everywhere we create directories under pgdata.
Fixed.

Yeah, I think we could just use TablespaceCreateDbspace() for this, if
we are OK with teaching GetDatabasePath() and GetRelationPath() about
how to make special undo paths, OR we are OK with just using
"standard" paths, where undo files just live under database 9 (instead
of the special "undo" directory). I stopped using a "9" directory in
earlier versions because undo moved to a separate namespace when we
agreed to use an extra discriminator in buffer tags and so forth; now
that we're back to using database number 9, the question of whether to
reflect that on the filesystem is back. I have had some trouble
deciding which parts of the system should treat undo logs as some kind
of 'relation' (and the SLRU project will have to tackle the same
questions). I'll think about that some more before making the change.

> 2.
> allocate_empty_undo_segment()
> {
> ..
> ..
> /* Flush the contents of the file to disk before the next checkpoint. */
> + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> ..
> }

> The comment in allocate_empty_undo_segment indicates that the code
> wants to flush before checkpoint, but the actual function tries to
> register the request with checkpointer. Shouldn't this be similar to
> XLogFileInit where we use pg_fsync to flush the contents immediately?

I responded to the general question about when we sync files in an
earlier email. I've updated the comments to make it clearer that it's
handing the work off, not doing it now.

> Another thing is that recently in commit 475861b261 (commit by you),
> we have introduced a mechanism to not fill the files with zero's for
> certain filesystems like ZFS. Do we want similar behavior for undo
> files?

Good point. I will create a separate thread to discuss how the
creation of a central file allocation routine (possibly with a GUC),
and see if we can come up with something reusable for this, but
independently committable.

> 3.
> +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
> +{
> + UndoLogSlot *slot;
> + size_t end;
> +
> + slot = find_undo_log_slot(logno, false);
> +
> + /* TODO review interlocking */
> +
> + Assert(slot != NULL);
> + Assert(slot->meta.end % UndoLogSegmentSize == 0);
> + Assert(new_end % UndoLogSegmentSize == 0);
> + Assert(InRecovery ||
> + CurrentSession->attached_undo_slots[slot->meta.category] == slot);
>
> Can you write some comments explaining the above Asserts? Also, can
> you explain what interlocking issues are you worried about here?

I added comments about the assertions. I will come back to the
interlocking in another message, which I've now addressed (alluded to
below as well).

> 4.
> while (end < new_end)
> + {
> + allocate_empty_undo_segment(logno, slot->meta.tablespace, end);
> + end += UndoLogSegmentSize;
> + }
> +
> + /* Flush the directory entries before next checkpoint. */
> + undofile_request_sync_dir(slot->meta.tablespace);
>
> I see that at two places after allocating empty undo segment, the
> patch performs undofile_request_sync_dir whereas it doesn't perform
> the same in UndoLogNewSegment? Is there a reason for the same or is it
> missed from one of the places?

You're right. Done.

> 5.
> +static void
> +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
> {
> ..
> /*
> + * We didn't need to acquire the mutex to read 'end' above because only
> + * we write to it. But we need the mutex to update it, because the
> + * checkpointer might read it concurrently.
>
> Is this assumption correct? It seems patch also modified
> slot->meta.end during discard in function UndoLogDiscard. I am
> referring below code:
>
> +UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
> {
> ..
> + /* Update shmem to show the new discard and end pointers. */
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + slot->meta.discard = discard;
> + slot->meta.end = end;
> + LWLockRelease(&slot->mutex);
> ..
> }

Yeah, the assumption was wrong, and that's what that other TODO note
about interlocking was referring to. I have redesigned this so that
there is a separate per-undo log extend_lock that allows
UndoLogDiscard() (in a background worker or superuser command) and
UndoLogAllocate() to serialise extension of the undo log. Hopefully
foreground processes don't often have to wait (a discard worker will
recycle segments fast enough), but if it ever does have to wait, it's
waiting for another backend to rename() a fully allocated file, which
is hopefully still better than writing a load of zeroes into a new
file.

> 6.
> extend_undo_log()
> {
> ..
> ..
> if (!InRecovery)
> + {
> + xl_undolog_extend xlrec;
> + XLogRecPtr ptr;
> +
> + xlrec.logno = logno;
> + xlrec.end = end;
> +
> + XLogBeginInsert();
> + XLogRegisterData((char *) &xlrec, sizeof(xlrec));
> + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
> + XLogFlush(ptr);
> + }
>
> It is not obvious to me why we need to perform XLogFlush here, can you explain?

It's not needed, and I've removed it. The important thing here is
that we insert the record after creating the files and telling the
checkpointer to flush them; there's no benefit to flushing the WAL
record.

There are three crash recovery possibilities: (1) We recover from a
checkpoint after this record, and the files are already durable, (2)
we recover from a checkpoint before this record, replay this record,
the file(s) may or may not be present but we'll tolerate them if they
are and overwrite, (3) we recover from a checkpoint before this record
was written, and this WAL record is never replayed because it wasn't
flushed, and then there may or may not be some orphaned files but
we'll eventually try to create files with the same names as we extend
the undo log and tolerate their existence.

> 7.
> +attach_undo_log(UndoLogCategory category, Oid tablespace)
> {
> ..
> if (candidate->meta.tablespace == tablespace)
> + {
> + logno = *place;
> + slot = candidate;
> + *place = candidate->next_free;
> + break;
> + }
>
> Here, the code is breaking from the loop, so why do we need to set
> *place? Am I missing something obvious?

(See further down).

> 8.
> + /* WAL-log the creation of this new undo log. */
> + {
> + xl_undolog_create xlrec;
> +
> + xlrec.logno = logno;
> + xlrec.tablespace = slot->meta.tablespace;
> + xlrec.category = slot->meta.category;
> +
> + XLogBeginInsert();
> + XLogRegisterData((char *) &xlrec, sizeof(xlrec));
>
> Here and in most other places in this patch you are using
> sizeof(xlrec) for xlog static data. However, as far as I know in
> other places in the code we define the size using offset of the last
> parameter of corresponding structure to avoid any inconsistency in WAL
> record size across different platforms. Is there a reason to go
> differently with this patch? See below one for example:
>
> typedef struct xl_hash_add_ovfl_page
> {
> uint16 bmsize;
> bool bmpage_found;
> } xl_hash_add_ovfl_page;
>
> #define SizeOfHashAddOvflPage
> \
> (offsetof(xl_hash_add_ovfl_page, bmpage_found) + sizeof(bool))

I see. Apparently we don't always do that:

tmunro(at)dogmatix $ git grep RegisterData | grep sizeof | wc -l
60
tmunro(at)dogmatix $ git grep RegisterData | grep Size | wc -l
63

I've now done it for all of these structs so that we trim the padding
in some cases, even though in some cases it'll make no difference.

> 9.
> +static void
> +undolog_xlog_create(XLogReaderState *record)
> +{
> + xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
> + UndoLogSlot *slot;
> +
> + /* Create meta-data space in shared memory. */
> + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
> +
> + /* TODO: assert that it doesn't exist already? */
> +
> + slot = allocate_undo_log_slot();
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
>
> Why do we need to acquire locks during recovery?

Mostly because allocate_undo_log_slot() asserts that the lock is held.
We probably don't need this in recovery but it doesn't seem like a
problem, it'll never be contended.

> 10.
> I think UndoLogAllocate can leak allocation of slots. It first
> allocates the slot for a new log from the free pool in there is no
> existing slot/log, writes a WAL record and then at a later point of
> time it actually creates the required physical space in the log via
> extend_undo_log which also writes a separate WAL. Now, if there is a
> error between these two operations, then we will have a redundant slot
> allocated. What if there are repeated errors for similar thing from
> multiple backends after which system crashes. Now, after restart, we
> will allocate multiple slots for different lognos which don't have any
> actual (physical) logs. This might not be a big problem in practice
> because the chances of error between two operations are less, but
> can't we delay the WAL logging for allocation of a slot for a new log.

I don't think it leaks anything, and the undo log is not redundant,
it's free/available. An undo log is allowed to have no space
allocated (discard == end). In fact that can happen a few different
ways: for example after crash recovery, we unlink all files belonging
to unlogged undo logs by scanning the filesystem, and set discard ==
end, on a segment boundary. That's also the way new undo logs are
born, and I think that's OK. If you crash and recover up to the point
the undo log creation was WAL-logged, you'll now have a log with no
space, and then the first person to try to allocate something in it
will extend it (= create the file, move the end pointer) in the
process of allocating space.

> 11.
> +UndoLogAllocate()
> {
> ..
> ..
> + /*
> + * Maintain our tracking of the and the previous transaction start
> + * locations.
> + */
> + if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert)
> + {
> + slot->meta.unlogged.last_xact_start =
> + slot->meta.unlogged.this_xact_start;
> + slot->meta.unlogged.this_xact_start = slot->meta.unlogged.insert;
> + }
>
> ".. of the and the ..", after first the, something is missing.

Fixed.

> 12.
> UndoLogAllocate()
> {
> ..
> ..
> + /*
> + * We don't need to acquire log->mutex to read log->meta.insert and
> + * log->meta.end, because this backend is the only one that can
> + * modify them.
> + */
> + if (unlikely(new_insert > slot->meta.end))
>
> I might be confused but slot->meta.end is modified by discard process
> also, so how is it safe? If so, may be adding a comment to explain
> the same would be good. Also, I think in the comments log should be
> replaced with the slot.

Right, now fixed.

I fixed s/log->/slot->/ here and elsewhere in comments.

> 13.
> UndoLogAllocate()
> {
> ..
> + /* This undo log is entirely full. Get a new one. */
> + if (logxid == GetTopTransactionId())
> + {
> + /*
> + * If the same transaction is split over two undo logs then
> + * store the previous log number in new log. See detailed
> + * comments in undorecord.c file header.
> + */
> ..
> }
>
> The undorecord.c should be renamed to undoaccess.c

Fixed.

> 14.
> UndoLogAllocate()
> {
> ..
> + if (logxid != GetTopTransactionId())
> + {
> + /*
> + * While we have the lock, check if we have been forcibly detached by
> + * DROP TABLESPACE. That can only happen between transactions (see
> + * DropUndoLogsInsTablespace()).
> + */
>
> /DropUndoLogsInsTablespace/DropUndoLogsInTablespace

Fixed.

> 15.
> UndoLogSegmentPath()
> {
> ..
> /*
> + * Build the path from log number and offset. The pathname is the
> + * UndoRecPtr of the first byte in the segment in hexadecimal, with a
> + * period inserted between the components.
> + */
> + snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno,
> + segno * UndoLogSegmentSize);
> ..
> }
>
> a. It is not very clear from the above code why are we multiplying
> segno with UndoLogSegmentSize? I see that many of the callers pass
> segno as segno/UndoLogSegmentSize. Won't it be better if the caller
> take care of passing correct value of segno?

We want "the UndoRecPtr of the first byte in the segment [...] with a
period inserted between the components". Seems clear? So undo log 7,
segno 0 will be 000007.0000000000 and undo log 7, segno 1 will be
000007.0000100000, and UndoRecPtr of its first byte is at
0000070000100000 (so when you're looking at pg_stat_undo_logs or
undoinspect() or any other representation of undo record pointers, you
can easily see which files they are referring to). It's true that we
could pass in the offset of the first byte, instead of the segment
number, but some other callers have a segment number (see undofile.c).

> b. In the comment above, instead of offset, shouldn't there be segment number.

No, segno * segment size == offset (the offset part of an UndoRecPtr
is the lower 48 bits; the upper 24 bits are the undo log number).

> 16. UndoLogGetLastXactStartPoint is not used any where. I think this
> was required in previous version of patchset, now, we can remove it.

Done, thanks.

> 17.
> Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com
>
> This discussion link seems to be from old discussion/thread, not this one.

Will reference this one.

> 0019-Add-developer-documentation-for-the-undo-log-storage
> 18.
> +each undo log, a set of meta-data properties is tracked:
> +tracked, including:
> +
> +* the tablespace that holds its segment files
> +* the persistence level (permanent, unlogged or temporary)
>
> Here, don't we want to refer to UndoLogCategory rather than
> persistence level? "tracked, including:" seems bit confusing.

Fixed here and elsewhere.

> 0020-Add-user-facing-documentation-for-undo-logs
> 19.
> <row>
> + <entry><structfield>persistence</structfield></entry>
> + <entry><type>text</type></entry>
> + <entry>Persistence level of data stored in this undo log; one of
> + <literal>permanent</literal>, <literal>unlogged</literal> or
> + <literal>temporary</literal>.</entry>
> + </row>
>
> Don't we want to cover the new (shared) undolog category here?

Done (though I have mixed feelings about this shared category; more on
that soon).

On Thu, Jul 25, 2019 at 12:14 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > 7.
> > +attach_undo_log(UndoLogCategory category, Oid tablespace)
> > {
> > ..
> > if (candidate->meta.tablespace == tablespace)
> > + {
> > + logno = *place;
> > + slot = candidate;
> > + *place = candidate->next_free;
> > + break;
> > + }
> >
> > Here, the code is breaking from the loop, so why do we need to set
> > *place? Am I missing something obvious?
> >
>
> I think I know what I was missing. It seems here you are removing an
> element from the freelist.

Right.

> One point related to detach_current_undo_log.
>
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + slot->pid = InvalidPid;
> + slot->meta.unlogged.xid = InvalidTransactionId;
> + if (full)
> + slot->meta.status = UNDO_LOG_STATUS_FULL;
> + LWLockRelease(&slot->mutex);
>
> If I read the comments in structure UndoLogMetaData, it is mentioned
> that 'status' is changed by explicit WAL record whereas there is no
> WAL record in code to change the status. I see the problem as well if
> we don't WAL log this change. Suppose after changing the status of
> this log, we allocate a new log and insert some records in that log as
> well for the same transaction for which we have inserted records in
> the log which we just marked as FULL. Now, here we form the link
> between two logs as the same transaction has overflowed into a new
> log. Say, we crash after this. Now, after recovery the log won't be
> marked as FULL which means there is a chance that it can be used for
> some other transaction, if that happens, then our link for a
> transaction spanning to different log will break and we won't be able
> to access the data in another log. In short, I think it is important
> to WAL log this status change unless I am missing something.

I thought it was OK to relax that and was going to just fix the
comment, but the case you describe seems important. It seems we could
either by WAL-logging the status changes as you said, or make sure the
links have enough information to handle that. I'll think about that
some more.

On Thu, Jul 25, 2019 at 10:37 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Some more review of the same patch:
> 1.
> +typedef struct UndoLogSharedData
> +{
> + UndoLogNumber free_lists[UndoLogCategories];
> + UndoLogNumber low_logno;
>
> What is the use of low_logno? I don't see anywhere in the code this
> being assigned any value. Is it for some future use?

Yeah, fixed, and now used. It reduces the need for 'negative cache
entries' after a backend has been running for a very long time.

> 2.
> +void
> +CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo)
> {
> ..
> + /* Compute header checksum. */
> + INIT_CRC32C(crc);
> + COMP_CRC32C(crc, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno));
> + COMP_CRC32C(crc, &UndoLogShared->next_logno,
> sizeof(UndoLogShared->next_logno));
> + COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
> + FIN_CRC32C(crc);
> +
> + /* Write out the number of active logs + crc. */
> + if ((write(fd, &UndoLogShared->low_logno,
> sizeof(UndoLogShared->low_logno)) != sizeof(UndoLogShared->low_logno))
> ||
> + (write(fd, &UndoLogShared->next_logno,
> sizeof(UndoLogShared->next_logno)) !=
> sizeof(UndoLogShared->next_logno)) ||
>
> Is it safe to read UndoLogShared without UndoLogLock? All other
> places accessing UndoLogShared uses UndoLogLock, so if this usage is
> safe, maybe it is better to add a comment.

Fixed for next_logno. And the other one, low_logno is no longer
written to disk (it can be computed).

> 3.
> UndoLogAllocateInRecovery()
> {
> ..
> /*
> + * Otherwise we need to do our own transaction tracking
> + * whenever we see a new xid, to match the logic in
> + * UndoLogAllocate().
> + */
> + if (xid != slot->meta.unlogged.xid)
> + {
> + slot->meta.unlogged.xid = xid;
> + if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert)
> + slot->meta.unlogged.last_xact_start =
> + slot->meta.unlogged.this_xact_start;
> + slot->meta.unlogged.this_xact_start =
> + slot->meta.unlogged.insert;
>
> The code doesn't follow the comment. In UndoLogAllocate, both
> last_xact_start and this_xact_start are assigned in if block, so the
> should be the case here.

True, in "do" I only did the assignment if the values were different,
and in "redo" I did the assignment even if they were the same, which
has the same effect, but is indeed distracting. I've made them the
same.

> 4.
> UndoLogAllocateInRecovery()
> {
> ..
> + /*
> + * Just as in UndoLogAllocate(), the caller may be extending an existing
> + * allocation before committing with UndoLogAdvance().
> + */
> + if (context->try_location != InvalidUndoRecPtr)
> + {
> ..
> }
>
> I am not sure how will this work because unlike UndoLogAllocate, this
> function doesn't set try_location initially. It will be set later by
> UndoLogAdvance which can easily go wrong because that doesn't include
> UndoLogBlockHeaderSize.

Hmm, yeah, I need to look into that some more. The 'advance' function
does include consider header bytes though.

I do admit that this code is very hard to follow. It got that way by
being developed before the 'context' existed. It needs to be
rewritten in a much clearer way; I'm going to do that.

> 5.
> +UndoLogAdvance(UndoLogAllocContext *context, size_t size)
> +{
> + context->try_location = UndoLogOffsetPlusUsableBytes(context->try_location,
> + size);
> +}
>
> Here, you are using UndoRecPtr whereas UndoLogOffsetPlusUsableBytes
> expects offset.

Yeah that is ugly. I created UndoRecPtrPlusUsableBytes().

> 6.
> UndoLogAllocateInRecovery()
> {
> ..
> + /*
> + * At this stage we should have an undo log that can handle this
> + * allocation. If we don't, something is screwed up.
> + */
> + if (UndoLogOffsetPlusUsableBytes(slot->meta.unlogged.insert, size) >
> slot->meta.end)
> + elog(ERROR,
> + "cannot allocate %d bytes in undo log %d",
> + (int) size, slot->logno);
> ..
> }
>
> Similar to point-5, here you are using a pointer instead of offset.

Fixed.

> 7.
> UndoLogAllocateInRecovery()
> {
> ..
> + /* We found a reference to a different (or first) undo log. */
> + slot = find_undo_log_slot(logno, false);
> ..
> + /* TODO: check locking against undo log slot recycling? */
> ..
> }
>
> I think it is better to have an Assert here that slot can't be NULL.
> AFAICS, slot can't be NULL unless there is some bug. I don't
> understand this 'TODO' comment.

Yeah. I just removed it. "slot" was already dereferenced above so an
assertion that it's not NULL is too late to have any useful effect.

> 8.
> + {
> + {"undo_tablespaces", PGC_USERSET,
> CLIENT_CONN_STATEMENT,
> + gettext_noop("Sets the
> tablespace(s) to use for undo logs."),
> + NULL,
> +
> GUC_LIST_INPUT | GUC_LIST_QUOTE
> + },
> +
> &undo_tablespaces,
> + "",
> + check_undo_tablespaces,
> assign_undo_tablespaces, NULL
> + },
>
> It seems you need to update variable_is_guc_list_quote for this variable.

Huh. Right. Done.

> 9.
> +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
> {
> ..
> + if (!InRecovery)
> + {
> + xl_undolog_extend xlrec;
> + XLogRecPtr ptr;
> +
> + xlrec.logno = logno;
> + xlrec.end = end;
> +
> + XLogBeginInsert();
> + XLogRegisterData((char *) &xlrec, sizeof(xlrec));
> + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
> + XLogFlush(ptr);
> + }
> ..
> }
>
> Do we need it for temporary/unlogged persistence level? Similarly,
> there is a WAL logging in attach_undo_log which I can't understand why
> it would be required for temporary/unlogged persistence levels.

You're right, if we crash we don't care about any data in
temporary/unlogged undo logs, since that data belongs to
temporary/unlogged zheap (etc) tables. We destroy all of their files
at startup in ResetUndoLogs(). So therefore we might as well not both
to log the extension stuff.

Done.

> 10.
> +choose_undo_tablespace(bool force_detach, Oid *tablespace)
> {
> ..
> + oid = get_tablespace_oid(name, true);
> + if (oid == InvalidOid)
> ..
> }
>
> Do we need to check permissions to see if the current user is allowed
> to create in this tablespace?

Yeah, right. I added a pg_tablespace_aclcheck() check

postgres=> set undo_tablespaces = ts1;
SET
postgres=> create table t ();
ERROR: permission denied for tablespace ts1

> 11.
> +static bool
> +choose_undo_tablespace(bool force_detach, Oid *tablespace)
> +{
> + char *rawname;
> + List *namelist;
> + bool
> need_to_unlock;
> + int length;
> + int
> i;
> +
> + /* We need a modifiable copy of string. */
> + rawname =
> pstrdup(undo_tablespaces);
>
> I don't see the usage of rawname outside this function, isn't it
> better to free it? I understand that this function won't be called
> frequently enough to matter, but still, there is some theoretical
> danger if the user continuously changes undo_tablespaces.

Fixed by freeing both rawname and namelist.

> 12.
> +find_undo_log_slot(UndoLogNumber logno, bool locked)
> {
> ..
> + * TODO: We could track the lowest known undo log
> number, to reduce
> + * the negative cache entry bloat.
> + */
> + if (result == NULL)
> + {
> ..
> }
>
> Do we have any mechanism to clear this bloat or will it stay till the
> end of the session? If it is later, then I think it might be good to
> take care of this TODO. I think this is not a blocker, but good to
> have kind of stuff.

I did the TODO, so now we can drop negative cache entries below
low_logno from the cache. There are probably more things we could do
here to be more aggressive but that's a start.

> 13.
> +static void
> +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
> + UndoLogOffset end)
> {
> ..
> }
>
> What will happen if the transaction creating undolog segment rolls
> back? Do we want to have pendingDeletes stuff as we have for normal
> relation files? This might also help in clearing the shared memory
> state (undo log slots) if any.

No, that's non-transactional. The undo log segment remains created,
just like various other things stay permanently even if the
transaction that created them aborts (relation extension, btree
splits, ...).

--
Thomas Munro
https://enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2019-08-16 05:15:42 Re: POC: Cleaning up orphaned files using undo logs
Previous Message Masahiko Sawada 2019-08-16 04:32:13 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)