Quick Links

Re: POC: Cleaning up orphaned files using undo logs

From:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To:	Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>
Subject:	Re: POC: Cleaning up orphaned files using undo logs
Date:	2019-08-16 05:15:42
Message-ID:	CA+hUKG+=aWx7rMDrCh_5YG6M-OvNX2QwRQjJrngHovxW6oFvQw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi Kuntal,

On Thu, Jul 25, 2019 at 5:40 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
> Here are some review comments on 0003-Add-undo-log-manager.patch. I've
> tried to avoid duplicate comments as much as possible.

Thanks! Replies inline. I'll be posting a new patch set shortly with
these and other fixes.

> 1. In UndoLogAllocate,
> + * time this backend as needed to write to an undo log at all or because
> s/as/has

Fixed.

> + * Maintain our tracking of the and the previous transaction start
> Do you mean current log's transaction start as well?

Right, fixed.

> 2. In UndoLogAllocateInRecovery,
> we try to find the current log from the first undo buffer. So, after a
> log switch, we always have to register at least one buffer from the
> current undo log first. If we're updating something in the previous
> log, the respective buffer should be registered after that. I think we
> should document this in the comments.

I'm not sure I understand. Is this working correctly today?

> 3. In UndoLogGetOldestRecord(UndoLogNumber logno, bool *full),
> it seems the 'full' parameter is not used anywhere. Do we still need this?
>
> + /* It's been recycled. SO it must have been entirely discarded. */
> s/SO/So

Fixed.

> 4. In CleanUpUndoCheckPointFiles,
> we can emit a debug2 message with something similar to : 'removed
> unreachable undo metadata files'

Done.

> + if (unlink(path) != 0)
> + elog(ERROR, "could not unlink file \"%s\": %m", path);
> according to my observation, whenever we deal with a file operation,
> we usually emit a ereport message with errcode_for_file_access().
> Should we change it to ereport? There are other file operations as
> well including read(), OpenTransientFile() etc.

Done.

> 5. In CheckPointUndoLogs,
> + /* Capture snapshot while holding each mutex. */
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + serialized[num_logs++] = slot->meta;
> + LWLockRelease(&slot->mutex);
> why do we need an exclusive lock to read something from the slot? A
> share lock seems to be sufficient.

OK.

> pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC) is called
> after pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE)
> without calling pgstat_report_wait_end(). I think you've done the
> same to avoid an extra function call. But, it differs from other
> places in the PG code. Perhaps, we should follow this approach
> everywhere.

Ok, changed.

> 6. In StartupUndoLogs,
> + if (fd < 0)
> + elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
> assuming your agreement upon changing above elog to ereport, the
> message should be more user friendly. May be something like 'cannot
> open pg_undo file'.

Done.

> + if ((size = read(fd, &slot->meta, sizeof(slot->meta))) != sizeof(slot->meta))
> The usage of size of doesn't look like a problem. But, we can save
> some extra padding bytes at the end if we use (offsetof + sizeof)
> approach similar to other places in PG.

It current ends in a 64 bit value, so there is no padding here.

> 7. In free_undo_log_slot,
> + /*
> + * When removing an undo log from a slot in shared memory, we acquire
> + * UndoLogLock, log->mutex and log->discard_lock, so that other code can
> + * hold any one of those locks to prevent the slot from being recycled.
> + */
> + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + Assert(slot->logno != InvalidUndoLogNumber);
> + slot->logno = InvalidUndoLogNumber;
> + memset(&slot->meta, 0, sizeof(slot->meta));
> + LWLockRelease(&slot->mutex);
> + LWLockRelease(UndoLogLock);
> you've not taken the discard_lock as mentioned in the comment.

Right, I was half-way between two different ideas about how that
interlocking should work, but I have straightened this out now, and
will write about the overall locking model separately.

> 8. In find_undo_log_slot,
> + * 1. If the calling code knows that it is attached to this lock or is the
> s/lock/slot

Fixed.

BTW I am experimenting with macros that would actually make assertions
about those programming rules.

> + * 2. All other code should acquire log->mutex before accessing any members,
> + * and after doing so, check that the logno hasn't moved. If it is not, the
> + * entire undo log must be assumed to be discarded (as if this function
> + * returned NULL) and the caller must behave accordingly.
> Perhaps, you meant '..check that the logno remains same. If it is not..'.

Fixed.

> + /*
> + * If we didn't find it, then it must already have been entirely
> + * discarded. We create a negative cache entry so that we can answer
> + * this question quickly next time.
> + *
> + * TODO: We could track the lowest known undo log number, to reduce
> + * the negative cache entry bloat.
> + */
> This is an interesting thought. But, I'm wondering how we are going to
> search the discarded logno in the simple hash. I guess that's why it's
> in the TODO list.

Done. Each backend tracks its idea of the lowest undo log that
exists. There is a shared low_logno that is recomputed whenever a
slot is freed (ie a log is entirely discarded). Whenever a backend
sees that its own value is too low, it walks forward dropping cache
entries. Perhaps this could be made more proactive later by using
sinval, but I didn't look into that.

> 9. In attach_undo_log,
> + * For now we have a simple linked list of unattached undo logs for each
> + * persistence level. We'll grovel though it to find something for the
> + * tablespace you asked for. If you're not using multiple tablespaces
> s/though/through

Fixed.

> + if (slot == NULL)
> + {
> + if (UndoLogShared->next_logno > MaxUndoLogNumber)
> + {
> + /*
> + * You've used up all 16 exabytes of undo log addressing space.
> + * This is a difficult state to reach using only 16 exabytes of
> + * WAL.
> + */
> + elog(ERROR, "undo log address space exhausted");
> + }
> looks like a potential unlikely() condition.

Done. Yeah, actually every branch containing an unconditional elog()
at ERROR or higher (or maybe even lower) must surely be considered
unlikely, and it'd be nice to tell the leading compilers about that,
but the last thread about that hasn't made it as far as a useful patch
for some technical reason that didn't seem fatal to the concept, IIRC.
I'd be curious to know what sort effect that sort of rule would have
on the whole tree, in terms of code locality, even if you have to hack
the compiler to find out...

--
Thomas Munro
https://enterprisedb.com

In response to

Re: POC: Cleaning up orphaned files using undo logs at 2019-07-25 05:39:57 from Kuntal Ghosh

Responses

Re: POC: Cleaning up orphaned files using undo logs at 2019-08-20 09:02:18 from Thomas Munro

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2019-08-16 05:26:00	Re: POC: Cleaning up orphaned files using undo logs
Previous Message	Thomas Munro	2019-08-16 04:42:34	Re: POC: Cleaning up orphaned files using undo logs