| From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> | 
|---|---|
| To: | Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> | 
| Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: POC: Cleaning up orphaned files using undo logs | 
| Date: | 2019-08-09 08:39:16 | 
| Message-ID: | CAA4eK1K_6wQPd=f4JeeWeaJcKU-YopVd90WUouuKVodGRKFOAQ@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>
> --------------------
>
> Some further review comments for undoworker.c :
>
>
> +/* Sets the worker's lingering status. */
> +static void
> +UndoWorkerIsLingering(bool sleep)
> The function name sounds like "is the worker lingering ?". Can we
> rename it to something like "UndoWorkerSetLingering" ?
>
makes sense, changed as per suggestion.
> -------------
>
> + errmsg("undo worker slot %d is empty, cannot attach",
> + slot)));
>
> + }
> +
> + if (MyUndoWorker->proc)
> + {
> + LWLockRelease(UndoWorkerLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("undo worker slot %d is already used by "
> + "another worker, cannot attach", slot)));
>
> These two error messages can have a common error message "could not
> attach to worker slot", with errdetail separate for each of them :
> slot %d is empty.
> slot %d is already used by another worker.
>
> --------------
>
Changed as per suggestion.
> +static int
> +IsUndoWorkerAvailable(void)
> +{
> + int i;
> + int alive_workers = 0;
> +
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
>
> Should have bool return value.
>
> Also, why is it keeping track of number of alive workers ? Sounds like
> earlier it used to return number of alive workers ? If it indeed needs
> to just return true/false, we can do away with alive_workers.
>
> Also, *exclusive* lock is unnecessary.
>
> --------------
>
Changed as per suggestion.  Additionally, I changed the name of the
function to UndoWorkerIsAvailable(), so that it is similar to other
functions in the file.
> +if (UndoGetWork(false, false, &urinfo, NULL) &&
> +    IsUndoWorkerAvailable())
> +    UndoWorkerLaunch(urinfo);
>
> There is no lock acquired between IsUndoWorkerAvailable() and
> UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
> returns true, there is a small window where UndoWorkerLaunch() does
> not find any worker slot with in_use false, causing assertion failure
> for (worker != NULL).
> --------------
>
I have removed the assert and instead added a warning.  I have also
added a comment from the place where we call UndoWorkerLaunch to
mention the race condition.
> +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
> +{
> + /* Block concurrent access. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> *Exclusive* lock is unnecessary.
> -------------
>
Right, changed to share lock.
> + LWLockRelease(UndoWorkerLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("undo worker slot %d is empty",
> + slot)));
> I believe there is no need to explicitly release an lwlock before
> raising an error, since the lwlocks get released during error
> recovery. Please check all other places where this is done.
> -------------
>
Fixed.
> + * Start new undo apply background worker, if possible otherwise return false.
> worker, if possible otherwise => worker if possible, otherwise
> -------------
>
> +static bool
> +UndoWorkerLaunch(UndoRequestInfo urinfo)
> We don't check UndoWorkerLaunch() return value. Can't we make it's
> return value type void ?
>
Now, the function returns void and accordingly I have adjusted the
comment which should address both the above comments.
> Also, it would be better to have urinfo as pointer to UndoRequestInfo
> rather than UndoRequestInfo, so as to avoid structure copy.
> -------------
>
Okay, changed as per suggestion.
> +{
> + BackgroundWorker bgw;
> + BackgroundWorkerHandle *bgw_handle;
> + uint16 generation;
> + int i;
> + int slot = 0;
> We can remove variable i, and use slot variable in place of i.
> -----------
>
> + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker");
> I think it would be trivial to also append the worker->generation in
> the bgw_name.
> -------------
>
I am not sure if adding 'generation' is any useful. It might be better
to add database id as each worker can work on a particular database,
so that could be useful information.
>
> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + /* Failed to start worker, so clean up the worker slot. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> + UndoWorkerCleanup(worker);
> + LWLockRelease(UndoWorkerLock);
> +
> + return false;
> + }
>
> Is it intentional that there is no (warning?) message logged when we
> can't register a bg worker ?
> -------------
>
Added a warning in that code path.
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Amit Kapila | 2019-08-09 08:54:50 | Re: POC: Cleaning up orphaned files using undo logs | 
| Previous Message | Heikki Linnakangas | 2019-08-09 08:34:05 | default_table_access_method is not in sample config file |