From: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks |
Date: | 2020-07-14 08:41:29 |
Message-ID: | CALj2ACXJgJ9LtrgFQK4C2sd1r26TyLb1EFx_VXRw1LouYZ0czg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Jul 7, 2020 at 10:24 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2020-07-07 09:44:41 -0400, Tom Lane wrote:
> > Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> writes:
> > > Firstly, pg_start_backup registers nonexclusive_base_backup_cleanup as
> > > on_shmem_exit call back which will
> > > add this function to the before_shmem_exit_list array which is
> > > supposed to be removed on pg_stop_backup
> > > so that we can do the pending cleanup and issue a warning for each
> > > pg_start_backup for which we did not call
> > > the pg_stop backup. Now, I executed a query for which JIT is picked
> > > up, then the the llvm compiler inserts it's
> > > own exit callback i.e. llvm_shutdown at the end of
> > > before_shmem_exit_list array. Now, I executed pg_stop_backup
> > > and call to cancel_before_shmem_exit() is made with the expectation
> > > that the nonexclusive_base_backup_cleanup
> > > callback is removed from before_shmem_exit_list array.
> >
> > I'm of the opinion that the JIT code is abusing this mechanism, and the
> > right thing to do is fix that.
>
> What are you proposing? For now we could easily enough work around this
> by just making it a on_proc_exit() callback, but that doesn't really
> change the fundamental issue imo.
>
> > The restriction you propose to remove is not just there out of
> > laziness, it's an expectation about what safe use of this mechanism
> > would involve. Un-ordered removal of callbacks seems pretty risky: it
> > would mean that whatever cleanup is needed is not going to be done in
> > LIFO order.
>
I quickly searched(in HEAD) what all the callbacks are getting
registered to before_shmem_exit_list, with the intention to see if
they also call corresponding cancel_before_shmem_exit() after their
intended usage is done.
For few of the callbacks there is no cancel_before_shmem_exit(). This
seems expected; those callbacks ought to be executed before shmem
exit. These callbacks are(let say SET 1): ShutdownPostgres,
logicalrep_worker_onexit, llvm_shutdown, Async_UnlistenOnExit,
RemoveTempRelationsCallback, ShutdownAuxiliaryProcess,
do_pg_abort_backup in xlog.c (this callback exist only in v13 or
later), AtProcExit_Twophase.
Which means, once they are into the before_shmem_exit_list array, in
some order, they are never going to be removed from it as they don't
have corresponding cancel_before_shmem_exit() and the relative order
of execution remains the same.
And there are other callbacks that are getting registered to
before_shmem_exit_list array(let say SET 2): apw_detach_shmem,
_bt_end_vacuum_callback, pg_start_backup_callback,
pg_stop_backup_callback, createdb_failure_callback,
movedb_failure_callback, do_pg_abort_backup(in basebackup.c). They all
have corresponding cancel_before_shmem_exit() to unregister/remove the
callbacks from before_shmem_exit_list array.
I think the callbacks that have no cancel_before_shmem_exit()(SET 1)
may have to be executed in the LIFO order: it makes sense to execute
ShutdownPostgres at the end after let's say other callbacks in SET 1.
And the SET 2 callbacks have cancel_before_shmem_exit() with the only
intention that there's no need to call the callbacks on the
before_shmem_exit(), since they are not needed, and try to remove from
the before_shmem_exit_list array and may fail, if any other callback
gets registered in between.
If I'm not wrong with the above points, we must enhance
cancel_before_shmem_exit() or have cancel_before_shmem_exit_v2() (as
mentioned in my below response).
>
> Maybe I am confused, but isn't it <13's pg_start_backup() that's
> violating the protocol much more clearly than the JIT code? Given that
> it relies on there not being any callbacks registered between two SQL
> function calls? I mean, what it does is basically:
>
> 1) before_shmem_exit(nonexclusive_base_backup_cleanup...
> 2) arbitrary code executed for a long time
> 3) cancel_before_shmem_exit(nonexclusive_base_backup_cleanup...
>
> which pretty obviously can't at all deal with any other
> before_shmem_exit callbacks being registered in 2). Won't this be a
> problem for every other before_shmem_exit callback that we register
> on-demand? Say Async_UnlistenOnExit, RemoveTempRelationsCallback?
>
Yes, for versions <13's, clearly pg_start_backup causes the problem
and the issue can also be reproduced with Async_UnlistenOnExit,
RemoveTempRelationsCallback coming in between pg_start_backup and
pg_stop_backup.
We can have it fixed in a few ways: 1) enhance
cancel_before_shmem_exit() as attached in the original patch. 2) have
existing cancel_before_shmem_exit(), whenever called for
nonexclusive_base_backup_cleanup(), we can look for the entire array
instead of just the last entry. 3) have a separate function, say,
cancel_before_shmem_exit_v2(), that searches for the entire
before_shmem_exit_list array(the logic proposed in this patch) so that
it will not disturb the existing cancel_before_shmem_exit(). 4) or try
to have the pg_start_backup code that exists in after > 13 versions.
If okay to have cancel_before_shmem_exit_v2() for versions < 13's,
with the searching for the entire array instead of just the last
element to fix the abort issue, maybe we can have this function in
version 13 and latest as well(at least with disable mode, something
like #if 0 ... #endif), so that in case if any of similar issues arise
we could just quickly reuse.
If the before_shmem_exit_list array is to be used in LIFO order, do we
have some comment/usage guideline mentioned in the ipc.c/.h? I didn't
find one.
Thoughts?
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Magnus Hagander | 2020-07-14 08:59:45 | Re: recovering from "found xmin ... from before relfrozenxid ..." |
Previous Message | Amit Kapila | 2020-07-14 08:29:31 | Re: replication_origin and replication_origin_lsn usage on subscriber |