Re: 9.6rc1 Background worker starting multiple times

From: Keith Fiske <keith(at)omniti(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: 9.6rc1 Background worker starting multiple times
Date: 2016-12-23 21:18:11
Message-ID: CAG1_KcDdckBorU3B7H-eGmpcGM9HQMnYfSNOKMU3HWsx-rPcrg@mail.gmail.com
Lists: pgsql-bugs

On Tue, Dec 20, 2016 at 12:04 PM, Keith Fiske <keith(at)omniti(dot)com> wrote:

>
>
> On Tue, Dec 20, 2016 at 10:26 AM, Keith Fiske <keith(at)omniti(dot)com> wrote:
>
>>
>>
>> On Tue, Dec 20, 2016 at 1:42 AM, Michael Paquier <
>> michael(dot)paquier(at)gmail(dot)com> wrote:
>>
>>> On Tue, Dec 20, 2016 at 6:47 AM, Keith Fiske <keith(at)omniti(dot)com> wrote:
>>> > I'm not quite sure exactly how to debug this further to see what is
>>> actually
>>> > setting the latch in 9.6.
>>>
>>> The commit you are complaining about here is likely this one:
>>> commit: db0f6cad4884bd4c835156d3a720d9a79dbd63a9
>>> author: Robert Haas <rhaas(at)postgresql(dot)org>
>>> date: Fri, 9 Oct 2015 14:31:04 -0400
>>> Remove set_latch_on_sigusr1 flag.
>>>
>>> This flag has proven to be a recipe for bugs, and it doesn't seem like
>>> it can really buy anything in terms of performance. So let's just
>>> *always* set the process latch when we receive SIGUSR1 instead of
>>> trying to do it only when needed.
>>>
>>> So your process gets a SIGUSR1 and its latch is set, causing the loop
>>> to repeat... I would discard that as a backend bug, because the
>>> removal of set_latch_on_sigusr1 was a good move.
>>> --
>>> Michael
>>>
>>
>> Anything I can do in the mean time to fix this, or have to wait for a
>> bugfix patch?
>>
>> Keith
>>
>
> So, attached is my naive attempt to handle SIGUSR1. I can at least confirm
> this is the source of the problem, but I'm unable to figure out how to work
> around it. Handling the signal does stop the background worker from going
> nuts, but it also completely stops the maintenance loop from working
> properly: it runs once, then never again. It seems to get stuck after the
> call to WaitForBackgroundWorkerStartup() and never even reaches my
> WaitLatch() call. There is logging showing that the dynamic BGW is
> successfully unregistered, but it never seems to go further than that. I
> tried calling latch_sigusr1_handler() the way the handler in procsignal.c
> does, but that didn't seem to fix it. With some additional debugging I
> added, it doesn't even appear to be calling my custom signal handler
> function at all; got_sigusr1 is never set to true.
>
> The simple act of registering a handler for SIGUSR1 seems to stop the
> original bug. If I comment out just this line in pg_partman_bgw_main() (the
> one in pg_partman_bgw_run_maint() seems to make no difference):
>
> pqsignal(SIGUSR1, pg_partman_bgw_sigusr1);
>
> it goes back to doing what it was doing before.
>
> I'm guessing there's a lot more to SIGUSR1 than I can see right now, but I
> figured I'd at least try to provide some additional investigation. I
> noticed a comment in the BGW code mentioning that SIGUSR1 is used on worker
> start/stop.
>
> https://doxygen.postgresql.org/bgworker_8h_source.html#l00099
>
> Perhaps that's part of the issue when I try to handle it myself instead?
>
> The partman maintenance timer is set to 10 seconds, so the timestamps on
> the log entries show it's not running on schedule.
>
> 2016-12-20 11:56:46 EST [] [14486]: [7-1] user=,db=,e=00000 LOG: database
> system is ready to accept connections
> 2016-12-20 11:56:46 EST [] [14495]: [1-1] user=,db=,e=00000 LOG:
> pg_partman master background worker master process initialized with role
> keith
> 2016-12-20 11:56:46 EST [] [14495]: [2-1] user=,db=,e=00000 DEBUG: Latch
> status just before for loop: 1
> 2016-12-20 11:56:46 EST [] [14495]: [3-1] user=,db=,e=00000 DEBUG: Latch
> status inside for loop before reset: 1
> 2016-12-20 11:56:46 EST [] [14495]: [4-1] user=,db=,e=00000 DEBUG: Latch
> status just after reset before interupt check: 0
> 2016-12-20 11:56:46 EST [] [14495]: [5-1] user=,db=,e=00000 DEBUG: Latch
> status just after reset after interupt check: 0
> 2016-12-20 11:56:46 EST [] [14495]: [6-1] user=,db=,e=00000 DEBUG: After
> sighup check: 0
> 2016-12-20 11:56:46 EST [] [14495]: [7-1] user=,db=,e=00000 DEBUG: After
> sigterm check: 0
> 2016-12-20 11:56:46 EST [] [14495]: [8-1] user=,db=,e=00000 DEBUG: Before
> sigusr1 check: 0
> 2016-12-20 11:56:46 EST [] [14495]: [9-1] user=,db=,e=00000 DEBUG: After
> sigusr1 check: 0
> 2016-12-20 11:56:46 EST [] [14495]: [10-1] user=,db=,e=00000 DEBUG:
> Dynamic bgw launch begun for keith (0)
> 2016-12-20 11:56:46 EST [] [14495]: [11-1] user=,db=,e=00000 DEBUG:
> Registering dynamic background worker...
> 2016-12-20 11:56:46 EST [] [14495]: [12-1] user=,db=,e=00000 DEBUG:
> Waiting for BGW startup...
> 2016-12-20 11:56:46 EST [] [14486]: [8-1] user=,db=,e=00000 DEBUG:
> registering background worker "pg_partman dynamic background worker
> (dbname=keith)"
> 2016-12-20 11:56:46 EST [] [14486]: [9-1] user=,db=,e=00000 DEBUG:
> starting background worker process "pg_partman dynamic background worker
> (dbname=keith)"
> 2016-12-20 11:56:46 EST [] [14493]: [1-1] user=,db=,e=00000 LOG:
> autovacuum launcher started
> 2016-12-20 11:56:46 EST [] [14496]: [1-1] user=,db=,e=00000 DEBUG: Before
> parsing dbname GUC in dynamic main func: keith
> 2016-12-20 11:56:46 EST [] [14496]: [2-1] user=,db=,e=00000 DEBUG: GUC
> rawstring copy: keith
> 2016-12-20 11:56:46 EST [] [14496]: [3-1] user=,db=,e=00000 DEBUG:
> Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
> 2016-12-20 11:56:46 EST [] [14496]: [4-1] user=,db=,e=00000 DEBUG:
> Parsing list: keith (0)
> 2016-12-20 11:56:46 EST [] [14496]: [5-1] user=,db=,e=00000 DEBUG: Before
> run_maint initialize connection for db keith
> 2016-12-20 11:56:46 EST [] [14496]: [6-1] user=,db=,e=00000 DEBUG: After
> run_maint initialize connection for db keith
> 2016-12-20 11:56:46 EST [] [14496]: [7-1] user=,db=,e=00000 DEBUG:
> Checking if pg_partman extension is installed in database: keith
> 2016-12-20 11:56:46 EST [] [14496]: [8-1] user=,db=,e=00000 LOG:
> pg_partman dynamic background worker (dbname=keith) dynamic background
> worker initialized with role keith on database keith
> 2016-12-20 11:56:46 EST [] [14496]: [9-1] user=,db=,e=00000 LOG:
> pg_partman dynamic background worker (dbname=keith): SELECT
> "partman".run_maintenance(p_analyze := true, p_jobmon := true) called by
> role keith on database keith
> 2016-12-20 11:56:46 EST [] [14496]: [10-1] user=,db=,e=00000 DEBUG:
> pg_partman dynamic BGW shutting down gracefully for database keith.
> 2016-12-20 11:56:46 EST [] [14486]: [10-1] user=,db=,e=00000 DEBUG:
> worker process: pg_partman dynamic background worker (dbname=keith) (PID
> 14496) exited with exit code 0
> 2016-12-20 11:56:46 EST [] [14486]: [11-1] user=,db=,e=00000 DEBUG:
> unregistering background worker "pg_partman dynamic background worker
> (dbname=keith)"
> 2016-12-20 11:57:16 EST [] [14506]: [1-1] user=,db=,e=00000 DEBUG:
> autovacuum: processing database "keith"
> 2016-12-20 11:57:46 EST [] [14515]: [1-1] user=,db=,e=00000 DEBUG:
> autovacuum: processing database "postgres"
> 2
>
>
I found this pull request for pg_cron where they suggested using WaitLatch,
similar to what I was doing in pg_partman:

https://github.com/citusdata/pg_cron/pull/15

I decided to apply that patch and see if it worked. After fixing the missing
WL_TIMEOUT flag, everything worked fine, and that extension didn't seem to
have the same problem pg_partman did. So I went back to adding some more
debugging, and my suspicion about the use of SIGUSR1 on background worker
startup/stop seems justified (thanks to whoever added that comment).
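
For anyone not familiar with the pattern, this is roughly the shape of loop
I'm talking about. It's only a minimal sketch against the 9.6 three-argument
WaitLatch() API, not the actual pg_partman or pg_cron code; the 10-second
interval and the got_sigterm flag are placeholders.

#include "postgres.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/latch.h"

/* Set by the extension's SIGTERM handler (not shown). */
static volatile sig_atomic_t got_sigterm = false;

void
bgw_main_loop(void)
{
    while (!got_sigterm)
    {
        int     rc;

        /* Sleep until the latch is set, the timeout expires, or the
         * postmaster dies. */
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       10 * 1000L);     /* 10 second maintenance interval */
        ResetLatch(MyLatch);

        if (rc & WL_POSTMASTER_DEATH)
            proc_exit(1);

        /* ... run maintenance here ... */
    }
}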

I noticed the timing of things in the log (see below): the dynamic
background worker is unregistered while the main process is sitting in its
WaitLatch() call. So perhaps it's the unregistering of the dynamic BGW that
the main BGW process had launched that is setting the latch...

Looking through bgworker.c, I found this comment

https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/postmaster/bgworker.c#L257

Sure enough, I use bgw_notify_pid so my main BGW process knows when my
dynamic worker finishes running. And if I'm reading this right, the
postmaster sends SIGUSR1 to deliver that notification:

https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/postmaster/bgworker.c#L276
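
To show what I mean, this is roughly how my main worker registers the
dynamic one. Again, just a sketch rather than the actual pg_partman source
(the library name and worker name here are stand-ins); setting
bgw_notify_pid is what makes the postmaster send those SIGUSR1s on start and
stop.

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

static void
launch_dynamic_worker(void)
{
    BackgroundWorker        worker;
    BackgroundWorkerHandle *handle;
    pid_t                   pid;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                       BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_partman_bgw");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "pg_partman_bgw_run_maint");
    snprintf(worker.bgw_name, BGW_MAXLEN, "pg_partman dynamic background worker");

    /* This is the line that matters: the postmaster SIGUSR1s this process
     * when the dynamic worker starts and again when it stops. */
    worker.bgw_notify_pid = MyProcPid;

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        return;

    /* Both of these wait on our process latch; in 9.6 the stop
     * notification also leaves the latch set for the next WaitLatch(). */
    WaitForBackgroundWorkerStartup(handle, &pid);
    WaitForBackgroundWorkerShutdown(handle);
}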

So, because SIGUSR1 now always sets the latch in 9.6, it doesn't seem like
this WaitLatch approach can be used when a BGW itself starts other dynamic
background workers and relies on them to notify it when they're done
running.
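
The only workaround I can think of so far is to treat early latch wakeups as
spurious and keep sleeping for whatever is left of the interval before
running maintenance. This is an untested sketch against the 9.6 API with a
made-up helper name, not something I've wired into pg_partman yet:

#include "postgres.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "utils/timestamp.h"

/* Hypothetical helper: sleep for interval_ms, ignoring latch wakeups that
 * arrive early (e.g. the SIGUSR1 sent because of bgw_notify_pid). */
static void
bgw_wait_full_interval(long interval_ms)
{
    TimestampTz start = GetCurrentTimestamp();
    long        remaining = interval_ms;

    for (;;)
    {
        int     rc;
        long    secs;
        int     usecs;

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       remaining);
        ResetLatch(MyLatch);

        if (rc & WL_POSTMASTER_DEATH)
            proc_exit(1);

        /* Woken before the timeout? Figure out how much time is left and
         * go back to sleep rather than running maintenance early. */
        TimestampDifference(start, GetCurrentTimestamp(), &secs, &usecs);
        remaining = interval_ms - (secs * 1000L + usecs / 1000);
        if (remaining <= 0)
            break;
    }
}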

Can anyone confirm or deny what I think I've found here? If it's true, is
there a better way to handle this in 9.6 than the sketch above, or is this a
bug that needs to be fixed?

2016-12-23 15:31:36 EST [] [30766]: [1-1] user=,db=,e=00000 LOG:
pg_partman master background worker master process initialized with role
keith
2016-12-23 15:31:36 EST [] [30766]: [3-1] user=,db=,e=00000 DEBUG: Latch
status inside for loop before reset: 1
2016-12-23 15:31:36 EST [] [30766]: [4-1] user=,db=,e=00000 DEBUG: Latch
status just after reset before interupt check: 0
[..]
2016-12-23 15:31:36 EST [] [30766]: [14-1] user=,db=,e=00000 DEBUG: Latch
status just after BGW status check: 0
2016-12-23 15:31:36 EST [] [30766]: [15-1] user=,db=,e=00000 DEBUG: Latch
status just before wait call: 0
[...]
2016-12-23 15:31:36 EST [] [30758]: [11-1] user=,db=,e=00000 DEBUG:
unregistering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-12-23 15:31:36 EST [] [30766]: [16-1] user=,db=,e=00000 DEBUG: Latch
status after wait call: 1
2016-12-23 15:31:36 EST [] [30766]: [18-1] user=,db=,e=00000 DEBUG: Latch
status inside for loop before reset: 1
2016-12-23 15:31:36 EST [] [30766]: [19-1] user=,db=,e=00000 DEBUG: Latch
status just after reset before interupt check: 0
[...]
2016-12-23 15:31:36 EST [] [30766]: [29-1] user=,db=,e=00000 DEBUG: Latch
status just after BGW status check: 0
2016-12-23 15:31:36 EST [] [30766]: [30-1] user=,db=,e=00000 DEBUG: Latch
status just before wait call: 0
[...]
2016-12-23 15:31:36 EST [] [30758]: [15-1] user=,db=,e=00000 DEBUG:
unregistering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-12-23 15:31:36 EST [] [30766]: [31-1] user=,db=,e=00000 DEBUG: Latch
status after wait call: 1
2016-12-23 15:31:36 EST [] [30766]: [33-1] user=,db=,e=00000 DEBUG: Latch
status inside for loop before reset: 1
2016-12-23 15:31:36 EST [] [30766]: [34-1] user=,db=,e=00000 DEBUG: Latch
status just after reset before interupt check: 0
