Re: 9.6rc1 Background worker starting multiple times

From: Keith Fiske <keith(at)omniti(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: 9.6rc1 Background worker starting multiple times
Date: 2016-12-20 17:04:32
Message-ID: CAG1_KcCXZUoksYi23n7R0KP5DAm-F3BaTFa4ddp6jmCz9DdmiQ@mail.gmail.com
Lists: pgsql-bugs

On Tue, Dec 20, 2016 at 10:26 AM, Keith Fiske <keith(at)omniti(dot)com> wrote:

>
>
> On Tue, Dec 20, 2016 at 1:42 AM, Michael Paquier <
> michael(dot)paquier(at)gmail(dot)com> wrote:
>
>> On Tue, Dec 20, 2016 at 6:47 AM, Keith Fiske <keith(at)omniti(dot)com> wrote:
>> > I'm not quite sure exactly how to debug this further to see what is
>> actually
>> > setting the latch in 9.6.
>>
>> The commit you are complaining about here is likely this one:
>> commit: db0f6cad4884bd4c835156d3a720d9a79dbd63a9
>> author: Robert Haas <rhaas(at)postgresql(dot)org>
>> date: Fri, 9 Oct 2015 14:31:04 -0400
>> Remove set_latch_on_sigusr1 flag.
>>
>> This flag has proven to be a recipe for bugs, and it doesn't seem like
>> it can really buy anything in terms of performance. So let's just
>> *always* set the process latch when we receive SIGUSR1 instead of
>> trying to do it only when needed.
>>
>> So your process gets a SIGUSR1 and its latch is set, causing the loop
>> to repeat... I would discard that as a backend bug, because the
>> removal of set_latch_on_sigusr1 was a good move.
>> --
>> Michael
>>
>
> Anything I can do in the mean time to fix this, or have to wait for a
> bugfix patch?
>
> Keith
>

So, attached is my naive attempt to handle SIGUSR1. I can at least confirm
this is the source of the problem, but I'm unable to figure out how to work
around it. Handling the signal does in fact stop the background worker from
going nuts, but it also completely stops the maintenance loop from working
properly: it runs once, then never again. It seems to get stuck after the call
to WaitForBackgroundWorkerStartup() and never even reaches my WaitLatch()
call. There is logging showing that it successfully unregisters the dynamic
BGW, but it never seems to go further than that. I tried calling
latch_sigusr1_handler() the same way the handler in procsignal.c does, but
that didn't seem to fix it. With some additional debugging I added, it doesn't
even appear that my custom signal handler function is being called at all;
got_sigusr1 is never set to true.
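
For reference, this is roughly the shape of the handler I'm describing (the
attached pg_partman_bgw.c has my actual attempt; only got_sigusr1 and
pg_partman_bgw_sigusr1 are the real names from above, the rest is just the
usual boilerplate):

#include "postgres.h"

#include <signal.h>

#include "storage/latch.h"

static volatile sig_atomic_t got_sigusr1 = false;

/* Set a flag on SIGUSR1 and chain to the latch machinery, like procsignal.c */
static void
pg_partman_bgw_sigusr1(SIGNAL_ARGS)
{
    int         save_errno = errno;

    got_sigusr1 = true;
    latch_sigusr1_handler();

    errno = save_errno;
}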

The simple act of registering a handler for SIGUSR1 seems to stop the original
bug. If I comment out just this one line in pg_partman_bgw_main() (the
matching call in pg_partman_bgw_run_maint() seems to make no difference):

pqsignal(SIGUSR1, pg_partman_bgw_sigusr1);

it goes back to doing what it was doing before.
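
For context, that registration sits in the usual spot near the top of
pg_partman_bgw_main(), before signals are unblocked (the SIGHUP/SIGTERM
handler names here are illustrative, not copied from the attached file):

pqsignal(SIGHUP, pg_partman_bgw_sighup);    /* existing handlers (names illustrative) */
pqsignal(SIGTERM, pg_partman_bgw_sigterm);
pqsignal(SIGUSR1, pg_partman_bgw_sigusr1);  /* the line I'm commenting out */

/* Handlers must be installed before this; signals stay blocked until here. */
BackgroundWorkerUnblockSignals();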

I'm guessing there's a lot more to SIGUSR1 than I can see right now, but I
figured I'd at least try to provide some additional investigation. I noticed
the BGW code mentions using SIGUSR1 for start/stop notifications:

https://doxygen.postgresql.org/bgworker_8h_source.html#l00099

Perhaps that's part of the issue when I try to handle it myself?
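
On the original problem, one workaround I've been considering (just a sketch
with made-up names like interval_secs, not what's in the attached file) is to
stop assuming a WaitLatch() return means the timeout elapsed, and instead
track when the next run is actually due. Then a latch set by a stray SIGUSR1
only sends the worker back to sleep for the remainder of the interval:

#include "postgres.h"

#include <signal.h>

#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "utils/timestamp.h"

static volatile sig_atomic_t got_sigterm = false;   /* set by a SIGTERM handler */

static void
pg_partman_bgw_main_loop(int interval_secs)
{
    TimestampTz next_run = GetCurrentTimestamp();

    while (!got_sigterm)
    {
        TimestampTz now = GetCurrentTimestamp();
        long        secs;
        int         microsecs;
        long        timeout_ms;
        int         rc;

        if (now >= next_run)
        {
            /* The full interval really has elapsed: do the maintenance work. */
            elog(LOG, "would run maintenance now");
            next_run = TimestampTzPlusMilliseconds(now, interval_secs * 1000L);
        }

        /* Sleep only for whatever is left of the interval. */
        TimestampDifference(now, next_run, &secs, &microsecs);
        timeout_ms = secs * 1000L + microsecs / 1000;

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       timeout_ms);
        ResetLatch(MyLatch);

        if (rc & WL_POSTMASTER_DEATH)
            proc_exit(1);

        /*
         * If a stray SIGUSR1 set the latch, we simply loop around, recompute
         * the remaining sleep time, and wait again instead of running
         * maintenance early.
         */
    }
}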

The partman maintenance timer is set to 10 seconds, so the timestamps on the
log entries below show that it's not running on time.

2016-12-20 11:56:46 EST [] [14486]: [7-1] user=,db=,e=00000 LOG: database system is ready to accept connections
2016-12-20 11:56:46 EST [] [14495]: [1-1] user=,db=,e=00000 LOG: pg_partman master background worker master process initialized with role keith
2016-12-20 11:56:46 EST [] [14495]: [2-1] user=,db=,e=00000 DEBUG: Latch status just before for loop: 1
2016-12-20 11:56:46 EST [] [14495]: [3-1] user=,db=,e=00000 DEBUG: Latch status inside for loop before reset: 1
2016-12-20 11:56:46 EST [] [14495]: [4-1] user=,db=,e=00000 DEBUG: Latch status just after reset before interupt check: 0
2016-12-20 11:56:46 EST [] [14495]: [5-1] user=,db=,e=00000 DEBUG: Latch status just after reset after interupt check: 0
2016-12-20 11:56:46 EST [] [14495]: [6-1] user=,db=,e=00000 DEBUG: After sighup check: 0
2016-12-20 11:56:46 EST [] [14495]: [7-1] user=,db=,e=00000 DEBUG: After sigterm check: 0
2016-12-20 11:56:46 EST [] [14495]: [8-1] user=,db=,e=00000 DEBUG: Before sigusr1 check: 0
2016-12-20 11:56:46 EST [] [14495]: [9-1] user=,db=,e=00000 DEBUG: After sigusr1 check: 0
2016-12-20 11:56:46 EST [] [14495]: [10-1] user=,db=,e=00000 DEBUG: Dynamic bgw launch begun for keith (0)
2016-12-20 11:56:46 EST [] [14495]: [11-1] user=,db=,e=00000 DEBUG: Registering dynamic background worker...
2016-12-20 11:56:46 EST [] [14495]: [12-1] user=,db=,e=00000 DEBUG: Waiting for BGW startup...
2016-12-20 11:56:46 EST [] [14486]: [8-1] user=,db=,e=00000 DEBUG: registering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-20 11:56:46 EST [] [14486]: [9-1] user=,db=,e=00000 DEBUG: starting background worker process "pg_partman dynamic background worker (dbname=keith)"
2016-12-20 11:56:46 EST [] [14493]: [1-1] user=,db=,e=00000 LOG: autovacuum launcher started
2016-12-20 11:56:46 EST [] [14496]: [1-1] user=,db=,e=00000 DEBUG: Before parsing dbname GUC in dynamic main func: keith
2016-12-20 11:56:46 EST [] [14496]: [2-1] user=,db=,e=00000 DEBUG: GUC rawstring copy: keith
2016-12-20 11:56:46 EST [] [14496]: [3-1] user=,db=,e=00000 DEBUG: Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-20 11:56:46 EST [] [14496]: [4-1] user=,db=,e=00000 DEBUG: Parsing list: keith (0)
2016-12-20 11:56:46 EST [] [14496]: [5-1] user=,db=,e=00000 DEBUG: Before run_maint initialize connection for db keith
2016-12-20 11:56:46 EST [] [14496]: [6-1] user=,db=,e=00000 DEBUG: After run_maint initialize connection for db keith
2016-12-20 11:56:46 EST [] [14496]: [7-1] user=,db=,e=00000 DEBUG: Checking if pg_partman extension is installed in database: keith
2016-12-20 11:56:46 EST [] [14496]: [8-1] user=,db=,e=00000 LOG: pg_partman dynamic background worker (dbname=keith) dynamic background worker initialized with role keith on database keith
2016-12-20 11:56:46 EST [] [14496]: [9-1] user=,db=,e=00000 LOG: pg_partman dynamic background worker (dbname=keith): SELECT "partman".run_maintenance(p_analyze := true, p_jobmon := true) called by role keith on database keith
2016-12-20 11:56:46 EST [] [14496]: [10-1] user=,db=,e=00000 DEBUG: pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-20 11:56:46 EST [] [14486]: [10-1] user=,db=,e=00000 DEBUG: worker process: pg_partman dynamic background worker (dbname=keith) (PID 14496) exited with exit code 0
2016-12-20 11:56:46 EST [] [14486]: [11-1] user=,db=,e=00000 DEBUG: unregistering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-20 11:57:16 EST [] [14506]: [1-1] user=,db=,e=00000 DEBUG: autovacuum: processing database "keith"
2016-12-20 11:57:46 EST [] [14515]: [1-1] user=,db=,e=00000 DEBUG: autovacuum: processing database "postgres"

Attachment Content-Type Size
pg_partman_bgw.c text/x-csrc 18.7 KB
