From: "James L(dot) Morton" <jmorton(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows
Date: 2014-03-14 20:54:01
Message-ID: CALzMLzwMd6BqQTyyi9d00BbN7x7LZBRKNVbTpzvm0b9597=0-Q@mail.gmail.com
Lists: pgsql-bugs
Our authentication_timeout value is set to the default value (1 minute).
We've added monitoring for this particular problem and saw this happen
once more today. We noticed that this seems to correlate with a burst of
log messages from auditd:
Mar 14 18:06:29 XXXX kernel: printk: 2675 messages suppressed.
Mar 14 18:06:29 XXXX kernel: audit: audit_backlog=8193 >
audit_backlog_limit=8192
Mar 14 18:06:29 XXXX kernel: audit: audit_lost=958159 audit_rate_limit=0
audit_backlog_limit=8192
We looked back at old log files, and we see the same error also occurred
the previous times we've encountered this problem.
When auditd exhausts its backlog buffer, by default it falls back to logging
via printk(), which in turn gets picked up by syslog. Perhaps this burst of
messages is stalling syslog long enough to reach authentication_timeout?
That seems odd, though, given that we have our Postgres syslog facility set
to async.
Some searching shows that the PgPool project ran into a similar-looking
deadlock while calling syslog in a signal handler:
http://www.pgpool.net/mantisbt/view.php?id=9
Thanks,
James Morton
On Thu, Mar 13, 2014 at 12:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> jmorton(at)gmail(dot)com writes:
> > (gdb) bt
> > #0 0x00000030820e104e in __lll_lock_wait_private () from
> /lib64/libc.so.6
> > #1 0x00000030820d1bf7 in _L_lock_646 () from /lib64/libc.so.6
> > #2 0x00000030820d168a in __vsyslog_chk () from /lib64/libc.so.6
> > #3 0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
> > #4 0x00000000006dd6f1 in EmitErrorReport ()
> > #5 0x00000000006ddaf5 in errfinish ()
> > #6 0x0000000000623f6a in ProcessInterrupts ()
> > #7 0x000000000062439c in StatementCancelHandler ()
> > #8 <signal handler called>
> > #9 0x00000030820306f7 in kill () from /lib64/libc.so.6
> > #10 0x0000000000616d01 in ?? ()
> > #11 0x0000000000616e3f in handle_sig_alarm ()
> > #12 <signal handler called>
> > #13 0x00000030820d5eb5 in send () from /lib64/libc.so.6
> > #14 0x00000030820d1750 in __vsyslog_chk () from /lib64/libc.so.6
> > #15 0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
> > #16 0x00000000006dd6f1 in EmitErrorReport ()
> > #17 0x00000000006ddaf5 in errfinish ()
> > #18 0x0000000000583c57 in ?? ()
> > #19 0x0000000000583c81 in ?? ()
> > #20 0x00000000005858af in ClientAuthentication ()
> > #21 0x00000000006e9bb6 in InitPostgres ()
> > #22 0x000000000062609d in PostgresMain ()
> > #23 0x00000000005e9dfc in ?? ()
> > #24 0x00000000005eab81 in PostmasterMain ()
> > #25 0x000000000058e38e in main ()
>
> Interesting stack trace. Evidently, syslog() on your platform has some
> internal mutex. What seems to have happened is that SIGALRM interrupted
> an attempt to log a client-authentication-related message, and then the
> error message that it attempted to print resulted in recursive entry to
> syslog(), causing a hang since the outer invocation already had the mutex.
>
> This pathway would be impossible if the error level were ERROR, because
> errfinish() clears ImmediateInterruptOK in that case, which would prevent
> the SIGALRM interrupt handler from doing anything much. A likely bet
> though is that the error level in the outer call was FATAL (ie,
> authentication failure); or it might've been just a LOG message.
>
> I'm thinking the most robust solution to this is to have errfinish()
> do this at entry:
>
> bool save_ImmediateInterruptOK = ImmediateInterruptOK;
> ImmediateInterruptOK = false;
>
> and then, *if* it decides to return to caller, do this just before
> returning:
>
> ImmediateInterruptOK = save_ImmediateInterruptOK;
> if (save_ImmediateInterruptOK)
> CHECK_FOR_INTERRUPTS();
>
> This should make the error reporting logic safe against interrupts even
> if invoked while ImmediateInterruptOK is true; which probably shouldn't
> happen during normal backend operation but is definitely possible during
> client authentication.
>
> BTW, are you using an unusually short value for authentication_timeout?
> This case doesn't seem likely to arise with normal values, since the
> timeout isn't expected to occur unless something's totally wedged.
>
> regards, tom lane
>