Re: Undetected Deadlock

From: Michael Harris <harmic(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Undetected Deadlock
Date: 2022-02-04 01:08:58
Message-ID: CADofcAVWMdOfyND0Mf6agsr9mOtjy+Fe=yNGMRdnSP1NYZx6qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

> If Michael's analysis were accurate, I'd agree that there is a robustness
> issue, but I don't think there is. See timeout.c:220:

Actually that only sets a new timer after the nearest timeout has expired.

The pattern I was seeing went like this:

1. Command occurs during which a signal was not delivered (due to our
custom function). signal_pending is set and never cleared.

2. This command finishes normally, so the deadlock timeout is removed.
As long as there are either no more active timeouts, or the nearest
active timeout is still in the future, then no interval timer is set.

3. A later command starts, and again because the new timeout is in the
future, no new interval timer is set.

4. This transaction gets into a deadlock, but because the interval
timer is not running, ALRM is never received so the process is stuck.

In fact once a backend gets into this state it is permanently stuck
like this because new calls to schedule timeouts are always scheduling
them for a future time.

I was thinking that to improve robustness, we could add a check for
`now < signal_due_at` to schedule_alarm line 300:

if (signal_pending && now < signal_due_at && nearest_timeout
>= signal_due_at)
return;

That way, even if we think we are due for a signal, if we are already
past the time we were expecting it we go ahead and set a new one
anyway.

Even though the fault in this case was a faulty custom function, there
could be other scenarios in which a signal isn't delivered (that bit
of code Tom highlighted is specifically to cater for that, going by
the comments).

One other thing that struck me when reading this code: the variable
signal_due_at is not declared as volatile sig_atomic_t, even though it
is read from / written to by the signal handler. Maybe that could
cause problems?

One other question: I've fixed our custom function, so that it
correctly restores any interval timers that were running, but of
course if our function takes longer than the remaining time on the
postgres timer the signal will be delivered late. Beyond the fact that
a deadlock or statement timeout will take longer than expected, are
there any other negative consequences? Are any of the timeouts
time-critical such that being delayed by a few seconds would cause a
problem?

Thanks for any feedback.

Regards
Mike

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message aditya desai 2022-02-04 03:50:45 Re: Increase fetch fize of oracl_fdw(ALTER SERVER)
Previous Message Abhishek Bhola 2022-02-03 23:02:31 Re: Subscription stuck at initialize state