Re: Undetected Deadlock

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Harris <harmic(at)gmail(dot)com>
Cc: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Undetected Deadlock
Date: 2022-02-06 22:57:10
Message-ID: 794618.1644188230@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Michael Harris <harmic(at)gmail(dot)com> writes:
>> If Michael's analysis were accurate, I'd agree that there is a robustness
>> issue, but I don't think there is. See timeout.c:220:

> Actually that only sets a new timer after the nearest timeout has expired.

Mmm, yeah, you're right: as long as we keep on canceling timeout
requests before they are reached, this code won't notice a problem.
We need to check against signal_due_at, not only the next timeout
request, more or less as attached. Do you want to try this and see
if it actually adds any robustness with your buggy code?

> I was thinking that to improve robustness, we could add a check for
> `now < signal_due_at` to schedule_alarm line 300:

> if (signal_pending && now < signal_due_at && nearest_timeout >= signal_due_at)
> return;

I don't think that merging this issue into that test is appropriate;
the logic is already complicated and hard to explain. Also, it seems
to me that some slop is needed, since as mentioned nearby, the kernel's
opinion of when to fire the interrupt is likely to be a bit later
than ours. I'm not wedded to the 10ms slop proposed below, but it
seems like it's probably in the right ballpark.

> One other thing that struck me when reading this code: the variable
> signal_due_at is not declared as volatile sig_atomic_t, even though it
> is read from / written to by the signal handler. Maybe that could
> cause problems?

Hm. I'm not really convinced there's any issue, since signal_due_at
is only examined when signal_pending is true. Still, it's probably
better for all these variables to be volatile, so done.

> One other question: I've fixed our custom function, so that it
> correctly restores any interval timers that were running, but of
> course if our function takes longer than the remaining time on the
> postgres timer the signal will be delivered late. Beyond the fact that
> a deadlock or statement timeout will take longer than expected, are
> there any other negative consequences? Are any of the timeouts
> time-critical such that being delayed by a few seconds would cause a
> problem?

They're certainly not supposed to be that time-critical; any code that
is expecting such a thing is going to have lots of problems already.
PG is not a hard-real-time system ;-)

regards, tom lane

Attachment Content-Type Size
recover-from-lost-interrupt-1.patch text/x-diff 1.8 KB

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Aaron Sipser 2022-02-08 16:51:09 Locks on FK Tables From Partitioning
Previous Message Ron 2022-02-05 15:14:02 Re: Re. Backup of postgresql database