Re: SV: Problem with pg_notify / listen

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Gustavsson Mikael <mikael(dot)gustavsson(at)smhi(dot)se>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Svensson Peter <peter(dot)svensson(at)smhi(dot)se>, Almen Anders <anders(dot)almen(at)smhi(dot)se>
Subject: Re: SV: Problem with pg_notify / listen
Date: 2020-11-27 22:13:28
Message-ID: 730247.1606515208@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-general

I wrote:
> * Change the code back to being atomic, ie go ahead and update
> QUEUE_TAIL immediately, and truncate the SLRU only afterward.
> Why is it necessary, or even safe, to perform the SLRU truncation
> before changing QUEUE_TAIL? (IOW, I wonder if there isn't a bug
> there in HEAD too.)

After thinking more about that, I'm pretty sure there is a bug there:
a newly-arriving backend could attempt to scan the queue starting at
QUEUE_TAIL, concurrently with SimpleLruTruncate removing the page(s)
it wants to scan. In typical cases no failure will occur because
Exec_ListenPreCommit could advance its queue pointer to a safe place
by observing the pointers of other backends. However, if the new
listener were the only one in its database, we'd have trouble.

What seems like the most expedient way to fix this is to separate
QUEUE_TAIL into two variables, one that is the "logical" tail
position from which new backends may start to scan the queue,
and one that is the "physical" tail position, ie, the oldest
page we have not yet truncated. The physical tail need only be
tracked to page accuracy, so it can be a plain int not a QUEUE_POS.
asyncQueueAdvanceTail should update QUEUE_TAIL immediately, which
restores correct behavior pre-v13 and also dodges the race condition
described above. But we don't update the physical tail pointer
until we've completed SimpleLruTruncate, keeping things honest
with respect to asyncQueueIsFull.

As a minor side benefit, asyncQueueAdvanceTail doesn't have to take
the NotifyQueueLock twice unless it actually does an SLRU truncation.

In short, I propose the attached patch (against HEAD, but the
same logic changes should work in the back branches).

regards, tom lane

Attachment Content-Type Size
fix-notify-race-conditions.patch text/x-diff 4.8 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Noah Misch 2020-11-28 03:59:06 Re: SV: Problem with pg_notify / listen
Previous Message Tom Lane 2020-11-27 19:07:42 Re: BUG #16749: EXPLAIN only shows filtering from the first query in a union chain when using distinct on.

Browse pgsql-general by date

  From Date Subject
Next Message Noah Misch 2020-11-28 03:59:06 Re: SV: Problem with pg_notify / listen
Previous Message David G. Johnston 2020-11-27 21:25:25 Re: postgres_fdw insert extremely slow