Re: LWLock deadlock in brinRevmapDesummarizeRange

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: LWLock deadlock in brinRevmapDesummarizeRange
Date: 2023-02-22 11:35:32
Message-ID: 20230222113532.omcjtw2jh6lztu72@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2023-Feb-22, Tomas Vondra wrote:

> But instead of I almost immediately ran into a LWLock deadlock :-(

Ouch.

> I've managed to reproduce this on PG13+, but I believe it's there since
> the brinRevmapDesummarizeRange was introduced in PG10. I just haven't
> tried on pre-13 releases.

Hmm, I think that might just be an "easy" way to hit it, but the problem
is actually older than that, since AFAICS brin_doupdate is careless
regarding locking order of revmap page vs. regular page.

Sadly, the README doesn't cover locking considerations. I had that in a
file called 'minmax-proposal' in version 16 of the patch here
https://postgr.es/m/20140820225133.GB6343@eldon.alvh.no-ip.org
but by version 18 (when 'minmax' became BRIN) I seem to have removed
that file and replaced it with the README and apparently I didn't copy
this material over.

... and in there, I wrote that we would first write the brin tuple in
the regular page, unlock that, and then lock the revmap for the update,
without holding lock on the data page. I don't remember why we do it
differently now, but maybe the fix is just to release the regular page
lock before locking the revmap page? One very important change is that
in previous versions the revmap used a separate fork, and we had to
introduce an "evacuation protocol" when we integrated the revmap into
the main fork, which may have changed the locking considerations.

Another point: to desummarize a range, just unlinking the entry from
revmap should suffice, from the POV of other index scanners. Maybe we
can simplify the whole procedure to: lock revmap, remove entry, remember
page number, unlock revmap; lock regular page, delete entry, unlock.
Then there are no two locks held at the same time during desummarize.

This comes from v16:

+ Locking considerations
+ ----------------------
+
+ To read the TID during an index scan, we follow this protocol:
+
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation). A shared lock is
+ sufficient. We need the LockTuple to prevent VACUUM from recycling
+ the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
+
+
+ To update the summary tuple for a page range, we use this protocol:
+
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here. Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+ [vacuum stuff elided]

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2023-02-22 11:47:34 unsafe_tests module
Previous Message Dean Rasheed 2023-02-22 11:01:51 Re: Assert failure with MERGE into partitioned table with RLS