Re: BUG #18815: Logical replication worker Segmentation fault

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Sergey Belyashov <sergey(dot)belyashov(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>
Subject: Re: BUG #18815: Logical replication worker Segmentation fault
Date: 2025-02-18 20:38:34
Message-ID: 1761873.1739911114@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

[ redirecting to -hackers for patch discussion ]

Here's a completed set of patches for bug #18815 [1] (which,
briefly, is that the logrep worker opens/closes indexes
multiple times and thereby breaks brininsertcleanup).

0001 adds a subscriber-side BRIN index to 013_partition.pl, which
replicates the crash Sergey reported. I've got mixed feelings about
whether to actually commit this: it's certainly useful for testing
this fix, but without context it looks like a pretty random thing to
do. It could be justified perhaps on the grounds of testing
aminsertcleanup callbacks within replication, since BRIN is the
only core index type that has such a callback.

0002 fixes the crash by hardening brininsertcleanup against multiple
calls. I think that this is patching a symptom not the root cause,
but it still seems like good defensive programming.

0003 adds Asserts to ExecOpenIndices and ExecCloseIndices to check
that they're not called more than once per ResultRelInfo, and thereby
exposes what I consider the root cause: apply_handle_tuple_routing
opens the indexes twice and then closes them twice. This somewhat
accidentally leaves us with zero refcounts and zero locks on the
indexes, so in the absence of aminsertcleanup callbacks the only real
reason to complain is the duplicative construction of the IndexInfo
structures. But the double call of brininsertcleanup breaks the
existing coding of that function, and it might well break third-party
aminsertcleanup callbacks if there are any. So I think we should
institute a policy that this coding pattern is forbidden.

And finally, 0004 fixes worker.c to not do that. This turned out
simpler than I thought, because I was misled by believing that the
ExecOpenIndices/ExecCloseIndices calls in apply_handle_tuple_routing
itself do something useful. They don't, and should be nuked outright.

I don't intend 0003 for back-patch, but the rest of this has to go
back as far as the bug can be observed, which I didn't test yet.
To be clear, 0004 would fix the issue even without 0002, but
I still think we should back-patch both.

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/18815-2a0407cc7f40b327%40postgresql.org

Attachment Content-Type Size
v1-0001-Demonstrate-crash-in-brininsertcleanup-during-log.patch text/x-diff 1.3 KB
v1-0002-Harden-brininsertcleanup-against-repeat-calls.patch text/x-diff 1.7 KB
v1-0003-Assert-that-ExecOpenIndices-and-ExecCloseIndices-.patch text/x-diff 2.5 KB
v1-0004-Avoid-duplicate-ExecOpenIndices-ExecCloseIndices-.patch text/x-diff 4.1 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Bruce Momjian 2025-02-18 20:52:08 Re: Inconsistency of timezones in postgresql
Previous Message Sergey Belyashov 2025-02-18 17:41:55 Re: BUG #18815: Logical replication worker Segmentation fault

Browse pgsql-hackers by date

  From Date Subject
Next Message Rafael Thofehrn Castro 2025-02-18 20:47:58 Re: Proposal: Progressive explain
Previous Message Zsolt Parragi 2025-02-18 20:02:13 Re: Improve pgindent exclude handling: ignore empty lines