From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Sergey Belyashov <sergey(dot)belyashov(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | RE: BUG #18815: Logical replication worker Segmentation fault |
Date: | 2025-03-26 05:07:47 |
Message-ID: | TYAPR01MB57244ADA33DDA57119B9D26494A62@TYAPR01MB5724.jpnprd01.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs pgsql-hackers |
On Wed, Feb 19, 2025 at 4:38 AM Tom Lane wrote:
Hi,
>
> Here's a completed set of patches for bug #18815 [1] (which, briefly,
> is that the logrep worker opens/closes indexes multiple times and
> thereby breaks brininsertcleanup).
>
> 0001 adds a subscriber-side BRIN index to 013_partition.pl, which
> replicates the crash Sergey reported. I've got mixed feelings about
> whether to actually commit this: it's certainly useful for testing
> this fix, but without context it looks like a pretty random thing to
> do. It could be justified perhaps on the grounds of testing
> aminsertcleanup callbacks within replication, since BRIN is the only core index type that has such a callback.
>
> 0002 fixes the crash by hardening brininsertcleanup against multiple
> calls. I think that this is patching a symptom not the root cause,
> but it still seems like good defensive programming.
>
> 0003 adds Asserts to ExecOpenIndices and ExecCloseIndices to check
> that they're not called more than once per ResultRelInfo, and thereby
> exposes what I consider the root cause: apply_handle_tuple_routing
> opens the indexes twice and then closes them twice. This somewhat
> accidentally leaves us with zero refcounts and zero locks on the
> indexes, so in the absence of aminsertcleanup callbacks the only real
> reason to complain is the duplicative construction of the IndexInfo
> structures. But the double call of brininsertcleanup breaks the
> existing coding of that function, and it might well break third-party
> aminsertcleanup callbacks if there are any. So I think we should
> institute a policy that this coding pattern is forbidden.
>
> And finally, 0004 fixes worker.c to not do that. This turned out
> simpler than I thought, because I was misled by believing that the
> ExecOpenIndices/ExecCloseIndices calls in apply_handle_tuple_routing
> itself do something useful. They don't, and should be nuked outright.
My colleague Nisha reported off-list about a crash in logical replication that
occurs during unique constraint violations on leaf partitions. Upon
investigation, I confirmed that this crash started happening after commit
9ff6867.
The problem arises because unique key conflict detection relied on the
ExecOpenIndices call in apply_handle_insert_internal and
apply_handle_tuple_routing with the speculative parameter set to true to
construct necessary index information. However, these openings were redundant,
as indexes had already been opened during target partition searches via the
parent table (e.g., using ExecFindPartition). Hence, they were removed in
commit 9ff6867. Unfortunately, ExecFindPartition opens indexes without
constructing the needed index information for conflict detection, which leads
to the crash.
To fix it, I tried to add a detect_conflict flag in ModifyTableState, enabling
ExecFindPartition() to internally build the required index information in this
case, like the attachment.
An alternative approach may be to delay index information initialization until
immediately before executing the actual update or insert. E.g., do that in
InitConflictIndexes(). This approach can also avoid unnecessary construction of
index information when the tuple to be updated is not found, or during a
cross-partition update where index information for the source partition is not
required. However, it introduces an additional location for index
initialization, potentially increasing maintenance efforts.
Best Regards,
Hou zj
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Fix-crashes-in-logical-replication-during-unique-.patch | application/octet-stream | 5.6 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | PG Bug reporting form | 2025-03-26 10:13:59 | BUG #18867: /src/interfaces/ecpg/preproc/descriptor.c usage of ECPGdump_a_type |
Previous Message | Euler Taveira | 2025-03-26 01:55:52 | Re: BUG #18866: Running pg_freespace() on views triggers an Abort |
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2025-03-26 05:10:05 | Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided. |
Previous Message | jian he | 2025-03-26 05:01:18 | ALTER COLUMN SET DATA TYPE does not change the generation expression's collation |