Re: BUG #18815: Logical replication worker Segmentation fault

From: Sergey Belyashov <sergey(dot)belyashov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18815: Logical replication worker Segmentation fault
Date: 2025-02-18 06:56:56
Message-ID: CAOe0RDwUeZduRUcD1N=BcAk5z3ANPpdyZtr4qNjiY6fPQu=sDw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Hi,

Do I need to apply this patch for debugging purposes?

In the meantime, I want to drop the BRIN indexes from the active
partitions and start replication. Once the issue is fixed, I will
re-create the BRIN indexes.

Best regards,
Sergey Belyashov

On Tue, 18 Feb 2025 at 02:37, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I wrote:
> > Further to this ... I'd still really like to have a reproducer.
> > While brininsertcleanup is clearly being less robust than it should
> > be, I now suspect that there is another bug somewhere further down
> > the call stack. We're getting to this point via ExecCloseIndices,
> > and that should be paired with ExecOpenIndices, and that would have
> > created a fresh IndexInfo. So it looks a lot like some path in a
> > logrep worker is able to call ExecCloseIndices twice on the same
> > working data. That would probably lead to a "releasing a lock you
> > don't own" error if we weren't hitting this crash first.
>
> Hmm ... I tried modifying ExecCloseIndices to blow up if called
> twice, as in the attached. This gets through core regression
> just fine, but it blows up in three different subscription TAP
> tests, all with a stack trace matching Sergey's:
>
> #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1 0x00007f064bfe3e65 in __GI_abort () at abort.c:79
> #2 0x00000000009e9253 in ExceptionalCondition (
>     conditionName=conditionName@entry=0xb8717b "indexDescs[i] != NULL",
>     fileName=fileName@entry=0xb87139 "execIndexing.c",
>     lineNumber=lineNumber@entry=249) at assert.c:66
> #3 0x00000000006f0b13 in ExecCloseIndices (
>     resultRelInfo=resultRelInfo@entry=0x2f11c18) at execIndexing.c:249
> #4 0x00000000006f86d8 in ExecCleanupTupleRouting (mtstate=0x2ef92d8,
>     proute=0x2ef94e8) at execPartition.c:1273
> #5 0x0000000000848cb6 in finish_edata (edata=0x2ef8f50) at worker.c:717
> #6 0x000000000084d0a0 in apply_handle_insert (s=<optimized out>)
>     at worker.c:2460
> #7 apply_dispatch (s=<optimized out>) at worker.c:3389
> #8 0x000000000084e494 in LogicalRepApplyLoop (last_received=25066600)
>     at worker.c:3680
> #9 start_apply (origin_startpos=0) at worker.c:4507
> #10 0x000000000084e711 in run_apply_worker () at worker.c:4629
> #11 ApplyWorkerMain (main_arg=<optimized out>) at worker.c:4798
> #12 0x00000000008138f9 in BackgroundWorkerMain (startup_data=<optimized out>,
>     startup_data_len=<optimized out>) at bgworker.c:842
>
> The problem seems to be that apply_handle_insert_internal does
> ExecOpenIndices and then ExecCloseIndices, and then
> ExecCleanupTupleRouting does ExecCloseIndices again, which nicely
> explains why brininsertcleanup blows up if you happen to have a BRIN
> index involved. What it doesn't explain is how come we don't see
> other symptoms from the duplicate index_close calls, regardless of
> index type. I'd have expected an assertion failure from
> RelationDecrementReferenceCount, and/or an assertion failure for
> nonzero rd_refcnt at transaction end, and/or a "you don't own a lock
> of type X" gripe from LockRelease. We aren't getting any of those,
> but why not, if this code is as broken as I think it is?
>
> (On closer inspection, we seem to have about 99% broken relcache.c's
> ability to notice rd_refcnt being nonzero at transaction end, but
> the other two things should still be happening.)
>
> regards, tom lane
>
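
For readers following the quoted analysis, here is a toy model of the
open / close / close-again sequence it describes. This is not PostgreSQL
code: ToyIndex, ToyResultRel and every toy_* function are hypothetical
names made up for illustration, and the refcount field only loosely
mimics what rd_refcnt does. The guarded close clears each slot after
releasing it, which is presumably similar in spirit to the
"indexDescs[i] != NULL" check visible in the backtrace, but the actual
patch is not shown in this message, so treat that as an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_INDEXES 4

/* Hypothetical stand-in for an open index relation. */
typedef struct ToyIndex
{
    int refcount;               /* loosely analogous to rd_refcnt */
} ToyIndex;

/* Hypothetical stand-in for the per-relation index bookkeeping. */
typedef struct ToyResultRel
{
    int       num_indexes;
    ToyIndex *index_descs[TOY_MAX_INDEXES];
} ToyResultRel;

/* "Open" n indexes: take one reference on each and remember it. */
void
toy_open_indices(ToyResultRel *rel, ToyIndex *catalog, int n)
{
    assert(n >= 0 && n <= TOY_MAX_INDEXES);
    rel->num_indexes = n;
    for (int i = 0; i < n; i++)
    {
        catalog[i].refcount++;
        rel->index_descs[i] = &catalog[i];
    }
}

/*
 * Unguarded close: releases a reference but leaves the pointer in place,
 * so a second call silently releases a reference it never took.
 */
void
toy_close_indices_unguarded(ToyResultRel *rel)
{
    for (int i = 0; i < rel->num_indexes; i++)
        rel->index_descs[i]->refcount--;
}

/*
 * Guarded close: clears each slot after releasing it and reports how many
 * slots were already closed, i.e. how many double closes were attempted.
 */
int
toy_close_indices(ToyResultRel *rel)
{
    int already_closed = 0;

    for (int i = 0; i < rel->num_indexes; i++)
    {
        if (rel->index_descs[i] == NULL)
        {
            already_closed++;   /* a real assertion would abort here */
            continue;
        }
        rel->index_descs[i]->refcount--;
        rel->index_descs[i] = NULL;
    }
    return already_closed;
}

/* One open followed by two closes, as in the sequence described above. */
int
toy_demo(void)
{
    ToyIndex     catalog[2] = {{0}, {0}};
    ToyResultRel rel = {0};

    toy_open_indices(&rel, catalog, 2);
    toy_close_indices(&rel);            /* close on the insert path */
    return toy_close_indices(&rel);     /* close again during cleanup */
}
```

Calling toy_demo() from a main() of your own reports two already-closed
slots on the second pass. With the unguarded variant, the same sequence
leaves each refcount at -1 instead, which is the kind of imbalance the
quoted message expected other checks to catch.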
