RE: Forget close an open relation in ReorderBufferProcessTXN()

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Amit Langote' <amitlangote09(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Japin Li <japinli(at)hotmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Forget close an open relation in ReorderBufferProcessTXN()
Date: 2021-05-21 12:44:31
Message-ID: OSBPR01MB48887473DB79D2B29A74CA41ED299@OSBPR01MB4888.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Friday, May 21, 2021 4:43 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> On Fri, May 21, 2021 at 3:55 PM osumi(dot)takamichi(at)fujitsu(dot)com
> <osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
> > But, I've detected segmentation faults caused by the patch, which can
> > happen during 100_bugs.pl in src/test/subscription.
> > This happens more than one in ten times.
> >
> > This problem would be a timing issue and has been introduced by v3
> already.
> > I used v5 for HEAD also and reproduced this failure, while OSS HEAD
> > doesn't reproduce this, even when I executed 100_bugs.pl 200 times in a
> tight loop.
> > I aligned the commit id 4f586fe2 for all check. Below logs are ones I got from
> v3.
> >
> > My first guess of the cause is that between the timing to get an entry
> > from hash_search() in get_rel_sync_entry() and to set the map by
> > convert_tuples_by_name() in maybe_send_schema(), we had invalidation
> message, which tries to free unset descs in the entry ?
>
> Hmm, maybe get_rel_syn_entry() should explicitly set map to NULL when first
> initializing an entry. It's possible that without doing so, the map remains set
> to a garbage value, which causes the invalidation callback that runs into such
> partially initialized entry to segfault upon trying to deference that garbage
> pointer.
Just in case, I prepared a new PG and
did a check to make get_rel_sync_entry() print its first pointer value with elog.
Here, when I executed 100_bugs.pl, I got some garbage below.

* The change I did:
@@ -1011,6 +1011,7 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
entry->pubactions.pubinsert = entry->pubactions.pubupdate =
entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
entry->publish_as_relid = InvalidOid;
+ elog(LOG, "**> the pointer's default value : %p", entry->map);
}

* Grep result of all logs from 100_bugs.pl
2021-05-21 09:05:56.132 UTC [29122] sub1 LOG: **> the pointer's default value : (nil)
2021-05-21 09:06:11.556 UTC [30198] testsub1 LOG: **> the pointer's default value : (nil)
2021-05-21 09:06:11.561 UTC [30200] pg_16389_sync_16384_6964667281140237667 LOG: **> the pointer's default value : 0x7f7f7f7f7f7f7f7f
2021-05-21 09:06:11.567 UTC [30191] testsub2 LOG: **> the pointer's default value : (nil)
2021-05-21 09:06:11.570 UTC [30194] pg_16387_sync_16384_6964667292923737489 LOG: **> the pointer's default value : 0x7f7f7f7f7f7f7f7f
2021-05-21 09:06:02.513 UTC [29809] testsub LOG: **> the pointer's default value : (nil)
2021-05-21 09:06:02.557 UTC [29809] testsub LOG: **> the pointer's default value : (nil)

So, your solution is right, I think.

> I've tried that in the attached v6 patches. Please check.
With this fix, I don't get the failure.
I executed 100_bugs.pl 100 times in a loop and didn't face that problem.

Again, I conducted one make check-world for each combination
* use OSS HEAD or PG13
* apply only the first patch or both two patches
Those all passed.

Best Regards,
Takamichi Osumi

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Guillaume Lelarge 2021-05-21 13:17:43 Re: "Multiple table synchronizations are processed serially" still happens
Previous Message Andrew Dunstan 2021-05-21 12:38:50 Re: Installation of regress.so?