RE: "unexpected duplicate for tablespace" problem in logical replication

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: "wangsh(dot)fnst(at)fujitsu(dot)com" <wangsh(dot)fnst(at)fujitsu(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: "unexpected duplicate for tablespace" problem in logical replication
Date: 2022-04-08 09:44:07
Message-ID: TYCPR01MB83731ADE7FD7C7CF5D335BCEEDE99@TYCPR01MB8373.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wednesday, April 6, 2022 11:14 AM wangsh(dot)fnst(at)fujitsu(dot)com <wangsh(dot)fnst(at)fujitsu(dot)com> wrote:
> I met a problem while using logical replication in PG11 and I think all the PG
> version have this problem.
>
>
> The log looks like:
> > ERROR: unexpected duplicate for tablespace 0, relfilenode xxxxxxx
> Someone also reported this problem in [1], but no one has responded to it.
>
>
>
> I did some investigation, and found a way to reproduce this problem. The
> steps are:
>
>
> 1. create a table (call it tableX) and truncate it.
>
>
> 2. cycle through 2^32 OIDs.
>
>
> 3. restart the database to clear all the cache.
>
>
> 4. create a temp table which make the temp table's OID equals to the
> tableX's relfilenode and insert any data into tableX.
>
>
> The attachment(run.sh) can reproduce this problem in PG10 and PG11with
> the help of option 'WITH OIDS'. I don't find any way to cycle the OIDs
> quickly in branch master, but I use the gdb to reproduce this problem too.
>
>
>
> Now, function GetNewRelFileNode() only checks:
>
>
> 1. duplicated OIDs in pg_class.
>
>
> 2. relpath(rnode) is exists in disk.
>
>
> However, the result of relpath(temp table) and relpath(non-temp table)
> are different, temp table's relpath() has a prefix "t%d". That means, if
> there is a table that value of relfilenode is 20000(but the value of oid
> isn't 20000), it's possible to create a temp table that value of
> relfilenode is also 20000.
>
>
> I think function GetNewRelFileNode() should always check the duplicated
> relfilenode, see the patch(a simple to way to fix this problem is master
> branch).
>
>
> Any comment?
Hi, thank you for your report.

It seems correct that there's room that wraparounded oid can be used for temp table,
and we get duplicate result when we retrieve it and face the error.

I reproduced your issue with HEAD and gdb, by replacing rnode.node.relNode with
an existing relfilenode in GetNewRelFileNode(), immediately before the call of relpath().

At the same time, I also confirmed that with your patch, I had another iteration
call of GetNewOidWithIndex() and got different oid, even if I inserted an existing
relfilenode in the same way. I didn't get the same error with your patch.

Below are my several review comments on your patch.

(1) Name of isRelNodeCollision

How about changing it to "IsCollidedRelNode" ?

When I see around of this code, there are functions
named as "IsReservedName", "IsSharedRelation" or "IsPinnedObject".
So, my suggestion would be more aligned.

(2) Remove curly brackets for one sentence

+ if (pg_class != NULL)
+ {
+ usedAsOID = true;
+ }
+ else
+ {
+ pg_class = table_open(RelationRelationId, AccessShareLock);
+ usedAsOID = false;
+ }
+

Please remove curly brackets for one sentence.

Also, I suggest you add a comment something like
"Ensure we have opened pg_class in this function"
at the top of this check for clarification.

(3) variables of isRelNodeCollision

+ SysScanDesc scandesc;
+ static ScanKeyData skey[2];

We can move those declaration into the inside of
the conditional branch for scankey.

(4) Having Assert in isRelNodeCollision

How about add an Assert to check the argument
relation is not NULL ?

(5) argument name of isRelNodeCollision

I suggest you change it to pg_class.

Best Regards,
Takamichi Osumi

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message osumi.takamichi@fujitsu.com 2022-04-11 07:40:11 RE: "unexpected duplicate for tablespace" problem in logical replication
Previous Message PG Bug reporting form 2022-04-06 21:08:44 BUG #17457: pg_rewind fatal failure when $PGDATA/log is a soft-link