From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Subject: | RE: Perform streaming logical transactions by background workers and parallel apply |
Date: | 2023-04-26 10:41:22 |
Message-ID: | OS0PR01MB57164DF9FC5366024A1952D594659@OS0PR01MB5716.jpnprd01.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wednesday, April 26, 2023 5:00 PM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> Please look at a new anomaly that can be observed starting from 216a7848.
>
> The following script:
> echo "CREATE SUBSCRIPTION testsub CONNECTION 'dbname=nodb'
> PUBLICATION testpub WITH (connect = false);
> ALTER SUBSCRIPTION testsub ENABLE;" | psql
>
> sleep 1
> rm $PGINST/lib/libpqwalreceiver.so
> sleep 15
> pg_ctl -D "$PGDB" stop -m immediate
> grep 'TRAP:' server.log
>
> Leads to multiple assertion failures:
> CREATE SUBSCRIPTION
> ALTER SUBSCRIPTION
> waiting for server to shut down.... done
> server stopped
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899323
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899416
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899427
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899439
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899538
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899547
>
> server.log contains:
> 2023-04-26 11:00:58.797 MSK [2899300] LOG: database system is ready to
> accept connections
> 2023-04-26 11:00:58.821 MSK [2899416] ERROR: could not access file
> "libpqwalreceiver": No such file or directory
> TRAP: failed Assert("MyProc->backendId != InvalidBackendId"), File: "lock.c",
> Line: 4439, PID: 2899416
> postgres: logical replication apply worker for subscription 16385
> (ExceptionalCondition+0x69)[0x558b2ac06d41]
> postgres: logical replication apply worker for subscription 16385
> (VirtualXactLockTableCleanup+0xa4)[0x558b2aa9fd74]
> postgres: logical replication apply worker for subscription 16385
> (LockReleaseAll+0xbb)[0x558b2aa9fe7d]
> postgres: logical replication apply worker for subscription 16385
> (+0x4588c6)[0x558b2aa2a8c6]
> postgres: logical replication apply worker for subscription 16385
> (shmem_exit+0x6c)[0x558b2aa87eb1]
> postgres: logical replication apply worker for subscription 16385
> (+0x4b5faa)[0x558b2aa87faa]
> postgres: logical replication apply worker for subscription 16385
> (proc_exit+0xc)[0x558b2aa88031]
> postgres: logical replication apply worker for subscription 16385
> (StartBackgroundWorker+0x147)[0x558b2aa0b4d9]
> postgres: logical replication apply worker for subscription 16385
> (+0x43fdc1)[0x558b2aa11dc1]
> postgres: logical replication apply worker for subscription 16385
> (+0x43ff3d)[0x558b2aa11f3d]
> postgres: logical replication apply worker for subscription 16385
> (+0x440866)[0x558b2aa12866]
> postgres: logical replication apply worker for subscription 16385
> (+0x440e12)[0x558b2aa12e12]
> postgres: logical replication apply worker for subscription 16385
> (BackgroundWorkerInitializeConnection+0x0)[0x558b2aa14396]
> postgres: logical replication apply worker for subscription 16385
> (main+0x21a)[0x558b2a932e21]
>
> I understand, that removing libpqwalreceiver.so (or whole pginst/) is not
> what happens in a production environment every day, but nonetheless it's a
> new failure mode and it can produce many coredumps when testing.
>
> IIUC, that assert will fail in case of any error raised between
> ApplyWorkerMain()->logicalrep_worker_attach()->before_shmem_exit() and
> ApplyWorkerMain()->InitializeApplyWorker()->BackgroundWorkerInitializeC
> onnectionByOid()->InitPostgres().
Thanks for reporting the issue.
I think the problem is that it tried to release locks in
logicalrep_worker_onexit() before the initialization of the process is complete
because this callback function was registered before the init phase. So I think we
can add a conditional statement before releasing locks. Please find an attached
patch.
Best Regards,
Hou zj
Attachment | Content-Type | Size |
---|---|---|
0001-fix-assert-failure-in-logical-replication-apply-work.patch | application/octet-stream | 1.3 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Daniel Gustafsson | 2023-04-26 11:02:26 | Re: Find dangling membership roles in pg_dumpall |
Previous Message | Alvaro Herrera | 2023-04-26 10:27:51 | Re: Add two missing tests in 035_standby_logical_decoding.pl |