From: | feichanghong <feichanghong(at)qq(dot)com> |
---|---|
To: | pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Cc: | andres <andres(at)anarazel(dot)de>, sawada(dot)mshk <sawada(dot)mshk(at)gmail(dot)com>, horikyota(dot)ntt <horikyota(dot)ntt(at)gmail(dot)com> |
Subject: | ReplicationSlotRelease may set the statusFlags of other processes in PG14 |
Date: | 2024-03-16 14:29:03 |
Message-ID: | tencent_CA7420C82971BC4B64F0748A7D2898A4720A@qq.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
Hello hackers,
I encountered an issue where an assertion failure occurred for
ProcGlobal->statusFlags. The detailed analysis is as follows:
# Issue background
PostgreSQL version: REL_14_STABLE v14.11 3621ffd9f2
The assert failure occurs when the autovacuum worker process commits a
transaction: proc->statusFlags is 3, and ProcGlobal->statusFlags[proc->pgxactoff]
is 0. The specific stack trace is as follows:
```
#0 0x00007f174c401277 in raise () from /lib64/libc.so.6
#1 0x00007f174c402968 in abort () from /lib64/libc.so.6
#2 0x00000000010f1403 in ExceptionalCondition (conditionName=0x1728760 "proc->statusFlags == ProcGlobal->statusFlags[proc->pgxactoff]", errorType=0x1728420 "FailedAssertion", fileName=0x17284b8 "procarray.c", lineNumber=776) at assert.c:44
#3 0x0000000000e997d0 in ProcArrayEndTransaction (proc=0x7f1731630f00, latestXid=0) at procarray.c:776
#4 0x00000000008cfb18 in CommitTransaction () at xact.c:2447
#5 0x00000000008d090e in CommitTransactionCommand () at xact.c:3293
#6 0x0000000000bb5652 in vacuum_rel (relid=2608, relation=0x56f76f8, params=0x56e399c) at vacuum.c:2174
#7 0x0000000000bb300e in vacuum (relations=0x5535b40, params=0x56e399c, bstrategy=0x56e3a30, isTopLevel=true) at vacuum.c:486
#8 0x0000000000dc01a9 in autovacuum_do_vac_analyze (tab=0x56e3998, bstrategy=0x56e3a30) at autovacuum.c:3467
#9 0x0000000000dbeb94 in do_autovacuum () at autovacuum.c:2632
#10 0x0000000000dbd530 in AutoVacWorkerMain (argc=0, argv=0x0) at autovacuum.c:1769
#11 0x0000000000dbd12a in StartAutoVacWorker () at autovacuum.c:1545
#12 0x0000000000dd894f in StartAutovacuumWorker () at postmaster.c:6441
#13 0x0000000000dd8189 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:6108
#14 <signal handler called>
#15 0x00007f174c4bffc3 in __select_nocancel () from /lib64/libc.so.6
#16 0x0000000000dd1d67 in ServerLoop () at postmaster.c:1881
#17 0x0000000000dd16d9 in PostmasterMain (argc=3, argv=0x533e060) at postmaster.c:1582
#18 0x0000000000ca4af5 in main (argc=3, argv=0x533e060) at main.c:216
```
# Analysis and Reproduction
## Analysis
A process utilizing replication slots (usually walsender) calls callback
functions in the order of RemoveProcFromArray->ProcKill upon abnormal exit.
Within RemoveProcFromArray, MyProc is already removed from the ProcArray.
ProcKill then attempts to set ProcGlobal->statusFlags[MyProc->pgxactoff] again
via ReplicationSlotRelease. By this time, the flag may already be assigned to
another process.
## Reproduction
This issue only occurs in PG14. In versions prior to PG14, ProcGlobal->statusFlags
had not been refactored. In versions following PG14, it has been fixed by commit
2f6501f (discussed in [1]). Therefore, I've cc'd Masahiko, Kyotaro, and Andres in
hopes that they can provide some insights.
To replicate the issue, execute the following steps:
1. Apply the attached v1-0000-v14-invalidate-pgxactoff-after-remove-pgproc.patch,
where pgxactoff is set to an invalid value in ProcArrayRemove, and some
checks are added.
2. Use the SQL below to terminate the walsender process.
```
select pg_terminate_backend(pid) from pg_stat_activity where backend_type = 'walsender';
```
# Fix
To fix the issue, I have provided some patches in the attachment:
1. Backpatching 2f6501f into the PG14 version will fix the problem.
2. In PG14-head, ProcArrayRemove needs to reset pgxactoff, and some assert
checks should be done when setting ProcGlobal->statusFlags.
[1] https://www.postgresql.org/message-id/29273AF3-9684-4069-9257-D05090B03B99@amazon.com
Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.
Attachment | Content-Type | Size |
---|---|---|
v1-0000-head-invalidate-pgxactoff-after-remove-pgproc.patch | application/octet-stream | 5.4 KB |
v1-0000-v14-backpatch-2f6501f.patch | application/octet-stream | 5.2 KB |
v1-0000-v14-invalidate-pgxactoff-after-remove-pgproc.patch | application/octet-stream | 4.9 KB |
v1-0000-v15-v16-invalidate-pgxactoff-after-remove-pgproc.patch | application/octet-stream | 5.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Wolfgang Walther | 2024-03-16 15:55:49 | Re: Regression tests fail with musl libc because libpq.so can't be loaded |
Previous Message | Tom Lane | 2024-03-16 14:24:43 | Re: Regression tests fail with musl libc because libpq.so can't be loaded |