From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18509: Logical decoding behaves badly when processing a change record for a table with altered column |
Date: | 2024-07-07 11:00:00 |
Message-ID: | be69a11d-b7b5-9ee8-221a-fdf34898bb6d@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
14.06.2024 21:00, PG Bug reporting form wrote:
> The following script:
> ...
>
> produces errors like:
> 2024-06-14 17:10:58.349 UTC|law|regression|666c79a2.3fe3a4|ERROR: could not
> map filenumber "base/16384/16453" to relation OID
I've simplified that script to the following:
psql -c "CREATE TABLE t(t1 text, i2 int);"
for ((i=1;i<=100;i++)); do
echo "iteration $i";
cat << EOF | psql >/dev/null &
BEGIN;
SELECT txid_current();
SELECT pg_sleep(0.01);
EOF
cat << EOF | psql >/dev/null &
ALTER TABLE t RENAME COLUMN t1 TO i1;
ALTER TABLE t ALTER COLUMN i1 TYPE int4 USING (i1::int4);
SELECT pg_sleep(0.001); INSERT INTO t VALUES (1);
SELECT pg_sleep(0.001); INSERT INTO t VALUES (1);
SELECT pg_sleep(0.001); INSERT INTO t VALUES (1);
SELECT pg_sleep(0.001); INSERT INTO t VALUES (1);
SELECT pg_sleep(0.001); INSERT INTO t VALUES (1);
EOF
cat << EOF | psql >/dev/null &
SELECT 'init' FROM pg_create_logical_replication_slot(
'regression_slot_$i', 'test_decoding');
SELECT data FROM pg_logical_slot_get_changes(
'regression_slot_$i', NULL, NULL);
EOF
wait
grep 'ERROR' server.log && { break; }
cat << EOF | psql >/dev/null
SELECT pg_drop_replication_slot('regression_slot_$i');
ALTER TABLE t ALTER COLUMN i1 TYPE text USING (i1::text);
ALTER TABLE t RENAME COLUMN i1 TO t1;
EOF
done
It fails deterministically (though not on each run) for me as below:
...
iteration 93
ERROR: could not map filenumber "base/16384/17310" to relation OID
an excerpt from server.log (generated with log_min_messages = DEBUG3):
2024-07-07 09:15:36.167 UTC [324743:33] psql LOG: starting logical decoding for slot "regression_slot_93"
2024-07-07 09:15:36.167 UTC [324743:34] psql DETAIL: Streaming transactions committing after 0/2006458, reading WAL
from 0/1FFB820.
2024-07-07 09:15:36.167 UTC [324743:35] psql STATEMENT: SELECT data FROM pg_logical_slot_get_changes(
'regression_slot_93', NULL, NULL);
...
2024-07-07 09:15:36.169 UTC [324743:47] psql DEBUG: snapshot of 0+0 running transaction ids (lsn 0/20065A8 oldest xid
1666 latest complete 1665 next xid 1666)
2024-07-07 09:15:36.169 UTC [324743:48] psql DEBUG: purged committed transactions from 0 to 0, xmin: 1662, xmax: 1662
2024-07-07 09:15:36.169 UTC [324743:49] psql DEBUG: xmin: 1662, xmax: 1662, oldest running: 1662, oldest xmin: 1662
2024-07-07 09:15:36.169 UTC [324743:50] psql DEBUG: updated xmin: 1 restart: 0
2024-07-07 09:15:36.169 UTC [324743:51] psql DEBUG: xmin required by slots: data 0, catalog 1662
2024-07-07 09:15:36.169 UTC [324743:52] psql LOG: logical decoding found consistent point at 0/2006420
2024-07-07 09:15:36.169 UTC [324743:53] psql DETAIL: There are no running transactions.
2024-07-07 09:15:36.169 UTC [324743:54] psql STATEMENT: SELECT data FROM pg_logical_slot_get_changes(
'regression_slot_93', NULL, NULL);
2024-07-07 09:15:36.169 UTC [324743:55] psql DEBUG: aborting old transaction 1662
2024-07-07 09:15:36.170 UTC [324743:56] psql ERROR: could not map filenumber "base/16384/17310" to relation OID
2024-07-07 09:15:36.170 UTC [324743:57] psql STATEMENT: SELECT data FROM pg_logical_slot_get_changes(
'regression_slot_93', NULL, NULL);
The corresponding fragment of WAL:
rmgr: Transaction len (rec/tot): 353/ 353, tx: 1662, lsn: 0/02005B38, prev 0/02005B00, desc: INVALIDATION ;
inval msgs: catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 relcache 17310 snapshot 2608
rmgr: Transaction len (rec/tot): 1605/ 1605, tx: 1662, lsn: 0/02005CA0, prev 0/02005B38, desc: COMMIT
2024-07-07 09:15:36.150647 UTC; rels: base/16384/17305 base/16384/17308 base/16384/17309; dropped stats: 2/16384/17309
2/16384/17308 2/16384/17310; inval msgs: catcache 7 ... relcache 17310 ... relcache 17310 snapshot 2608
rmgr: Transaction len (rec/tot): 34/ 34, tx: 1661, lsn: 0/02006300, prev 0/02005CA0, desc: ABORT 2024-07-07
09:15:36.157644 UTC
rmgr: Standby len (rec/tot): 78/ 78, tx: 0, lsn: 0/02006328, prev 0/02006300, desc: LOCK xid 1662 db
16384 rel 17308 xid 1662 db 16384 rel 17309 xid 1662 db 16384 rel 16385 xid 1662 db 16384 rel 17310
rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/02006378, prev 0/02006328, desc: RUNNING_XACTS
nextXid 1663 latestCompletedXid 1661 oldestRunningXid 1662; 1 xacts: 1662
rmgr: Heap len (rec/tot): 59/ 59, tx: 1663, lsn: 0/020063B0, prev 0/02006378, desc: INSERT+INIT off:
1, flags: 0x08, blkref #0: rel 1663/16384/17310 blk 3
rmgr: Transaction len (rec/tot): 46/ 46, tx: 1663, lsn: 0/020063F0, prev 0/020063B0, desc: COMMIT
2024-07-07 09:15:36.162938 UTC
(The transaction 1662 performs table rewrite.)
What I'm observing with some debug logging added is:
ReorderBufferProcessTXN(), called during processing the commit record
0/02006328, gets reloid = 0 from RelidByRelfilenumber() because
systable_beginscan() chooses (via GetCatalogSnapshot()) HistoricSnapshot
with xmin = 1662, xmax = 1662. That snapshot was set inside
SnapBuildProcessChange(), via ReorderBufferSetBaseSnapshot(), from
builder->snapshot, which is built inside SnapBuildProcessChange(), via
SnapBuildBuildSnapshot(), from builder->xmin (builder->xmax), which is
set by SnapBuildProcessRunningXacts from running->oldestRunningXid = 1662.
Really, WAL contains RUNNING_XACTS oldestRunningXid 1662 before the record
being decoded.
So I wonder if the RUNNING_XACT record in WAL is correct or it should be
decoded differently?
Best regards,
Alexander
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2024-07-07 23:56:42 | Re: Potential data loss due to race condition during logical replication slot creation |
Previous Message | Muhammad Ikram | 2024-07-07 10:56:35 | Re: Error when installing PostgreSQL 16.3.2 under system context NT AUTHORITY\SYSTEM |