Re: BUG #18509: Logical decoding behaves badly when processing a change record for a table with altered column

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18509: Logical decoding behaves badly when processing a change record for a table with altered column
Date: 2024-07-07 11:00:00
Message-ID: be69a11d-b7b5-9ee8-221a-fdf34898bb6d@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

14.06.2024 21:00, PG Bug reporting form wrote:
> The following script:
> ...
>
> produces errors like:
> 2024-06-14 17:10:58.349 UTC|law|regression|666c79a2.3fe3a4|ERROR: could not
> map filenumber "base/16384/16453" to relation OID

I've simplified that script to the following:
psql -c "CREATE TABLE t(t1 text, i2 int);"

for ((i=1;i<=100;i++)); do
echo "iteration $i";

cat << EOF | psql >/dev/null &
BEGIN;
SELECT txid_current();
SELECT pg_sleep(0.01);
EOF

cat << EOF | psql >/dev/null &
ALTER TABLE t RENAME COLUMN t1 TO i1;
ALTER TABLE t ALTER COLUMN i1 TYPE int4 USING (i1::int4);
SELECT pg_sleep(0.001); INSERT INTO t VALUES (1);
SELECT pg_sleep(0.001); INSERT INTO t VALUES (1);
SELECT pg_sleep(0.001); INSERT INTO t VALUES (1);
SELECT pg_sleep(0.001); INSERT INTO t VALUES (1);
SELECT pg_sleep(0.001); INSERT INTO t VALUES (1);
EOF

cat << EOF | psql >/dev/null &
SELECT 'init' FROM pg_create_logical_replication_slot(
  'regression_slot_$i', 'test_decoding');

SELECT data FROM pg_logical_slot_get_changes(
  'regression_slot_$i', NULL, NULL);
EOF
wait

grep 'ERROR' server.log  && { break; }

cat << EOF | psql >/dev/null
SELECT pg_drop_replication_slot('regression_slot_$i');
ALTER TABLE t ALTER COLUMN i1 TYPE text USING (i1::text);
ALTER TABLE t RENAME COLUMN i1 TO t1;
EOF

done

It fails deterministically (though not on each run) for me as below:
...
iteration 93
ERROR:  could not map filenumber "base/16384/17310" to relation OID

an excerpt from server.log (generated with log_min_messages = DEBUG3):
2024-07-07 09:15:36.167 UTC [324743:33] psql LOG:  starting logical decoding for slot "regression_slot_93"
2024-07-07 09:15:36.167 UTC [324743:34] psql DETAIL:  Streaming transactions committing after 0/2006458, reading WAL
from 0/1FFB820.
2024-07-07 09:15:36.167 UTC [324743:35] psql STATEMENT:  SELECT data FROM pg_logical_slot_get_changes(
      'regression_slot_93', NULL, NULL);
...
2024-07-07 09:15:36.169 UTC [324743:47] psql DEBUG:  snapshot of 0+0 running transaction ids (lsn 0/20065A8 oldest xid
1666 latest complete 1665 next xid 1666)
2024-07-07 09:15:36.169 UTC [324743:48] psql DEBUG:  purged committed transactions from 0 to 0, xmin: 1662, xmax: 1662
2024-07-07 09:15:36.169 UTC [324743:49] psql DEBUG:  xmin: 1662, xmax: 1662, oldest running: 1662, oldest xmin: 1662
2024-07-07 09:15:36.169 UTC [324743:50] psql DEBUG:  updated xmin: 1 restart: 0
2024-07-07 09:15:36.169 UTC [324743:51] psql DEBUG:  xmin required by slots: data 0, catalog 1662
2024-07-07 09:15:36.169 UTC [324743:52] psql LOG:  logical decoding found consistent point at 0/2006420
2024-07-07 09:15:36.169 UTC [324743:53] psql DETAIL:  There are no running transactions.
2024-07-07 09:15:36.169 UTC [324743:54] psql STATEMENT:  SELECT data FROM pg_logical_slot_get_changes(
      'regression_slot_93', NULL, NULL);
2024-07-07 09:15:36.169 UTC [324743:55] psql DEBUG:  aborting old transaction 1662
2024-07-07 09:15:36.170 UTC [324743:56] psql ERROR:  could not map filenumber "base/16384/17310" to relation OID
2024-07-07 09:15:36.170 UTC [324743:57] psql STATEMENT:  SELECT data FROM pg_logical_slot_get_changes(
      'regression_slot_93', NULL, NULL);

The corresponding fragment of WAL:
rmgr: Transaction len (rec/tot):    353/   353, tx:       1662, lsn: 0/02005B38, prev 0/02005B00, desc: INVALIDATION ;
inval msgs: catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 relcache 17310 snapshot 2608
rmgr: Transaction len (rec/tot):   1605/  1605, tx:       1662, lsn: 0/02005CA0, prev 0/02005B38, desc: COMMIT
2024-07-07 09:15:36.150647 UTC; rels: base/16384/17305 base/16384/17308 base/16384/17309; dropped stats: 2/16384/17309
2/16384/17308 2/16384/17310; inval msgs: catcache 7 ... relcache 17310 ... relcache 17310 snapshot 2608
rmgr: Transaction len (rec/tot):     34/    34, tx:       1661, lsn: 0/02006300, prev 0/02005CA0, desc: ABORT 2024-07-07
09:15:36.157644 UTC
rmgr: Standby     len (rec/tot):     78/    78, tx:          0, lsn: 0/02006328, prev 0/02006300, desc: LOCK xid 1662 db
16384 rel 17308 xid 1662 db 16384 rel 17309 xid 1662 db 16384 rel 16385 xid 1662 db 16384 rel 17310
rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/02006378, prev 0/02006328, desc: RUNNING_XACTS
nextXid 1663 latestCompletedXid 1661 oldestRunningXid 1662; 1 xacts: 1662
rmgr: Heap        len (rec/tot):     59/    59, tx:       1663, lsn: 0/020063B0, prev 0/02006378, desc: INSERT+INIT off:
1, flags: 0x08, blkref #0: rel 1663/16384/17310 blk 3
rmgr: Transaction len (rec/tot):     46/    46, tx:       1663, lsn: 0/020063F0, prev 0/020063B0, desc: COMMIT
2024-07-07 09:15:36.162938 UTC
(The transaction 1662 performs table rewrite.)

What I'm observing with some debug logging added is:
ReorderBufferProcessTXN(), called during processing the commit record
0/02006328, gets reloid = 0 from RelidByRelfilenumber() because
systable_beginscan() chooses (via GetCatalogSnapshot()) HistoricSnapshot
with xmin = 1662, xmax = 1662. That snapshot was set inside
SnapBuildProcessChange(), via ReorderBufferSetBaseSnapshot(), from
builder->snapshot, which is built inside SnapBuildProcessChange(), via
SnapBuildBuildSnapshot(), from builder->xmin (builder->xmax), which is
set by SnapBuildProcessRunningXacts from running->oldestRunningXid = 1662.

Really, WAL contains RUNNING_XACTS oldestRunningXid 1662 before the record
being decoded.

So I wonder if the RUNNING_XACT record in WAL is correct or it should be
decoded differently?

Best regards,
Alexander

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2024-07-07 23:56:42 Re: Potential data loss due to race condition during logical replication slot creation
Previous Message Muhammad Ikram 2024-07-07 10:56:35 Re: Error when installing PostgreSQL 16.3.2 under system context NT AUTHORITY\SYSTEM