From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: BF member drongo doesn't like 035_standby_logical_decoding.pl
Date: 2025-01-24 20:49:51
Message-ID: zqypkuvtihtd2zbmwdfmcceujg4fuakrhojmjkxpp7t4udqkty@couhenc7dsor
Lists: pgsql-hackers
Hi,
On 2025-01-24 14:56:45 -0500, Tom Lane wrote:
> Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
> > There is also a reference to a discussion of the failure there:
> > https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
> > (In short, I observed that that test suffers from bgwriter's activity.)
>
> Hmm, that's an interesting theory, but it fails to explain why
> only drongo is showing this failure. It's not particularly slow
> compared to some other BF animals.
One interesting bit is that in both successful and failing runs the preceding
test is quite slow:
[14:18:13.386](3.192s) ok 25 - row_removal_activeslot on standby created
### Reloading node "standby"
# Running: pg_ctl -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/035_standby_logical_decoding\\data/t_035_standby_logical_decoding_standby_data/pgdata reload
server signaled
#### Begin standard error
psql:<stdin>:1: INFO: vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO: finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 419 remain, 0 are dead but not yet removable
removable cutoff: 760, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 11 pages set all-visible, 10 pages set all-frozen (0 were all-visible)
index scan needed: 3 pages from table (27.27% of total) had 5 dead item identifiers removed
index "pg_class_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_relname_nsp_index": pages: 5 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_tblspc_relfilenode_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 11.684 MB/s, avg write rate: 23.369 MB/s
buffer usage: 90 hits, 8 reads, 16 dirtied
WAL usage: 28 records, 15 full page images, 123107 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
#### End standard error
Waiting for replication conn standby's replay_lsn to pass 0/403FA80 on primary
done
[14:18:48.597](35.210s) ok 26 - inactiveslot slot invalidation is logged with vacuum on pg_class
That's really quite slow. Looking at a few other instances, this step always
seems to take just a few seconds. Obviously, running for longer makes it more
likely for checkpoint/bgwriter to log running xacts.
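For what it's worth, a way to check whether such a record actually landed in
that window could look roughly like the following. This is a sketch only, not
something the test does: the LSN bounds are placeholders and $node_primary is
assumed to be the primary's PostgreSQL::Test::Cluster object.

use PostgreSQL::Test::Utils;
use Test::More;

# Sketch: dump Standby-rmgr records (which include RUNNING_XACTS) from the
# primary's WAL between two LSNs of interest.
my ($stdout, $stderr) = run_command([
    'pg_waldump',
    '--path',  $node_primary->data_dir . '/pg_wal',
    '--rmgr',  'Standby',
    '--start', '0/4000000',
    '--end',   '0/403FA80',
]);
note "Standby rmgr records in the window:\n$stdout";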
Unfortunately there's no timing visible between the vacuum and the wait for
replication (something like the sketch below could surface it)... And the
passing runs don't seem to have the server logs.
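A minimal sketch of what such instrumentation could look like, assuming the
node objects and Test::More helpers the TAP test already has (the variable
names are assumptions, not the test's actual code):

use Test::More;
use Time::HiRes ();

# Sketch only: time the vacuum and the replay wait separately, so the TAP
# output shows which of the two steps the ~30 seconds are spent in.
my $t0 = Time::HiRes::time();
$node_primary->safe_psql('testdb', 'VACUUM verbose pg_class;');
note sprintf("VACUUM took %.3fs", Time::HiRes::time() - $t0);

$t0 = Time::HiRes::time();
$node_primary->wait_for_catchup($node_standby, 'replay',
    $node_primary->lsn('flush'));
note sprintf("replay catchup took %.3fs", Time::HiRes::time() - $t0);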
Because of that I'm looking at a failure:
[05:23:26.630](3.831s) ok 21 - row_removal_activeslot on standby created
### Reloading node "standby"
# Running: pg_ctl -D C:\\prog\\bf\\root\\REL_16_STABLE\\pgsql.build/testrun/recovery/035_standby_logical_decoding\\data/t_035_standby_logical_decoding_standby_data/pgdata reload
server signaled
#### Begin standard error
psql:<stdin>:1: INFO: vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO: finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 417 remain, 0 are dead but not yet removable
removable cutoff: 748, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 3 pages from table (27.27% of total) had 5 dead item identifiers removed
index "pg_class_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_relname_nsp_index": pages: 5 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_tblspc_relfilenode_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 0.264 MB/s, avg write rate: 0.495 MB/s
buffer usage: 77 hits, 8 misses, 15 dirtied
WAL usage: 28 records, 15 full page images, 122841 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.23 s
#### End standard error
Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
done
[05:23:54.630](28.000s) ok 22 - inactiveslot slot invalidation is logged with vacuum on pg_class
[05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class
The vacuum only starts at 05:23:49.294, i.e. after ~23s.
2024-09-06 05:23:49.294 UTC [552:1] [unknown] LOG: connection received: host=127.0.0.1 port=51838
2024-09-06 05:23:49.301 UTC [552:2] [unknown] LOG: connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/prog/bf/root/REL_16_STABLE/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_primary_data/pgdata/pg_hba.conf:2)
2024-09-06 05:23:49.301 UTC [552:3] [unknown] LOG: connection authorized: user=pgrunner database=testdb application_name=035_standby_logical_decoding.pl
2024-09-06 05:23:49.355 UTC [552:4] 035_standby_logical_decoding.pl LOG: statement: VACUUM verbose pg_class;
2024-09-06 05:23:49.607 UTC [552:5] 035_standby_logical_decoding.pl LOG: statement: INSERT INTO flush_wal DEFAULT VALUES;
2024-09-06 05:23:49.609 UTC [552:6] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.322 user=pgrunner database=testdb host=127.0.0.1 port=51838
Before that we see:
2024-09-06 05:23:27.683 UTC [168:4] 035_standby_logical_decoding.pl LOG: statement: SELECT xmin IS NULL AND catalog_xmin IS NULL
FROM pg_catalog.pg_replication_slots
WHERE slot_name = 'primary_physical';
2024-09-06 05:23:31.447 UTC [5836:4] 035_standby_logical_decoding.pl LOG: statement: select pg_snapshot_xmin(pg_current_snapshot());
2024-09-06 05:23:36.963 UTC [4908:4] 035_standby_logical_decoding.pl LOG: statement: CREATE TABLE conflict_test(x integer, y text);
2024-09-06 05:23:36.972 UTC [4908:5] 035_standby_logical_decoding.pl LOG: statement: DROP TABLE conflict_test;
2024-09-06 05:23:45.192 UTC [4240:4] 035_standby_logical_decoding.pl LOG: statement: SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - 746) > 0
Between 05:23:30.797 and 05:23:49.601 nothing is logged on the standby.
There's a lot of time between disconnects and connects in the test around that
time:
2024-09-06 05:23:31.591 UTC [5836:5] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.240 user=pgrunner database=testdb host=127.0.0.1 port=51817
2024-09-06 05:23:36.897 UTC [4908:1] [unknown] LOG: connection received: host=127.0.0.1 port=51829
2024-09-06 05:23:37.032 UTC [4908:6] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.142 user=pgrunner database=testdb host=127.0.0.1 port=51829
2024-09-06 05:23:45.135 UTC [4240:1] [unknown] LOG: connection received: host=127.0.0.1 port=51836
2024-09-06 05:23:45.272 UTC [4240:5] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.152 user=pgrunner database=testdb host=127.0.0.1 port=51836
2024-09-06 05:23:49.294 UTC [552:1] [unknown] LOG: connection received: host=127.0.0.1 port=51838
Even though afaict the perl script doesn't do anything else between those
connections.
> BTW, looking more closely at the failures, I notice that some of them
> are 'with vacuum on pg_authid' but others are 'with vacuum on pg_class'.
> So those are two different steps in the test script, and there are
> other steps that are very similar but have not failed. Curiouser
> and curiouser.
I think that makes some sense:
"Scenario 1":
Doesn't have the race to the same degree, because a VACUUM FULL on pg_class is
used, which prevents logical decoding from progressing once the VACUUM FULL has
started (due to the AEL logged at its start; see the sketch after this list).
"Scenario 2" and "Scenario 3":
Nearly the same test, except once with a database-local relation and once with
a shared rel.
"Scenario 4":
Doesn't expect to trigger a conflict, so the race doesn't matter.
"Scenario 5":
Theoretically racy, but the window seems much smaller; it looks like the
conflict has multiple chances to be triggered, since each of the updates
happens in a separate session (and thus transaction).
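To make the difference concrete, here is a rough sketch of the statements
behind those scenarios (assuming a $node_primary object; the statements mirror
what the test and the server log above show):

# Scenario 1: the AEL logged at the start of VACUUM FULL keeps decoding on
# the standby from advancing past that point once the command has started.
$node_primary->safe_psql('testdb', 'VACUUM FULL pg_class;');
# Scenarios 2 and 3: plain VACUUM of a database-local vs. a shared catalog;
# no logged AEL, hence the wider window for the race.
$node_primary->safe_psql('testdb', 'VACUUM pg_class;');
$node_primary->safe_psql('testdb', 'VACUUM pg_authid;');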
Greetings,
Andres Freund