From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Date: 2022-01-10 21:25:27
Message-ID: 2411751.1641849927@sss.pgh.pa.us
Lists: pgsql-hackers
Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> Hmm, one thing I'm still unclear on: did this problem really start
> with 6051857fc/ed52c3707? My initial email in this thread lists
> similar failures going back further, doesn't it? (And what's tern
> doing mixed up in this mess?)
Well, those earlier ones may be committs failures, but a lot of
them contain different-looking symptoms, eg pg_ctl failures.
tern's failure at
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-17+10%3A57%3A49
does look similar, but we can see in its log that the standby
*did* notice the primary disconnection and then reconnect:
2021-07-17 16:29:08.248 UTC [17498380:2] LOG: replication terminated by primary server
2021-07-17 16:29:08.248 UTC [17498380:3] DETAIL: End of WAL reached on timeline 1 at 0/30378F8.
2021-07-17 16:29:08.248 UTC [17498380:4] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2021-07-17 16:29:08.248 UTC [25166230:5] LOG: invalid record length at 0/30378F8: wanted 24, got 0
2021-07-17 16:29:08.350 UTC [16318578:1] FATAL: could not connect to the primary server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2021-07-17 16:29:36.369 UTC [7077918:1] FATAL: could not connect to the primary server: FATAL: the database system is starting up
2021-07-17 16:29:36.380 UTC [11338028:1] FATAL: could not connect to the primary server: FATAL: the database system is starting up
...
2021-07-17 16:29:36.881 UTC [17367092:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
So I'm not sure what happened there, but it's not an instance
of this problem. One thing that looks a bit suspicious is
this in the primary's log:
2021-07-17 16:26:47.832 UTC [12386550:1] LOG: using stale statistics instead of current ones because stats collector is not responding
which makes me wonder if the timeout is down to out-of-date
pg_stats data. The loop in 002_standby.pl doesn't appear to
depend on the stats collector:
my $primary_lsn =
  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
$standby->poll_query_until('postgres',
  qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
  or die "standby never caught up";
but maybe I'm missing the connection.
Apropos of that, it's worth noting that wait_for_catchup *is*
dependent on up-to-date stats, and here's a recent run where
it sure looks like the timeout cause is AWOL stats collector:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2022-01-10%2004%3A51%3A34
I wonder if we should refactor wait_for_catchup to probe the
standby directly instead of relying on the upstream's view.
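For concreteness, a rough sketch of the difference (glossing over
wait_for_catchup's actual interface; $target_lsn and $standby_name here
just stand in for whatever the caller supplies):

  # today, roughly: ask the primary's pg_stat_replication view whether
  # the standby has replayed up to the target LSN
  $primary->poll_query_until('postgres',
    qq{SELECT replay_lsn >= '$target_lsn'::pg_lsn
       FROM pg_catalog.pg_stat_replication
       WHERE application_name = '$standby_name'})
    or die "timed out waiting for catchup";

  # possible refactoring: ask the standby directly, as 002_standby.pl's
  # loop does, so the test doesn't depend on the primary's view being current
  $standby->poll_query_until('postgres',
    qq{SELECT pg_last_wal_replay_lsn() >= '$target_lsn'::pg_lsn})
    or die "timed out waiting for catchup";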
regards, tom lane