From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Amit Kapila <akapila(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org |
Subject: | Re: pgsql: Allow using the updated tuple while moving it to a different par |
Date: | 2018-07-12 21:57:14 |
Message-ID: | 11887.1531432634@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-committers pgsql-hackers |
Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
> On Thu, Jul 12, 2018 at 1:29 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> Looks unrelated, but flaviventris (a snake?) has just complained:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2018-07-12%2007%3A36%3A01
> Yeah, I am confused as to how this commit can lead to that failure.
It didn't; looks like just a random failure to me. The test is timing
out:
ok 26 - catalog xmin of cascaded slot still null with hs_feedback reset
# re-enabling hot_standby_feedback and disabling while stopped
### Reloading node "standby_2"
# Running: pg_ctl -D /home/bf/build/buildfarm-flaviventris/HEAD/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata reload
server signaled
Waiting for replication conn standby_1's replay_lsn to pass '0/30A1720' on master
done
Waiting for replication conn standby_2's replay_lsn to pass '0/30A1720' on standby_1
done
### Stopping node "standby_2" using mode fast
# Running: pg_ctl -D /home/bf/build/buildfarm-flaviventris/HEAD/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata -m fast stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "standby_2"
#
Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
This looks to me like it's probably a low-probability timing problem in
the test script itself. The relevant part of the script is
$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
replay_check();
$node_standby_2->safe_psql('postgres',
'ALTER SYSTEM SET hot_standby_feedback = off;');
$node_standby_2->stop;
($xmin, $catalog_xmin) =
get_slot_xmins($node_standby_1, $slotname_2, "xmin IS NOT NULL");
isnt($xmin, '', 'xmin of cascaded slot non-null with postgres shut down');
where the failure is occurring because get_slot_xmins() never sees the
expected state. I'm suspicious that this is happening because the script
is making no attempt to ensure the relative timing of events on the three
servers --- notably, issuing a "reload" does nothing to guarantee that
the server has actually seen and responded to the parameter change.
As an extreme case, I can reproduce the failure exactly with this:
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index a0d3e8f..0824184 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -286,7 +286,7 @@ is($catalog_xmin, '',
note "re-enabling hot_standby_feedback and disabling while stopped";
$node_standby_2->safe_psql('postgres',
'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_2->reload;
+# $node_standby_2->reload;
$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
replay_check();
so I'm thinking that what we're seeing is an effect of the second standby
sometimes being too slow to turn on hot_standby_feedback.
We could maybe hack around that with a short "sleep" in this script, but
a less cruddy solution would be to change PostgresNode::reload so that
it somehow waits for the server to process the reload signal.
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2018-07-13 00:48:39 | pgsql: Fix argument of pg_create_logical_replication_slot for slot name |
Previous Message | Michael Paquier | 2018-07-12 21:44:09 | pgsql: Clean up temporary WAL segments after an instance crash |
From | Date | Subject | |
---|---|---|---|
Next Message | Fabien COELHO | 2018-07-12 22:03:39 | Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd) |
Previous Message | Michael Paquier | 2018-07-12 21:45:35 | Re: Temporary WAL segments files not cleaned up after an instance crash |