From: | Mark Dilger <hornschnorter(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders |
Date: | 2017-04-23 02:44:28 |
Message-ID: | 3177FAD5-3600-4C8B-B338-FB108F60455F@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-committers pgsql-hackers |
> On Apr 22, 2017, at 11:40 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I wrote:
>> Whoa. This just turned into a much larger can of worms than I expected.
>> How can it be that processes are getting assertion crashes and yet the
>> test framework reports success anyway? That's impossibly
>> broken/unacceptable.
>
> I poked into this on my laptop, where I'm able to reproduce both assertion
> failures. The reason the one in subtrans.c is not being detected by the
> test framework is that it happens late in the run and the test doesn't
> do anything more with that server than shut it down. "pg_ctl stop"
> actually notices there's a problem; for instance, the log on tern for
> this step shows
>
> ### Stopping node "master" using mode immediate
> # Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata -m immediate stop
> pg_ctl: PID file "/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata/postmaster.pid" does not exist
> Is server running?
> # No postmaster PID
>
> However, PostgresNode::stop blithely ignores the failure exit from
> pg_ctl. I think it should use system_or_bail() not just system_log(),
> so that we'll notice if the server is not running when we expect it
> to be. It's conceivable that there should also be a "stop_if_running"
> method that allows the case where the server is not running, but so
> far as I can find, no existing TAP test needs such a behavior --- and
> it certainly shouldn't be the default.
>
> The walsender.c crash is harder to detect because the postmaster very
> nicely recovers and restarts its children. That's great for robustness,
> but it sucks for testing. I think that we ought to disable that in
> TAP tests, which we can do by having PostgresNode::init include
> "restart_after_crash = off" in the postgresql.conf modifications it
> applies. Again, it's conceivable that some tests would not want that,
> but there is no existing test that seems to need crash recovery on,
> and I don't see a good argument for it being default for test purposes.
>
> As best I've been able to tell, the reason why the walsender.c crash
> is detected when it's detected is that the TAP script chances to try
> to connect while the recovery is happening:
>
> connection error: 'psql: FATAL: the database system is in recovery mode'
> while running 'psql -XAtq -d port=57718 host=/tmp/_uP8FKEynq dbname='postgres' -f - -v ON_ERROR_STOP=1' at /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1173.
>
> The window for that is not terribly wide, explaining why the detection
> is unreliable even when the crash does occur.
>
> In short then, I propose the attached patch to make these cases fail
> more reliably. We might extend this later to allow the old behaviors
> to be explicitly opted-into, but we don't seem to need that today.
>
> regards, tom lane
I pulled fresh sources with your latest commit, 7d68f2281a4b56834c8e5648fc7da0b73b674c45,
and the tests consistently fail (5 out of 5 attempts) for me on my laptop with:
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C recovery check
for extra in contrib/test_decoding; do /Applications/Xcode.app/Contents/Developer/usr/bin/make -C '../../..'/$extra DESTDIR='/Users/mark/vanilla/bitoctetlength'/tmp_install install >>'/Users/mark/vanilla/bitoctetlength'/tmp_install/log/install.log || exit; done
rm -rf /Users/mark/vanilla/bitoctetlength/src/test/recovery/tmp_check/log
cd . && TESTDIR='/Users/mark/vanilla/bitoctetlength/src/test/recovery' PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/bin:$PATH" DYLD_LIBRARY_PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/lib" PGPORT='65432' PG_REGRESS='/Users/mark/vanilla/bitoctetlength/src/test/recovery/../../../src/test/regress/pg_regress' prove -I ../../../src/test/perl/ -I . t/*.pl
t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
t/004_timeline_switch.pl ............. Bailout called. Further testing stopped: system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed
make[2]: *** [check] Error 255
make[1]: *** [check-recovery-recurse] Error 2
make: *** [check-world-src/test-recurse] Error 2
The first time, I had some local changes, but I've stashed them and am still getting the error each of
these next four times. I'm compiling as follows:
make distclean; make clean; ./configure --enable-cassert --enable-tap-tests --enable-depend && make -j4 && make check-world
Are the errors expected now?
mark
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2017-04-23 05:26:46 | Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders |
Previous Message | Thomas Munro | 2017-04-23 02:11:40 | Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2017-04-23 05:26:46 | Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders |
Previous Message | Bruce Momjian | 2017-04-23 02:37:06 | Statistics "dependency" |