| From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
|---|---|
| To: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
| Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders |
| Date: | 2017-04-22 15:41:01 |
| Message-ID: | 4219.1492875661@sss.pgh.pa.us |
| Lists: | pgsql-committers pgsql-hackers |

Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
> The assertion fails reliably for me, because standby2's reported write
> LSN jumps backwards after the timeline changes: for example I see
> 3020000 then 3028470 then 3020000 followed by a normal progression.
> Surprisingly, 004_timeline_switch.pl reports success anyway. I'm not
> sure why the test fails sometimes on tern, but you can see that even
> when it passed on tern the assertion had failed.

Whoa. This just turned into a much larger can of worms than I expected.
How can it be that processes are getting assertion crashes and yet the
test framework reports success anyway? That's impossibly
broken/unacceptable.
Looking closer at the tern report we started the thread with, there
are actually TWO assertion trap reports, the one Alvaro noted and
another one in 009_twophase_master.log:
TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
When I run the recovery test on my own machine, it reports success
(quite reliably, I tried a bunch of times yesterday), but now that
I know to look:
$ grep TRAP tmp_check/log/*
tmp_check/log/009_twophase_master.log:TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
So we now have three problems not just one:
* How is it that the TAP tests aren't noticing the failure? This one,
to my mind, is a code-red situation, as it basically invalidates every
TAP test we've ever run.
* If Thomas's explanation for the timeline-switch assertion is correct,
why isn't it reproducible everywhere?
* What's with that second TRAP?
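
For the first problem, one stopgap is to scan the server logs after a run, which is what the grep above does by hand. The following is a minimal sketch in Python, not something the TAP framework does today. The names `find_traps` and `exit_status` are hypothetical, and it assumes the logs are `*.log` files in a single directory such as tmp_check/log and that trap lines begin with `TRAP:` as in the output above.

```python
from pathlib import Path


def find_traps(log_dir):
    """Return (file name, line) pairs for every line starting with 'TRAP:'.

    Assumes server logs are *.log files directly under log_dir.
    """
    hits = []
    for path in sorted(Path(log_dir).glob("*.log")):
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(path, errors="replace") as fh:
            for line in fh:
                if line.startswith("TRAP:"):
                    hits.append((path.name, line.rstrip("\n")))
    return hits


def exit_status(log_dir):
    """Print each trap in grep-like form; return 1 if any were found, else 0."""
    traps = find_traps(log_dir)
    for name, line in traps:
        print(f"{name}:{line}")
    return 1 if traps else 0
```

A wrapper could call `sys.exit(exit_status("tmp_check/log"))` after the test run, so that a trap in any log makes the run fail even when the test scripts themselves report success.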
> Here is a fix for the assertion failure.

As for this patch itself, is it reasonable to try to assert that the
timeline has in fact changed?

regards, tom lane