From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Kehlet <steve(dot)kehlet(at)gmail(dot)com>, Forums postgresql <pgsql-general(at)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 |
Date: | 2015-06-03 18:44:18 |
Message-ID: | 20150603184418.GG18006@awork2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general pgsql-hackers |
On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > That's not necessarily the case though, given how the code currently
> > works. In a bunch of places the SLRUs are accessed *before* having been
> > made consistent by WAL replay. Especially if several checkpoints/vacuums
> > happened during the base backup the assumed state (i.e. the mxacts
> > checkpoints refer to) of the data directory soon after the initial
> > start, and the state of pg_multixact/ won't necessarily match at all.
>
> Well, the log extract is quite simple; it says that as soon as the
> standby starts replay WAL, it fetches exactly one WAL segment, which
> contains exactly one online checkpoint record; this record contains
> oldestMulti=4624559, which belongs in offset file 0046. But the
> basebackup only contains files starting from 004B onwards. The base
> backup takes a long time, so my guess of what happened is that the
> backup label points to the start of the WAL stream (obviously), then the
> data files are copied to the standby; during this long process, a
> multixact truncation happens. So by the time the base backup reaches
> the pg_multixact directory, the 0046 file has been removed.
Yes. That's precisely what I was concerned about above and elsewhere in
the thread. It's simply not ok to access a SLRU while not yet
consistent...
> One idea I had was: what if the oldestMulti pointed to another multi
> earlier in the same 0046 file, so that it is read-as-zeroes (and the
> file is created), and then a subsequent multixact truncate tries to read
> a later page in the file. In SlruPhysicalReadPage() this would give a
> change for open() to not fail, and then read() can fail as above.
> However, we would have an earlier LOG message about "reading as zeroes".
>
> Really, the whole question of how this code goes past the open() failure
> in SlruPhysicalReadPage baffles me. I don't see any possible way for
> the file to be created ...
Wouldn't a previous WAL record zeroing another part of that segment
explain this? A zero sized segment pretty much would lead to this error,
right? Or were you able to check how things look after the failure?
> 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive
> 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
> 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
> 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
> 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure
From this isn't not entirely clear where this error was triggered from.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2015-06-03 19:04:47 | Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 |
Previous Message | Alvaro Herrera | 2015-06-03 18:29:40 | Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 |
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2015-06-03 19:04:47 | Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 |
Previous Message | Alvaro Herrera | 2015-06-03 18:29:40 | Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 |