Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Kehlet <steve(dot)kehlet(at)gmail(dot)com>, Forums postgresql <pgsql-general(at)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date:	2015-06-03 17:31:36
Message-ID:	20150603173136.GF18006@awork2.anarazel.de
Lists:	pgsql-general pgsql-hackers

On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
> Thomas Munro wrote:
> > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> > > My guess is that the file existed, and perhaps had one or more pages,
> > > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > > back. read() returns 0 in this case but doesn't set errno.
> > >
> > > I didn't find a way to set things so that the file exists but is of
> > > shorter contents than oldestMulti by the time the checkpoint record is
> > > replayed.
> >
> > I'm just starting to learn about the recovery machinery, so forgive me
> > if I'm missing something basic here, but I just don't get this. As I
> > understand it, offsets/0046 should either have been copied with that
> > page present in it if it existed before the backup started (apparently
> > not in this case), or extended to contain it by WAL records that come
> > after the backup label but before the checkpoint record that
> > references it (also apparently not in this case).

That's not necessarily the case though, given how the code currently
works. In a bunch of places the SLRUs are accessed *before* having been
made consistent by WAL replay. Especially if several checkpoints/vacuums
happened during the base backup, the state of the data directory that is
assumed soon after the initial start (i.e. the mxacts that checkpoints
refer to) and the actual state of pg_multixact/ won't necessarily match
at all.
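
To make the symptom Alvaro guessed at concrete, here is a minimal sketch
of a read that lands past the end of an existing file. It is not the SLRU
code: the helper read_page, the temporary file named 0046, and the 8192
byte page size are assumptions for illustration only. It shows just the
OS-level behaviour, namely that the read returns zero bytes and reports
no error. It relies on os.pread, so it needs a Unix-like system.

```python
import os
import tempfile

BLCKSZ = 8192  # assumed page size (the PostgreSQL default block size)


def read_page(path, pageno):
    """Read one page at the given page number.

    Returns whatever bytes the OS hands back: a full page if it exists,
    and an empty bytes object if the offset is at or past end of file.
    No exception is raised in the second case.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, BLCKSZ, pageno * BLCKSZ)
    finally:
        os.close(fd)


# Hypothetical segment file that exists but holds only one page.
with tempfile.TemporaryDirectory() as tmpdir:
    seg = os.path.join(tmpdir, "0046")
    with open(seg, "wb") as f:
        f.write(b"\0" * BLCKSZ)

    present = read_page(seg, 0)  # page 0 exists: a full page comes back
    missing = read_page(seg, 1)  # page 1 does not: zero bytes, no error
```

A caller that only checks for an error return, and not for a short read,
cannot tell "page missing" apart from a real I/O failure.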

> Exactly --- that's the spot at which I am, also. I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, modulo some changes to the counters (the file that
> didn't exist was 004B rather than 0046). I'm still at a loss at what
> the failure mode is. We must be missing some crucial detail ...

I might have missed it in this already long thread. Could you share a
bunch of details about the case? It'd be very interesting to see the
contents of the backup label (to see where start/end are), the contents
of the initial checkpoint (to see which mxacts we assume to exist at
start) and what the initial contents of pg_multixact are (to match up).
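
For the matching-up step, a rough sketch of how a multixact ID maps to a
file under pg_multixact/offsets might help. The constants assume a
default 9.4 build (8192 byte pages, 4 byte offset entries, 32 pages per
SLRU segment), and the function names are made up for this example; they
are not PostgreSQL identifiers.

```python
BLCKSZ = 8192                # assumed default block size
MULTIXACT_OFFSET_SIZE = 4    # assumed size of one offsets entry, in bytes
SLRU_PAGES_PER_SEGMENT = 32  # assumed pages per SLRU segment file

OFFSETS_PER_PAGE = BLCKSZ // MULTIXACT_OFFSET_SIZE
OFFSETS_PER_SEGMENT = OFFSETS_PER_PAGE * SLRU_PAGES_PER_SEGMENT


def offsets_location(mxid):
    """Return (segment file name, page within segment) for a multixact ID."""
    page = mxid // OFFSETS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    page_in_segment = page % SLRU_PAGES_PER_SEGMENT
    return "%04X" % segment, page_in_segment


def segment_mxid_range(name):
    """Return the (first, last) multixact IDs covered by a segment file name."""
    first = int(name, 16) * OFFSETS_PER_SEGMENT
    return first, first + OFFSETS_PER_SEGMENT - 1
```

Under those assumptions, offsets_location(oldest_multi) gives the file
and page that must be present for the oldest multixact named in the
checkpoint record, and segment_mxid_range("0046") gives the ID range that
the missing file would have covered.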

Greetings,

Andres Freund
