Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Steve Kehlet <steve(dot)kehlet(at)gmail(dot)com>, Forums postgresql <pgsql-general(at)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date: 2015-06-02 15:29:24
Message-ID: CA+TgmoaAnNLS-cM7UhGMU=aaDw_LpCGdXKajP1MQApUvvpDG8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> this scenario now (since a couple weeks ago) does a
> SimpleLruReadPage_ReadOnly() in the members slru via
> find_multixact_start(). That just won't work acceptably when we're not
> yet consistent. There very well could not be a valid members segment at
> that point? Am I missing something?

Yes: that code isn't new.

TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
9.3.0 and every subsequent release until 9.3.7/9.4.2. The only thing
that's changed is that we've moved that logic into a function called
find_multixact_start() instead of having it directly inside that
function. We did that because we needed to use the same logic in some
other places. The reason why 9.3.7/9.4.2 are causing problems for
people that they didn't have previously is because those new,
additional call sites were poorly chosen and didn't include adequate
protection against calling that function with an invalid input value.
What this patch is about is getting back to the situation that we were
in from 9.3.0 - 9.3.6 and 9.4.0 - 9.4.1, where TruncateMultiXact() did
the thing that you're complaining about here but no one else did.

From my point of view, I think that you are absolutely right to
question what's going on in TruncateMultiXact(). It's kooky, and
there may well be bugs buried there. But I don't think fixing that
should be the priority right now, because we have zero reports of
problems attributable to that logic. I think the priority should be
on undoing the damage that we did in 9.3.7/9.4.2, when we made other
places to do the same thing. We started getting trouble reports
attributable to those changes *almost immediately*, which means that
whether or not TruncateMultiXact() is broken, these new call sites
definitely are. I think we really need to fix those new places ASAP.

> I think at the very least we'll have to skip this step while not yet
> consistent. That really sucks, because we'll possibly end up with
> multixacts that are completely filled by the time we've reached
> consistency.

That would be a departure from the behavior of every existing release
that includes this code based on, to my knowledge, zero trouble
reports.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message William Dunn 2015-06-02 15:35:58 Re: Database designpattern - product feature
Previous Message Andres Freund 2015-06-02 15:27:33 Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2015-06-02 15:34:17 Re: nested loop semijoin estimates
Previous Message Andres Freund 2015-06-02 15:27:33 Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1