From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Steve Kehlet <steve(dot)kehlet(at)gmail(dot)com>, Forums postgresql <pgsql-general(at)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date: 2015-05-29 03:24:45
Message-ID: CA+TgmoaC_TRogpqZ50vTrr3wu8hRugvkR2c4-V9RbrsA6x=3og@mail.gmail.com
Lists: pgsql-general pgsql-hackers
On Thu, May 28, 2015 at 10:41 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
>> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
>> values which are equal to the next-mxid counter instead of the correct
>> value; in other words, they are too new.
>
> [ discussion of how the control file's oldestMultiXid gets set ]
I'm talking about the datminmxid in pg_database. You're talking about
the contents of pg_control. Those are two different things. The
relevant code is not what you quote, but rather this:
    /* set pg_database.datminmxid */
    PQclear(executeQueryOrDie(conn_template1,
                              "UPDATE pg_catalog.pg_database "
                              "SET   datminmxid = '%u'",
                              old_cluster.controldata.chkpnt_nxtmulti));
Tom previously observed this to be wrong, here:
http://www.postgresql.org/message-id/9879.1405877821@sss.pgh.pa.us
Although Tom was correct to note that it's wrong, nothing ever got fixed. :-(
>> A. Most obviously, we should fix pg_upgrade so that it installs
>> chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
>> that we stop creating new instances of this problem. That won't get
>> us out of the hole we've dug for ourselves, but we can at least try to
>> stop digging. (This is assuming I'm right that chkpnt_nxtmulti is the
>> wrong thing - anyone want to double-check me on that one?)
>
> I don't think there's anything that we need to fix here.
I see your followup now agreeing this is broken. Since I wrote the
previous email, I've had two new ideas that I think are both better
than the above.
1. Figure out the oldest multixact whose offsets data actually exists
in pg_multixact/offsets, and use that value. If any older MXIDs still
exist, they can't be looked up anyway, so it doesn't matter if they
wrap around. The only value that needs to be reliable in order to do
this is pg_controldata's NextMultiXactId, which to the best of my
knowledge is not implicated in any of these bugs. pg_upgrade can check
that the offsets file containing that value exists, and bail out if it
doesn't. Then, start stepping backwards a file at a time. When it hits
a missing file, the first multixact in the next file is a safe value
of datminmxid for every database in the new cluster. If the value is
older than necessary, the first vacuum in each database will fix it.
We have to be careful, though: if we step back so many files that our
proposed datminmxid might wrap, then we've got a confusing situation
and had better bail out - or at least think really carefully about
what to do. (A rough sketch of this scan appears below, after idea 2.)
2. When we're upgrading from version 9.3 or higher, copy the EXACT
datminmxid from each old database to the corresponding new database.
This seems like it ought to be really simple. (Also sketched below.)
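To make idea 1 concrete, here's a rough sketch of the backwards scan.
It assumes the offsets SLRU layout as I understand it (8192-byte pages
holding 2048 four-byte entries, 32 pages per segment, segment files
named in hex); the function names, the sanity cap, and the wraparound
shortcuts are made up for illustration, this is not pg_upgrade code:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MXACTS_PER_OFFSETS_SEGMENT  (2048 * 32)     /* 65536 per segment file */

    static bool
    offsets_segment_exists(const char *datadir, uint32_t segno)
    {
        char        path[1024];

        snprintf(path, sizeof(path), "%s/pg_multixact/offsets/%04X",
                 datadir, segno);
        return access(path, F_OK) == 0;
    }

    /* Returns a conservative oldest multixact, or 0 on failure. */
    static uint32_t
    find_safe_oldest_multixact(const char *datadir, uint32_t next_mxid)
    {
        uint32_t    segno = next_mxid / MXACTS_PER_OFFSETS_SEGMENT;
        uint32_t    steps = 0;

        /* Bail out if even the segment for NextMultiXactId is missing. */
        if (!offsets_segment_exists(datadir, segno))
            return 0;

        /*
         * Step back one file at a time until we hit a hole.  Real code would
         * also handle segment-number wraparound; omitted in this sketch.
         */
        while (segno > 0 && offsets_segment_exists(datadir, segno - 1))
        {
            segno--;
            if (++steps > 0x10000)      /* stepped back implausibly far: bail */
                return 0;
        }

        /* First multixact covered by the oldest segment that still exists. */
        return segno * MXACTS_PER_OFFSETS_SEGMENT;
    }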
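And a sketch of idea 2, reusing the executeQueryOrDie/PQclear pattern
from the snippet quoted above. The loop and struct names are
illustrative, and old_datminmxid is a hypothetical field holding the
value read from the old cluster's pg_database, not something that
exists today:

    static void
    copy_old_datminmxid(void)
    {
        int         dbnum;

        for (dbnum = 0; dbnum < new_cluster.dbarr.ndbs; dbnum++)
        {
            DbInfo     *db = &new_cluster.dbarr.dbs[dbnum];

            /* carry over the old cluster's per-database datminmxid verbatim */
            PQclear(executeQueryOrDie(conn_template1,
                                      "UPDATE pg_catalog.pg_database "
                                      "SET   datminmxid = '%u' "
                                      "WHERE datname = '%s'",
                                      db->old_datminmxid,
                                      db->db_name));
        }
    }

Since pg_database is a shared catalog, issuing all the UPDATEs through
the template1 connection covers every database, just as the existing
single UPDATE does.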
>> - In DetermineSafeOldestOffset, find_multixact_start() is used to set
>> MultiXactState->offsetStopLimit. If it fails here, we don't know when
>> to refuse multixact creation to prevent wraparound. Again, in
>> recovery, that's fine. If it happens in normal running, it's not
>> clear what to do. Refusing multixact creation is an awfully blunt
>> instrument. Maybe we can scan pg_multixact/offsets to determine a
>> workable stop limit: the first file greater than the current file that
>> exists, minus two segments, is a good stop point. Perhaps we ought to
>> use this mechanism here categorically, not just when
>> find_multixact_start() fails. It might be more robust than what we
>> have now.
>
> Blunt instruments have the desirable property of being simple. We don't
> want any more clockwork here, I think --- this stuff is pretty
> complicated already. As far as I understand, if during normal running
> we see that find_multixact_start has failed, sufficient vacuuming should
> get it straight eventually with no loss of data.
Unfortunately, I don't believe that to be true. If
find_multixact_start() fails, we have no idea how close we are to the
member wraparound point. Sure, we can start vacuuming, but the user
can be creating new, large multixacts at top speed while we're doing
that, which could cause us to wrap around before we can finish
vacuuming.
Furthermore, if we adopted the blunt instrument, users who are in this
situation would update to 9.4.3 (or whenever these fixes get released)
and find that they can't create new MXIDs for a possibly very
protracted period of time. That amounts to an outage for which users
won't thank us.
Looking at the files in the directory seems pretty simple in this
case, and quite a bit more fail-safe than what we're doing right now.
The current logic purports to leave a one-file gap in the member
space, but there's no guarantee that the gap really exists on disk the
way we think it does. With this approach, we can be certain that
there is a gap. And that is a darned good thing to be certain about.
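For concreteness, here's roughly what "derive the stop point from the
files that actually exist" might look like: scan the SLRU directory,
parse each file name as a hex segment number, find the existing
segment closest ahead of the current one in modular order, and back
off two segments from it. The total segment count (the wrap point) and
the exact directory are assumptions on my part, not anything settled
in this thread:

    #include <dirent.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    static bool
    find_stop_segment(const char *dir, uint32_t cur_seg,
                      uint32_t total_segments, uint32_t *stop_seg)
    {
        DIR        *d = opendir(dir);
        struct dirent *de;
        uint32_t    best_dist = 0;
        bool        found = false;

        if (d == NULL)
            return false;

        while ((de = readdir(d)) != NULL)
        {
            char       *end;
            uint32_t    seg = (uint32_t) strtoul(de->d_name, &end, 16);
            uint32_t    dist;

            if (end == de->d_name || *end != '\0')
                continue;       /* not a segment file name */

            /* forward distance from cur_seg to this segment, modulo the wrap */
            dist = (seg + total_segments - cur_seg) % total_segments;
            if (dist == 0)
                continue;       /* the current segment itself */

            if (!found || dist < best_dist)
            {
                best_dist = dist;
                found = true;
            }
        }
        closedir(d);

        if (found)
        {
            /* stop two segments short of the first existing file ahead of us */
            *stop_seg = (cur_seg + best_dist + total_segments - 2) % total_segments;
        }
        return found;
    }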
>> C. I think we should also change TruncateMultiXact() to truncate
>> offsets first, and then members. As things stand, if we truncate
>> members first, we increase the risk of seeing an offset that will fail
>> when passed to find_multixact_start(), because TruncateMultiXact()
>> might get interrupted before it finishes. That seems like an
>> unnecessary risk.
>
> Not sure about this point. We did it the way you propose previously,
> and found it to be a problem because sometimes we tried to read an
> offset file that was no longer there. Do we really read member files
> anywhere? I thought we only tried to read offset files. If we remove
> member files, what is it that we try to read and find not to be present?
Do you have a link to the previous discussion?
I mean, the problem we're having right now is that sometimes we have
an offset, but the corresponding member isn't there. So clearly
offsets reference members. Do members also reference offsets? I
didn't think so, but life is full of surprises.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company