Re: pg_upgrade and wraparound

From: Alexander Shutyaev <shutyaev(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_upgrade and wraparound
Date: 2018-06-11 18:32:57
Message-ID: CAGBp8g_6broXMaYDujm9B1Az+YvhVrJStZmbKNJApQ9aw0KTqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I'm back with more details.

First, I've deleted the smaller sslentry database, since I don't need it,
just so that it doesn't somehow spoil the picture. Now there is only 1 user
database - bof (OID=16400). After that I've ran the pg_upgrade on a clean
10.4 cluster and it failed in the same way.

Now, the answers to your queries.

>> 2) The upgrade stops because of transaction ID wraparound, which is
strange as that is not showing up in the 9.6 cluster I presume. You might
want the queries found below on the 9.6 and 10 clusters to help figure this
out:

>> SELECT c.oid::regclass as table_name,
>> greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age
>> FROM pg_class c
>> LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
>> WHERE c.relkind IN ('r', 'm');

9.6 result - see attached 96-query1.txt
10.4 result - see attached 104-query1.txt

>> SELECT datname, age(datfrozenxid) FROM pg_database;

9.6 result - see attached 96-query2.txt
10.4 result - see attached 104-query2.txt

>> It might be useful
>> to see pg_controldata output for the old cluster, as well as
>> "select datname, datfrozenxid from pg_database" output from the
>> old cluster.

for the query - see above, for pg_controldata:

9.6 - see attached 96-pg_controldata.txt
10.4 - see attached 104-pg_controldata.txt

>> Alexander, could you hack things up so autovacuum logging is enabled
>> (log_autovacuum_min_duration=0), and see whether it's triggered?

I'll be happy to, but that will require to run pg_upgrade once more and
that takes more that half a day and during this time clusters are not
available to me. Given the data I'm attaching it may happen that the
colleagues will want to see something else from my clusters or maybe change
some settings before running the pg_upgrade again. Therefore, I'll wait 12
hours after this message in case there will be any more requests and the
I'll run the pg_upgrade again.

Thank you all for trying to solve this matter, this is much appreciated! :)

2018-06-11 20:29 GMT+03:00 Andres Freund <andres(at)anarazel(dot)de>:

> On 2018-06-11 13:14:12 -0400, Tom Lane wrote:
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > I suspect the issue is that pg_resetwal does:
> > > if (set_xid != 0)
> > > {
> > > ControlFile.checkPointCopy.nextXid = set_xid;
> >
> > > /*
> > > * For the moment, just set oldestXid to a value that will
> force
> > > * immediate autovacuum-for-wraparound. It's not clear
> whether adding
> > > * user control of this is useful, so let's just do
> something that's
> > > * reasonably safe. The magic constant here corresponds
> to the
> > > * maximum allowed value of autovacuum_freeze_max_age.
> > > */
> > > ControlFile.checkPointCopy.oldestXid = set_xid -
> 2000000000;
> > > if (ControlFile.checkPointCopy.oldestXid <
> FirstNormalTransactionId)
> > > ControlFile.checkPointCopy.oldestXid +=
> FirstNormalTransactionId;
> > > ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
> > > }
> >
> > > but we have codepath that doesn't check for oldestXidDB being
> > > InvalidOid. Not great.
> >
> > Hm, I think I'd define the problem as "pg_resetwal is violating the
> > expectation that oldestXidDB be valid".
>
> Well, what could it do otherwise? ForceTransactionIdLimitUpdate()
> currently does a syscache check for database existence. That'll just
> return a lookup failure for InvalidOid, so we're reasonably good on that
> front.
>
> Using a hardcoded 2000000000 seems worse, will have funny results if
> running with a smaller autovacuum_freeze_max_age...
>
>
> > However, this just explains the basically-cosmetic issue that the
> > complaint message mentions OID 0. It doesn't really get us to the
> > answer to why Alexander is seeing a failure. It might be useful
> > to see pg_controldata output for the old cluster, as well as
> > "select datname, datfrozenxid from pg_database" output from the
> > old cluster.
>
> pg_upgrade starts the server with autovacuum disabled, I suspect
> restoring all the large objects ends up using a lot of transaction
> ids. GetNewTransactionId() should start autovacuum, but I'd guess that's
> where things are going wrong for some reason.
>
> Alexander, could you hack things up so autovacuum logging is enabled
> (log_autovacuum_min_duration=0), and see whether it's triggered?
>
> I'm not entirely clear why pg_restore appears to use a separate
> transaction for each large object, surely exascerbating the problem.
>
> Greetings,
>
> Andres Freund
>

Attachment Content-Type Size
96-pg_controldata.txt text/plain 2.3 KB
96-query1.txt text/plain 19.6 KB
96-query2.txt text/plain 152 bytes
104-pg_controldata.txt text/plain 2.4 KB
104-query1.txt text/plain 22.1 KB
104-query2.txt text/plain 152 bytes

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2018-06-11 19:10:21 Re: pg_upgrade and wraparound
Previous Message Adrian Klaver 2018-06-11 18:31:42 Re: Catching unique_violation exception on specific column/index