From: | Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> |
---|---|
To: | Alexander Shutyaev <shutyaev(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: pg_upgrade and wraparound |
Date: | 2018-06-11 19:10:21 |
Message-ID: | 651b245a-8874-0cd4-827c-a3691dbda6eb@aklaver.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On 06/11/2018 11:32 AM, Alexander Shutyaev wrote:
> I'm back with more details.
>
> First, I've deleted the smaller sslentry database, since I don't need
> it, just so that it doesn't somehow spoil the picture. Now there is only
> 1 user database - bof (OID=16400). After that I've ran the pg_upgrade on
> a clean 10.4 cluster and it failed in the same way.
>
> Now, the answers to your queries.
>
> >> 2) The upgrade stops because of transaction ID wraparound, which is
> strange as that is not showing up in the 9.6 cluster I presume. You
> might want the queries found below on the 9.6 and 10 clusters to help
> figure this out:
>
> >> SELECT c.oid::regclass as table_name,
> >> greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age
> >> FROM pg_class c
> >> LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
> >> WHERE c.relkind IN ('r', 'm');
>
> 9.6 result - see attached 96-query1.txt
> 10.4 result - see attached 104-query1.txt
I should have been clearer in my previous post, the above query is per
database. From the query below I am going to say the above query was
done on the bof database. Is that correct?
Given the below from 96-query2.txt:
template0 | 110588398
Can you run the table_name query in template0 in the 9.6 cluster?
>
> >> SELECT datname, age(datfrozenxid) FROM pg_database;
>
> 9.6 result - see attached 96-query2.txt
> 10.4 result - see attached 104-query2.txt
>
> >> It might be useful
> >> to see pg_controldata output for the old cluster, as well as
> >> "select datname, datfrozenxid from pg_database" output from the
> >> old cluster.
>
> for the query - see above, for pg_controldata:
>
> 9.6 - see attached 96-pg_controldata.txt
> 10.4 - see attached 104-pg_controldata.txt
>
> >> Alexander, could you hack things up so autovacuum logging is enabled
> >> (log_autovacuum_min_duration=0), and see whether it's triggered?
>
> I'll be happy to, but that will require to run pg_upgrade once more and
> that takes more that half a day and during this time clusters are not
> available to me. Given the data I'm attaching it may happen that the
> colleagues will want to see something else from my clusters or maybe
> change some settings before running the pg_upgrade again. Therefore,
> I'll wait 12 hours after this message in case there will be any more
> requests and the I'll run the pg_upgrade again.
>
> Thank you all for trying to solve this matter, this is much appreciated! :)
>
> 2018-06-11 20:29 GMT+03:00 Andres Freund <andres(at)anarazel(dot)de
> <mailto:andres(at)anarazel(dot)de>>:
>
> On 2018-06-11 13:14:12 -0400, Tom Lane wrote:
> > Andres Freund <andres(at)anarazel(dot)de <mailto:andres(at)anarazel(dot)de>> writes:
> > > I suspect the issue is that pg_resetwal does:
> > > if (set_xid != 0)
> > > {
> > > ControlFile.checkPointCopy.nextXid = set_xid;
> >
> > > /*
> > > * For the moment, just set oldestXid to a value that will force
> > > * immediate autovacuum-for-wraparound. It's not clear whether adding
> > > * user control of this is useful, so let's just do something that's
> > > * reasonably safe. The magic constant here corresponds to the
> > > * maximum allowed value of autovacuum_freeze_max_age.
> > > */
> > > ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
> > > if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId)
> > > ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId;
> > > ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
> > > }
> >
> > > but we have codepath that doesn't check for oldestXidDB being
> > > InvalidOid. Not great.
> >
> > Hm, I think I'd define the problem as "pg_resetwal is violating the
> > expectation that oldestXidDB be valid".
>
> Well, what could it do otherwise? ForceTransactionIdLimitUpdate()
> currently does a syscache check for database existence. That'll just
> return a lookup failure for InvalidOid, so we're reasonably good on that
> front.
>
> Using a hardcoded 2000000000 seems worse, will have funny results if
> running with a smaller autovacuum_freeze_max_age...
>
>
> > However, this just explains the basically-cosmetic issue that the
> > complaint message mentions OID 0. It doesn't really get us to the
> > answer to why Alexander is seeing a failure. It might be useful
> > to see pg_controldata output for the old cluster, as well as
> > "select datname, datfrozenxid from pg_database" output from the
> > old cluster.
>
> pg_upgrade starts the server with autovacuum disabled, I suspect
> restoring all the large objects ends up using a lot of transaction
> ids. GetNewTransactionId() should start autovacuum, but I'd guess that's
> where things are going wrong for some reason.
>
> Alexander, could you hack things up so autovacuum logging is enabled
> (log_autovacuum_min_duration=0), and see whether it's triggered?
>
> I'm not entirely clear why pg_restore appears to use a separate
> transaction for each large object, surely exascerbating the problem.
>
> Greetings,
>
> Andres Freund
>
>
--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Shutyaev | 2018-06-11 19:19:42 | Re: pg_upgrade and wraparound |
Previous Message | Alexander Shutyaev | 2018-06-11 18:32:57 | Re: pg_upgrade and wraparound |