Re: improving wraparound behavior

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: improving wraparound behavior
Date:	2019-05-04 03:08:44
Message-ID:	20190504030844.GR6197@tamriel.snowman.net
Lists:	pgsql-hackers

Greetings,

* Andres Freund (andres(at)anarazel(dot)de) wrote:
> On 2019-05-03 22:41:11 -0400, Stephen Frost wrote:
> > I suppose it is a pretty big change in the base autovacuum launcher to
> > be something that's run per database instead and then deal with the
> > coordination between the two... but I can't help but feel like it
> > wouldn't be that much *work*. I'm not against doing something smaller
> > but was something smaller actually proposed for this specific issue..?
>
> I think it'd be fairly significant. And that we should redo it from
> scratch if we go there - because what we have isn't worth using as a
> basis.

Alright, what I'm hearing here is that we should probably have a
dedicated thread for this discussion, if someone has the cycles to spend
on it. I'm not against that.

> > > I'm thinking that we'd do something roughly like (in actual code) for
> > > GetNewTransactionId():
> > >
> > > TransactionId dat_limit = ShmemVariableCache->oldestXid;
> > > TransactionId slot_limit = Min(replication_slot_xmin, replication_slot_catalog_xmin);
> > > TransactionId walsender_limit;
> > > TransactionId prepared_xact_limit;
> > > TransactionId backend_limit;
> > >
> > > ComputeOldestXminFromProcarray(&walsender_limit, &prepared_xact_limit, &backend_limit);
> > >
> > > if (IsOldest(dat_limit))
> > > ereport(elevel,
> > > errmsg("close to xid wraparound, held back by database %s"),
> > > errdetail("current xid %u, horizon for database %u, shutting down at %u"),
> > > errhint("..."));
> > > else if (IsOldest(slot_limit))
> > > ereport(elevel, errmsg("close to xid wraparound, held back by replication slot %s"),
> > > ...);
> > >
> > > where IsOldest wouldn't actually compare plainly numerically, but would
> > > actually prefer showing the slot, backend, walsender, prepared_xact, as
> > > long as they are pretty close to the dat_limit - as in those cases
> > > vacuuming wouldn't actually solve the issue, unless the other problems
> > > are addressed first (as autovacuum won't compute a cutoff horizon that's
> > > newer than any of those).
> >
> > Where the errhint() above includes a recommendation to run the SRF
> > described below, I take it?
>
> Not necessarily. I feel conciseness is important too, and this would be
> the most important thing to tackle.

I'm imagining a relatively rare scenario, just to be clear, where
"pretty close to the dat_limit" would apply to more than just one thing.
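
To make "pretty close to the dat_limit" concrete, here is a minimal standalone sketch of the kind of check the quoted IsOldest() might perform. Everything in it is an assumption for illustration: the name is_oldest_or_close(), the CLOSE_ENOUGH_XIDS slack value, and the simplified wraparound comparison (which ignores the special xids that the real TransactionIdPrecedes() handles).

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical slack: how many xids newer than dat_limit still counts
 * as "pretty close". The right value is an open question. */
#define CLOSE_ENOUGH_XIDS 1000

/* Modulo-2^32 comparison: true if a is logically older than b.
 * Simplified; does not special-case permanent xids. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * True if `limit` (a slot, walsender, prepared xact or backend horizon)
 * should be reported in preference to the database: it is either older
 * than dat_limit or at most CLOSE_ENOUGH_XIDS newer, so vacuuming alone
 * would not move the horizon forward.
 */
static bool
is_oldest_or_close(TransactionId limit, TransactionId dat_limit)
{
    return xid_precedes(limit, dat_limit) ||
        (uint32_t) (limit - dat_limit) <= CLOSE_ENOUGH_XIDS;
}
```

The unsigned subtraction keeps the distance check correct across the 2^32 wrap, which a plain numeric comparison would not.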

> > Also, should this really be an 'else if', or should it be just a set of
> > 'if()'s, thereby giving users more info right up-front?
>
> Possibly? But it'd also make it even harder to read the log / the system
> to keep up with logging, because we already log *so* much when close to
> wraparound.

Yes, we definitely log a *lot*, and probably too much since other
critical messages might get lost in the noise.

> If we didn't order it, it'd be hard for users to figure out which to
> address first. If we ordered it, people have to look further up in the log to
> figure out which is the most urgent one (unless we reverse the order,
> which is odd too).

This makes me think we should both order it and combine it into one
message... but that would then be pretty difficult to deal with,
potentially, both from a translation standpoint and from a "wow, that's
a huge log message" standpoint, which is kind of the idea behind the
SRF: to give you all that info in a more easily digestible manner.

Not sure I've got any great ideas on how to improve on this. I do think
that if we know that there are multiple different things that are within a
small number of xids of the oldest xmin then we should notify the user
about all of them, either directly in the error messages or by referring
them to the SRF, so they have the opportunity to address them all, or
at least know about them all. As mentioned though, it's likely to be a
quite rare thing to run into, so you'd have to be extra unlucky to even
hit this case and perhaps the extra code complication just isn't worth
it.
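
For what it's worth, the "notify about all of them, in order" idea might look roughly like the standalone sketch below. It is not proposed code: the HolderKind enum, holders_near_oldest() and the slack parameter are all hypothetical names, and the wraparound comparison is a simplified stand-in for TransactionIdPrecedes().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical labels for whatever can hold the xid horizon back. */
typedef enum HolderKind
{
    HOLDER_DATABASE,
    HOLDER_SLOT,
    HOLDER_WALSENDER,
    HOLDER_PREPARED_XACT,
    HOLDER_BACKEND,
    HOLDER_COUNT
} HolderKind;

/* Modulo-2^32 comparison: true if a is logically older than b. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Fill out[] with every holder whose limit is within `slack` xids of the
 * oldest limit, ordered oldest first, and return how many were written.
 * limits[] is indexed by HolderKind.
 */
static size_t
holders_near_oldest(const TransactionId limits[HOLDER_COUNT],
                    uint32_t slack, HolderKind out[HOLDER_COUNT])
{
    TransactionId oldest = limits[0];
    size_t      n = 0;

    /* Find the logically oldest limit. */
    for (int i = 1; i < HOLDER_COUNT; i++)
        if (xid_precedes(limits[i], oldest))
            oldest = limits[i];

    /* Collect the close ones, insertion-sorted so the oldest comes first. */
    for (int i = 0; i < HOLDER_COUNT; i++)
    {
        if ((uint32_t) (limits[i] - oldest) <= slack)
        {
            size_t      j = n++;

            while (j > 0 && xid_precedes(limits[i], limits[out[j - 1]]))
            {
                out[j] = out[j - 1];
                j--;
            }
            out[j] = (HolderKind) i;
        }
    }
    return n;
}
```

A caller could then emit one message for out[0] and point at the SRF for the rest, or list them all; in the common case only one entry comes back and nothing changes for the user.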

Thanks,

Stephen

In response to

Re: improving wraparound behavior at 2019-05-04 02:47:42 from Andres Freund

Responses

Re: improving wraparound behavior at 2019-05-06 07:06:00 from Andres Freund