Re: transaction wrap around

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: John R Pierce <pierce(at)hogranch(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: transaction wrap around
Date: 2017-12-10 23:07:02
Message-ID: CAMkU=1xTARZLmLhvgR30p3RwdS=qR-e12i7DatDoFoCfstvUdg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Dec 5, 2017 at 5:50 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> On Tue, Dec 5, 2017 at 5:43 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On Mon, Dec 4, 2017 at 5:52 PM, John R Pierce <pierce(at)hogranch(dot)com>
> wrote:
> >> On 12/4/2017 2:21 PM, chris kim wrote:
> >>> How would I investigate if my database is nearing a transaction wrap
> >>> around.
> >>
> >> it would be screaming bloody murder in the log, for one.
> >>
> >
> > Unfortunately, that comes far too late to repair the problem without a
> > substantial service interruption, on very high transaction throughput
> > installations.
> >
> > Also, people usually consult the logs to figure out what the problem is,
> > once they become aware that one exists. That is also too late.
>
> The problem is that our logic (1) focuses on when we should *start*
> freezing, not by when we'd like to be finished, and (2) is defined in
> such a way that many tables are likely to reach the trigger point at
> the same time.

Isn't this only the case when you have many insert only-tables? Other
tables are going to be vacuumed for wrap around at the first time they are
vacuumed (for other reasons) after reaching vacuum_freeze_table_age
- vacuum_freeze_min_age. That should be pretty well staggered because
they probably have different update and delete rates. But, having those
tables locked for an emergency vacuum which is not really an emergency is
certainly a pain.

> Even if your system can handle the load, you might not
> like the disruption to regular vacuuming and analyze work.
>
> An ideal system would estimate how long it's going to take and how
> long we've got (current tx consumption rate, xids remaining) before
> autovacuum_freeze_max_age is reached and then spread the work out so
> that we get it done just in time with minimal impact. Getting
> reliable estimates to control that seems hard though.
>

I think an ideal system wouldn't even have a autovacuum_freeze_max_age
setting at all. What good does it do? According to the docs, its function
is to limit the size of the clog files, not it to save your database from
emergency shutdown.

Other more general problem which I think this brings up is:

A way to separate the log file stream into things that need immediate
attention, versus things you want to have available in case you ever go
looking for them. You could probably have some process monitor the log
file and pull things it thinks need attention and send an email or SMS, but
that is an ugly way to do it.

Also, why do we only start issuing warnings 10 million before wrap-around
shutdown happens? On a very busy server, this is not enough warning. On a
not-very-business server, why would it matter as you will never get
anywhere near this point? Seems like we could increase this to 200 million
with no down-side. (Maybe the first 190e6 of those would only warn once
every 1024 transaction rather every 64).

Perhaps we could add a much simpler first defence that tries to
> prevent autovacuum_freeze_max_age (and its multixact cousin) from
> being reached like this: consider launching at most one wraparound
> vacuum for any relation that is *half way* to
> autovacuum_freeze_max_age. That gives the system a chance to handle
> each partition of a monster partitioned table calmly in series even if
> they have the same age dating back to schema creation/data load time.
> Perhaps it could consider adding more vacuum backends as you approach
> autovacuum_freeze_max_age, or something. Hopefully you'd never
> actually reach it.
>

So in the simplest embodiment, the autovacuum launcher would keep track of
the number of vacuum workers which were launched for "eager wrap around"
purposes, and limit it to no more than one of those particular kinds at a
time? Would a "real" wrap-around vacuum block a "eager" one?

Would such a worker consider itself to be an emergency and hold the table
lock hostage, or would it yield the lock like regular vacuum workers do?

Of course you can do what I just said with a cron job, and there may
> be better heuristics than that, but it'd be nice to find *some* way to
> make freeze max age more gradual by default on large databases, until
> such time as we can kill it with 64 bit xids or other major efforts.
> My understanding is that even with the new freeze map, most big
> systems will still pay the full price for the first wraparound vacuum
> freeze, so I still expect to encounter 20TB production databases in
> the wild that have gone into a wraparound frenzy confounding their
> owners.
>

I don't think the problem is so much the frenzy (wrap around vacuums are
throttled just as much as regular ones are, which is perhaps a bug not a
feature) but rather the uninterruptible nature. Someone tries to add a
column. The vacuum will not yield the field but continues to vacuum the
table at a very leisurely pace. Now everything grinds to a halt, because
an otherwise-momentary access exclusive lock can't be granted, and in turns
blocks all access. If that is not bad enough, people can't tolerate (or
even understand) this behavior, so they take steps to defeat it and so
cause even more problems.

(Which in turn brings up another general issue. When one process wants a
lock but is blocked by a weaker lock held by a slow process, it should
allow other fast process to jump over it and get a weaker lock compatible
with the held one. Identifying which processes are slow and which are
likely to be fast is the problem. Some other products allow this
indication to be provided by the user, but there may be better ways.)

Cheers,

Jeff

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Melvin Davidson 2017-12-10 23:10:25 Re: Display table entries using partial column entry
Previous Message Sherman Willden 2017-12-10 21:50:01 Display table entries using partial column entry