From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)mail(dot)com>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovacuum truncate exclusive lock round two
Date: 2012-12-05 19:00:45
Message-ID: CA+TgmobD6XOw1nZSdbFYVGWa=NzAQMuXO-4qQ2yxodwFohE+bA@mail.gmail.com
Lists: pgsql-hackers
On Wed, Dec 5, 2012 at 11:24 AM, Kevin Grittner <kgrittn(at)mail(dot)com> wrote:
> Robert Haas wrote:
>> Since people *already* raise deadlock_timeout to obscenely high
>> values (a minute? an hour???) and then complain that things blow
>> up in their face, I think there's a decent argument to be made
>> that piggybacking anything else on that setting is unwise.
>
> If people are really doing that, then I tend to agree. I wasn't
> aware of that practice.
It's probably not quite common enough to be called a "practice", but I
have encountered it a number of times in support situations. Alas, I
no longer remember the details of exactly what misery it caused, but I
do remember it wasn't good. :-)
>> Against that, FWICT, this problem only affects a small number of
>> users: Jan is the only person I can ever remember reporting this
>> issue. I'm not dumb enough to think he's the only person who it
>> affects; but my current belief is that it's not an enormously
>> common problem. So the main argument I can see against adding a
>> GUC is that the problem is too marginal to justify a setting of
>> its own. What I really see as the key issue is: suppose we
>> hardcode this to say 2 seconds. Is that going to fix the problem
>> effectively for 99% of the people who have this problem, or for
>> 25% of the people who have this problem? In the former case, we
>> probably don't need a GUC; in the latter case, we probably do.
>
> Given the fact that autovacuum will keep throwing workers at it to
> essentially loop indefinitely at an outer level, I don't think the
> exact setting of this interval is all that critical either. My gut
> feel is that anything in the 2 second to 5 second range would be
> sane, so I won't argue over any explicit setting within that range.
> Below that, I think the overhead of autovacuum coming back to the
> table repeatedly would probably start to get too high; above that
> we could be causing some small, heavily-updated table to be
> neglected by autovacuum -- especially if you get multiple
> autovacuum workers tied up in this delay on different tables at the
> same time.
I think that part of what's tricky here is that the dynamics of this
problem depend heavily on table size. I handled one support case
where lowering autovacuum_naptime to 15s was an indispensable part of
the solution, so in that case having an autovacuum worker retry for
more than a few seconds sounds kind of insane. OTOH, that case
involved a small, rapidly changing table. If you've got an enormous
table where vacuum takes an hour to chug through all of it, abandoning
the effort to truncate the table after a handful of seconds might
sound equally insane.
Maybe it'd be sensible to relate the retry time to the time spent
vacuuming the table. Say, if the amount of time spent retrying
exceeds 10% of the time spent vacuuming the table, with a minimum of
1s and a maximum of 1min, give up. That way, big tables will get a
little more leeway than small tables, which is probably appropriate.
> Regarding how many people are affected, I have seen several reports
> of situations where users claim massive impact on performance when
> autovacuum kicks in. The reports have not included enough detail to
> quantify the impact or in most cases to establish a cause, but this
> seems like it could have a noticeable impact, especially if the
> deadlock timeout was set to more than a second.
Yeah, I agree this could be a cause of those types of reports, but I
don't have any concrete evidence that any of the cases I've worked
were actually due to this specific issue. The most recent case of
this type I worked on was due to I/O saturation - which, since it
happened to be EC2, really meant network saturation.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company