Re: Autovacuum Improvements

From: "Joris Dobbelsteen" <Joris(at)familiedobbelsteen(dot)nl>
To: "Chris Browne" <cbbrowne(at)acm(dot)org>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Autovacuum Improvements
Date: 2007-01-09 00:20:38
Message-ID: 73427AD314CC364C8DF0FFF9C4D693FF54CB@nehemiah.joris2k.local
Lists: pgsql-general

Why not collect some information from live databases and perform some
analysis on it?

Possible values required for a (to-be-defined) vacuum heuristic:
- human classification of tables,
- updates/transactions done (per table/db),
- growth of tables and indexes
(all with respect to time, I believe).

Collecting real data has worked far better for me in the past than
arguing about hypothetical situations. Then you can let your model go
wild and see if it does what you think should happen (obviously
comparing against the current autovacuum implementation). A toy model
of the single-queue scheme from Chris's message is sketched after the
quoted text below.

What statistics would you need?
Who is capable of, and willing to, capture them?
Who is willing to do some analysis...
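
Concretely, most of the numbers above could be captured by polling the
cumulative statistics views on a timer. A rough sketch follows (psycopg2
is just one convenient driver; the DSN, CSV output, and five-minute
interval are arbitrary placeholders, not a proposal):

    # Sample per-table activity and total size periodically, so growth
    # and update rates can be analysed with respect to time afterwards.
    import csv
    import time

    import psycopg2

    STATS_SQL = """
        SELECT schemaname, relname,
               n_tup_ins, n_tup_upd, n_tup_del,
               pg_total_relation_size(relid) AS total_bytes
        FROM pg_stat_user_tables
    """

    def sample(dsn, outfile, interval=300):
        conn = psycopg2.connect(dsn)
        with open(outfile, "a", newline="") as f:
            writer = csv.writer(f)
            while True:
                cur = conn.cursor()
                cur.execute(STATS_SQL)
                now = time.time()
                for row in cur.fetchall():
                    writer.writerow((now,) + tuple(row))  # timestamped sample
                cur.close()
                f.flush()
                time.sleep(interval)

Left running against a live database, this yields exactly the kind of
per-table time series (updates done, growth of tables and indexes) that
a candidate heuristic could be replayed against.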

Just my EUR 0.02...

- Joris

>-----Original Message-----
>From: pgsql-general-owner(at)postgresql(dot)org
>[mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Chris Browne
>Sent: Monday, 8 January 2007 22:30
>To: pgsql-general(at)postgresql(dot)org
>Subject: Re: [GENERAL] Autovacuum Improvements
>
>nagy(at)ecircle-ag(dot)com (Csaba Nagy) writes:
>> On Sun, 2006-12-24 at 03:03, Christopher Browne wrote:
>> [snip]
>>> Seems to me that you could get ~80% of the way by having the simplest
>>> "2 queue" implementation, where tables with size < some threshold get
>>> thrown at the "little table" queue, and tables above that size go to
>>> the "big table" queue.
>>
>> That would most definitely not cut it for me; I have more than 2
>> categories of tables:
>>
>> - a few small but very often updated/inserted/deleted tables: these
>> must be continuously vacuumed; your "little queue" is not good enough
>> for that, as even the round trip between the small tables could lead
>> to bloat on them;
>
>I disagree; if we added more "work processes," that could eat
>quickly through the short end of the queue.
>
>> - a few small and moderately updated, that could live with the
>> "little queue";
>>
>> - a few big and frequently updated, but which only have a small
>> percentage of rows actively updated at any time: those could live with
>> the big queue;
>>
>> - the rest, which are rarely updated; I would put those in a separate
>> queue so they won't affect the others, because vacuuming them is
>> mostly not critical;
>>
>> The point is that I'm not sure there couldn't be even more reasons to
>> split the tables into even more queues, based on the importance of
>> vacuuming them combined with their update rate and size. If I can set
>> up my own queues I can experiment with what works best for me... for
>> the base setup you could provide some default queues. I wonder though
>> how you would handle the dynamics of tables: when will a small table
>> which grows start to be considered a big table for the purpose of
>> putting it in one queue or the other? I guess it would be done when
>> analyzing the table, which is also handled by autovacuum, so tables
>> with no vacuum queue settings could go to one of the 2 default queues
>> you mention.
>
>The heuristic I was thinking of didn't involve having two
>queues, but rather just 1. By having some size information,
>work processes could eat at the queue from both ends.
>
>If you have cases where tables need to be vacuumed *really*
>frequently, then you make sure that they are being injected
>frequently, and that some of the workers are tied to Just
>Doing Small Tables.
>
>I think that *does* cover your scenario quite adequately, and
>without having to get into having a bunch of queues.
>
>The heuristic is incomplete in one other fashion, namely that
>it doesn't guarantee that tables in the middle will ever get
>"gotten to."
>That mandates having a third policy, namely to have a worker
>that goes through tables in the (singular) queue in some form of
>chronological order.
>
>>> That should keep any small tables from getting "vacuum-starved."
>>>
>>> I'd think the next step would be to increase the number of queues,
>>> perhaps in a time-based fashion. There might be times when it's
>>> acceptable to vacuum 5 tables at once, so you burn through little
>>> tables "like the blazes," and handle larger ones fairly promptly.
>>> And other times when you don't want to do *any* big tables, and
>>> limit a single queue to just the itty bitty ones.
>>
>> This is all nice, and it would be cool if you could set it up per
>> vacuum queue. I mean, how much more effort would it be to allow vacuum
>> queues with generic settings like time windows with a max number of
>> threads for each window, and let the user explicitly assign tables to
>> those queues, instead of hard-coding the queues and their settings and
>> assigning tables to them based on size or any other heuristic?
>>
>> For the average application which needs simple settings, there could
>> be a default setup with the 2 queues you mention. If it were possible
>> to set up some rules to assign tables to queues based on their
>> properties at analyze time, instead of explicitly assigning them to
>> one queue or the other, that would be nice too; then you could
>> completely cover the default setup with those settings, and allow for
>> more complex setups for those who need it.
>
>My thinking has headed more towards simplifying this; two
>queues seems to be one too many :-).
>
>>> This approach allows you to stay mostly heuristic-based, as opposed
>>> to having to describe policies in gratuitous detail.
>>
>> I agree that for simple setups that would be OK, but like I said, if
>> it were easy enough to code those heuristics, provide some sane setup
>> as the default, and then let the user optimize it, that would be a
>> cool solution.
>>
>> Now it's true I don't really know how you would code 'assign all
>> tables which are smaller than x rows to vacuum queue "little-tables"'
>> ... maybe by providing a function to the queue which "matches" on the
>> table? And you can change that function? No idea, but it probably
>> can be done...
>
>Based on the three policies I've seen, it could make sense to
>assign worker policies:
>
>1. You have a worker that moves its way through the queue in some sort
>   of sequential order, based on when the table is added to the queue,
>   to guarantee that all tables get processed, eventually.
>
>2. You have workers that always pull the "cheapest" tables in the
>   queue, perhaps with some sort of upper threshold that they won't go
>   past.
>
>3. You have workers that alternate between eating from the two ends of
>   the queue.
>
>Only one queue is needed, and there's only one size parameter involved.
>
>Having multiple workers of type #2 seems to me to solve the
>problem you're concerned about.
>--
>(format nil "~S(at)~S" "cbbrowne" "cbbrowne.com")
>http://cbbrowne.com/info/spiritual.html
>thorfinn(at)netizen(dot)com(dot)au <http://www.netizen.com.au/>
>Millihelen, adj:
> The amount of beauty required to launch one ship.
>
>---------------------------(end of broadcast)---------------------------
>TIP 2: Don't 'kill -9' the postmaster
>
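
For what it's worth, here is a toy Python model of the single-queue
scheme Chris describes above: one queue ordered by size, a chronological
sweeper (policy 1), cheapest-first workers with an optional ceiling
(policy 2), and a worker that alternates between the two ends
(policy 3). All names are invented for the sketch; none of this is
actual autovacuum code.

    import bisect
    import itertools
    from dataclasses import dataclass, field

    @dataclass(order=True)
    class Entry:
        size: int                               # the one ordering parameter
        enqueued: float = field(compare=False)  # when the table was injected
        table: str = field(compare=False)

    class VacuumQueue:
        """A single queue, kept sorted by table size."""

        def __init__(self):
            self.entries = []

        def add(self, table, size, now):
            bisect.insort(self.entries, Entry(size, now, table))

        def pop_smallest(self, ceiling=None):
            # Policy 2: always take the cheapest table, optionally
            # refusing anything above an upper threshold.
            if self.entries and (ceiling is None
                                 or self.entries[0].size <= ceiling):
                return self.entries.pop(0)
            return None

        def pop_largest(self):
            return self.entries.pop() if self.entries else None

        def pop_oldest(self):
            # Policy 1: chronological sweeper, so tables in the middle
            # are guaranteed to get "gotten to" eventually.
            if not self.entries:
                return None
            i = min(range(len(self.entries)),
                    key=lambda j: self.entries[j].enqueued)
            return self.entries.pop(i)

    class BothEndsWorker:
        """Policy 3: alternately eat from the two ends of the queue."""

        def __init__(self, queue):
            self._next = itertools.cycle([queue.pop_smallest,
                                          queue.pop_largest])

        def next_job(self):
            return next(self._next)()

Feeding a model like this from statistics sampled off a live database
would let the different worker mixes be compared against what autovacuum
does today.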
