Re: cost delay brainstorming

From: Jay <jsudrikoss(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: cost delay brainstorming
Date: 2024-10-22 04:32:29
Message-ID: CAPdcCKo=fiTJKtrfC2JZECL6L+jW77wUfy-2UbcEeFdi3fTZtA@mail.gmail.com
Lists: pgsql-hackers

I had suggested something more than just cost-limit throttling, namely a
re-startable vacuum:
https://www.postgresql.org/message-id/CAPdcCKpvZiRCoDxQoo9mXxXAK8w=bX5NQdTTgzvHV2sUXp0ihA@mail.gmail.com

With today's cloud infrastructure and monitoring, it may not be difficult to
predict patterns of idle periods. If we keep manual vacuum running during
those idle periods, there is much less chance of autovacuum becoming
disruptive. This could be built as an extension or inside the engine.
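
For illustration, here is a very rough external sketch of that idea (plain
libpq, not an extension; the connection string, table name, and whatever
scheduler decides we are inside an idle window are all placeholders):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Placeholder connection string and table name. */
    PGconn   *conn = PQconnectdb("dbname=mydb");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /*
     * Manual VACUUM is unthrottled by default (vacuum_cost_delay = 0), so
     * work done here, during the predicted idle window, is work autovacuum
     * will not have to do during busier hours.
     */
    res = PQexec(conn, "VACUUM (ANALYZE) my_big_table");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "VACUUM failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}

The same thing could be a one-line psql call from cron; the point is only
that the scheduling decision lives wherever the idle-period prediction does.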

However, a re-startable vacuum is a bigger change than just a config
parameter, and it didn't get much traction.

- Jay Sudrik

On Tue, Jun 18, 2024 at 1:09 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Hi,
>
> As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest
> problem with autovacuum as it exists today is that the cost delay is
> sometimes too low to keep up with the amount of vacuuming that needs
> to be done. I sketched a solution during the talk, but it was very
> complicated, so I started to try to think of simpler ideas that might
> still solve the problem, or at least be better than what we have
> today.
>
> I think we might be able to get fairly far by observing that if the
> number of running autovacuum workers is equal to the maximum allowable
> number of running autovacuum workers, that may be a sign of trouble,
> and the longer that situation persists, the more likely it is that
> we're in trouble. So, a very simple algorithm would be: If the maximum
> number of workers have been running continuously for more than, say,
> 10 minutes, assume we're falling behind and exempt all workers from
> the cost limit for as long as the situation persists. One could
> criticize this approach on the grounds that it causes a very sudden
> behavior change instead of, say, allowing the rate of vacuuming to
> gradually increase. I'm curious to know whether other people think
> that would be a problem.
>
> I think it might be OK, for a couple of reasons:
>
> 1. I'm unconvinced that the vacuum_cost_delay system actually prevents
> very many problems. I've fixed a lot of problems by telling users to
> raise the cost limit, but virtually never by lowering it. When we
> lowered the delay by an order of magnitude a few releases ago -
> equivalent to increasing the cost limit by an order of magnitude - I
> didn't personally hear any complaints about that causing problems. So
> disabling the delay completely some of the time might just be fine.
>
> 1a. Incidentally, when I have seen problems because of vacuum running
> "too fast", it's not been because it was using up too much I/O
> bandwidth, but because it's pushed too much data out of cache too
> quickly. A long overnight vacuum can evict a lot of pages from the
> system page cache by morning - the ring buffer only protects our
> shared_buffers, not the OS cache. I don't think this can be fixed by
> rate-limiting vacuum, though: to keep the cache eviction at a level
> low enough that you could be certain of not causing trouble, you'd
> have to limit it to an extremely low rate which would just cause
> vacuuming not to keep up. The cure would be worse than the disease at
> that point.
>
> 2. If we decided to gradually increase the rate of vacuuming instead
> of just removing the throttling all at once, what formula would we use
> and why would that be the right idea? We'd need a lot of state to
> really do a calculation of how fast we would need to go in order to
> keep up, and that starts to rapidly turn into a very complicated
> project along the lines of what I mooted in Vancouver. Absent that,
> the only other idea I have is to gradually ramp up the cost limit
> higher and higher, which we could do, but we would have no idea how
> fast to ramp it up, so anything we do here feels like it's just
> picking random numbers and calling them an algorithm.
>
> If you like this idea, I'd like to know that, and hear any further
> thoughts you have about how to improve or refine it. If you don't, I'd
> like to know that, too, and any alternatives you can propose,
> especially alternatives that don't require crazy amounts of new
> infrastructure to implement.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
>
>
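
For concreteness, here is a minimal standalone sketch (plain C with made-up
names, not actual autovacuum code) of the check described above: treat the
system as falling behind once every autovacuum worker slot has stayed busy
continuously for more than ten minutes, and bypass the cost limit for as
long as that remains true.

#include <stdbool.h>
#include <time.h>

/* Example threshold from the proposal: how long all autovacuum worker
 * slots must stay busy before we assume vacuuming is falling behind. */
#define SATURATION_THRESHOLD_SECS (10 * 60)

/* When every worker slot became busy at once (0 = not currently saturated). */
static time_t all_slots_busy_since = 0;

/*
 * Return true when the maximum number of autovacuum workers has been
 * running continuously for longer than the threshold, i.e. when the
 * cost limit should be bypassed while the situation persists.
 */
static bool
should_bypass_cost_limit(int running_workers, int max_workers, time_t now)
{
    if (running_workers < max_workers)
    {
        all_slots_busy_since = 0;       /* a slot freed up: reset the clock */
        return false;
    }

    if (all_slots_busy_since == 0)
        all_slots_busy_since = now;     /* saturation has just begun */

    return difftime(now, all_slots_busy_since) > SATURATION_THRESHOLD_SECS;
}

A launcher-style loop would call this each time it wakes up, passing the
current worker count; the ten-minute threshold is just the example value
from the proposal above.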
