Re: Vacuum statistics

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Jim Nasby <jnasby(at)upgrade(dot)com>, Andrei Zubkov <zubkov(at)moonset(dot)ru>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, a(dot)lepikhov(at)postgrespro(dot)ru
Subject: Re: Vacuum statistics
Date: 2024-11-02 12:22:41
Message-ID: 85b963fe-5977-43aa-9241-75b862abcc69@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 30.10.2024 01:23, Jim Nasby wrote:
> On Oct 29, 2024, at 7:40 AM, Andrei Zubkov<zubkov(at)moonset(dot)ru> wrote:
>> Hi,
>>
>> Thanks for your attention to our patch!
>>
>> On Mon, 2024-10-28 at 16:03 -0500, Jim Nasby wrote:
>>>> Yes, but as Masahiko-san pointed out, PgStat_TableCounts is almost
>>>> tripled in space. That a huge change from having no statistics on
>>>> vacuum to have it in much more detail than everything else we
>>>> currently have. I think the feasible way might be to introduce
>>>> some
>>>> most demanded statistics first then see how it goes.
>>> Looking at the stats I do think the WAL stats are probably not
>>> helpful. First, there’s nothing users can do to tune how much WAL is
>>> generated by vacuum. Second, this introduces the risk of users saying
>>> “Wow, vacuum is creating a lot of WAL! I’m going to turn it down!”,
>>> which is most likely to make matters worse. There’s already a lot of
>>> stuff that goes into WAL without any detailed logging; if we ever
>>> wanted to provide a comprehensive view of what data is in WAL that
>>> should be handled separately.
>> Yes, there is nothing we can directly do with WAL generated by vacuum,
>> but WAL generation is the part of vacuum work, and it will indirectly
>> affected by the changes of vacuum settings. So, WAL statistics is one
>> more dimension of vacuum workload. Also WAL stat is universal metric
>> which is measured cluster-wide and on the statement-level with
>> pg_stat_statements. Vacuum WAL counters will explain the part of
>> difference between those metrics. Besides vacuum WAL counters can be
>> used to locate abnormal vacuum behavior caused by a bug or the data
>> corruption. I think if the DBA is smart enough to look at vacuum WAL
>> generated stats and to understand what it means, the decision to
>> disable the autovacuum due to its WAL generation is unlikely.
> I’m generally for more stats rather than less - really just a question of how much we’re worried about stats overhead.
>
>> Anyway I think some stats can be excluded to save some memory. The
>> first candidates are the system_time and user_time fields. Those are
>> very valuable, but are measured by the rusage stats, which won't be
>> available on all platforms. I think total_time and delay_time would be
>> sufficient.
> Yeah, I considered throwing those under the bus. I agree they’re only marginally useful.
>
>> The second is the interrupts field. It is needed for monitoring to know
>> do we have them or not, so tracking them on the database level will do
>> the trick. Interrupt is quite rare event, so once the monitoring system
>> will catch one the DBA can go to the server log for the details.
> Just to confirm… by “interrupt” you mean vacuum encountered an error?
Yes it is.

I updated patches. I excluded system and user time statistics and save
number of interrupts only for database.I removed the ability to get
statistics for all tables, now they can only be obtained for an oid
table [0], as suggested here. I also renamed the statistics from
pg_stat_vacuum_tables to pg_stat_get_vacuum_tables and similarly for
indexes and databases. I noticed that that’s what they’re mostly called.
Ready for discussion.

>>> For all the views the docs should clarify that total_blks_written
>>> means blocks written by vacuum, as opposed to the background Ywriter.
>> We have the "Number of database blocks written by vacuum operations
>> performed on this table" in the docs now. Do you mean we should
>> specifically note the vacuum process here?
> The reason the stat is confusing is because it doesn’t have the meaning that the name implies. Most people that see this will think it’s actually measuring blocks dirtied, or at least something closer to that. It definitely hides the fact that many of the dirtied blocks could actually be written by the bgwriter. So an improvement to the docs would be “Number of blocks written directly by vacuum or auto vacuum. Blocks that are dirtied by a vacuum process can be written out by another process.”
>
> Which makes me realize… I think vacuum only counts a block as dirtied if it was previously clean? If so the docs for that metric need to clarify that vacuum might modify a block but not count it as having been dirtied.
I think this makes sense, but I haven't fixed it in the documentation
yet. I need time to learn this, to be honest. I'll answer later.
>>> Sadly index_vacuum_count is may not useful at all at present. At
>>> minimum you’d need to know the number of times vacuum had run in
>>> total. I realize that’s in pg_stat_all_tables, but that doesn’t help
>>> if vacuum stats are tracked or reset separately.
>> I'm in doubt - is it really possible to reset the vacuum stats
>> independent of pg_stat_all_tables?
> Most stats can be independently reset, so I was thinking these wouldn’t be an exception. If that’s not the case then I think the docs need to mention pg_stat_all_tables.(auto)vacuum_count, since it’s in a completely different view. Or better yet, include the vacuum/analyze related stats that are in pg_stat_all_tables in pg_stat_vacuum_tables.
To be honest, it was obvious to me, but we can mention it.
>>> Autovacuum will self-terminate if it would block another process
>>> (unless it’s an aggressive vacuum) - that’s definitely something that
>>> should be tracked. Not just the number of times that happens, but
>>> also stats about how much work was lost because of this.
>> Agreed.

>>> Tuning vacuum_freeze_min_age (and the MXID variant) is rather
>>> complicated. We maybe have enough stats on whether it could be set
>>> lower, but there’s no visibility on how the settings affect how often
>>> vacuum decides to be aggressive. At minimum, we should have stats on
>>> when vacuum is aggressive, especially since it significantly changes
>>> the behavior of autovac.
>> When you say "agressive" do you mean the number of times when the
>> vacuum was processing the table with the FREEZE intention? I think this
>> is needed too.
> Yes. I intentionally use the term “aggressive” (as the code does) to avoid confusion with the FREEZE option (which as I’m sure you know simply forces some GUCs to 0). Further complicating this is that auto vac will report this as “to prevent wraparound”…
>
> In any case… I’m actually leaning towards there should be a complete second set of counters for aggressive vacuums, because of how differently they work. :(

>>> Speaking of which… there should be stats on any time vacuum decided
>>> on it’s own to skip index processing due to wraparound proximity.
>> Maybe we should just count the number of times when the vacuum was
>> started to prevent wraparound?
> Unfortunately even that isn’t simple… auto vac and manual vac have different GUCs, and of course there’s the FREEZE option. And then there’s the issue that MXIDs are handled completely separately.
>
> Even ignoring all of that… by default an aggressive vacuum won’t skip indexes. That only happens when you hit vacuum_(multixact_)failsafe_age.
>
> BTW, something I’ve been mulling over is what stats related to cleanup might be tracked at a system level. I’m thinking along the lines of how often heap_prune_page or the index marking code come across a dead tuple they can’t do anything about yet because it’s still visible. While you could track that per-relation, I’m not sure how helpful that actually is since it’s really a long-running transaction problem.
>
> Similarly, it’d be nice if we had stats about how often all of the auto vac workers were occupied; something that’s also global in nature.
>
>
>
I'll see how these statistics can be calculatedand will add in the patch.

[0]
https://www.postgresql.org/message-id/CAPpHfdvSo3mfH%3D2m4ADCHAuN%3D22SnBY3TrPaPbGKTw3r_Jaw7Q%40mail.gmail.com

--
Regards,
Alena Rybakina
Postgres Professional

Attachment Content-Type Size
v11-0003-Machinery-for-grabbing-an-extended-vacuum-statistics.patch text/x-patch 19.9 KB
v11-0002-Machinery-for-grabbing-an-extended-vacuum-statistics.patch text/x-patch 42.0 KB
v11-0001-Machinery-for-grabbing-an-extended-vacuum-statistics.patch text/x-patch 67.7 KB
v11-0004-Add-documentation-about-the-system-views-that-are-us.patch text/x-patch 27.1 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alena Rybakina 2024-11-02 12:24:31 Re: Vacuum statistics
Previous Message Tatsuo Ishii 2024-11-02 11:43:10 Re: Fix for Extra Parenthesis in pgbench progress message