Re: Index bloat and REINDEX/VACUUM optimization for partial index

From: jayaprabhakar k <jayaprabhakar(at)gmail(dot)com>
To: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
Cc: psql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat and REINDEX/VACUUM optimization for partial index
Date: 2023-09-01 01:05:59
Message-ID: CA+t=SiKBxCcixHZJO+zxNT-93tftucDiPfSRkbbqM1179LTACw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Thanks Maxim, that's something we are considering now - keep the in
progress tasks in one table and periodically move the old and completed
tasks to an archive table.
We could use a view that unions them for most queries.

I'm not sure if that's the best alternative though, and we want to know if
there are any gotchas to worry about.

On Thu, Aug 31, 2023, 8:06 AM Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com> wrote:

>
> At any moment, there are *around 1000-1500 tasks in pending statuses*
>> (Init + InProgress) out of around 500 million tasks.
>>
>> Now, we have a task monitoring query that will look for all pending tasks
>> that have not received any update in the last n minutes.
>>
>> ```
>> SELECT [columns list]
>> FROM tasks
>> WHERE status NOT IN (3,4,5) AND created > NOW() - INTERVAL '30 days'
>> AND updated < NOW() - interval '30 minutes'
>> ```
>>
>> Since we are only interested in the pending tasks, I created a partial
>> index
>> `*"tasks_pending_status_created_type_idx" btree (status, created,
>> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>>
>> This worked great initially, however this started to get bloated very
>> very quickly because, every task starts in pending state, gets multiple
>> updates (and many of them are not HOT updates, working on optimizing fill
>> factor now), and eventually gets deleted from the index (as status changes
>> to success).
>>
>
> From my experience I suspect that there is a problem with "of around 500
> million tasks."
> Autovacuum indeed cleans old dead index entries, but how many such dead
> index entries will be collected on the 500M table before autovacuum kicks
> in?
>
> With the default value of autovacuum_vacuum_scale_factor (The default is
> 0.2 (20% of table size).) index will collect like 100M outdated/dead index
> entries before autovacuum kicks in and cleans them all (in a worst case),
> and of course it will lead to huge index bloat and awful performance.
>
> Even if you scale down autovacuum_vacuum_scale_factor to some
> unreasonable low value like 0.01, the index still bloats to the 5M dead
> entries before autovacuum run, and constant vacuuming of a huge 500M table
> will put a huge load on the database server.
>
> Unfortunately there is no easy way out of this situation from database
> side, in general I recommend not trying to implement a fast pacing queue
> like load inside of a huge and constantly growing table, it never works
> well because you cannot keep up partial efficient indexes for the queue in
> a clean/non-bloated state.
>
> In my opinion the best solution is to keep list of entries to process ("*around
> 1000-1500 tasks in pending statuses")* duplicated in the separate tiny
> table (via triggers or implement it on the application level), in that case
> autovacuum will be able quickly clean dead entries from the index.
>
> Kind Regards,
> Maxim
>
>
> --
> Maxim Boguk
> Senior Postgresql DBA
>
> Phone UA: +380 99 143 0000
> Phone AU: +61 45 218 5678
>
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Jeff Janes 2023-09-01 03:01:06 Re: Index bloat and REINDEX/VACUUM optimization for partial index
Previous Message Jeff Janes 2023-08-31 16:52:31 Re: Queries containing ORDER BY and LIMIT started to work slowly