Index bloat and REINDEX/VACUUM optimization for partial index

From: jayaprabhakar k <jayaprabhakar(at)gmail(dot)com>
To: psql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Index bloat and REINDEX/VACUUM optimization for partial index
Date: 2023-08-29 00:32:38
Message-ID: CA+t=SiK+QY12dQDp81NGu6n0Uqx3gPkByM8SaMBQAnUmT-qYBg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

TL;DR:
Observations:

1. REINDEX requires a full table scan
- Roughly create a new index, rename index, drop old index.
- REINDEX is not incremental. running reindex frequently does not
reduce the future reindex time.
2. REINDEX does not use the index itself
3. VACUUM does not clean up the indices. (relpages >> reltuples) I
understand, vacuum is supposed to remove pages only if there are no live
tuples in the page, but somehow, even immediately after vacuum, I see
relpages significantly greater than reltuples. I would have assumed,
relpages <= reltuples
4. Query Planner does not consider index bloat, so uses highly bloated
partial index that is terribly slow over other index

Question: Is there a way to optimize postgres vacuum/reindex when using
partial indexes?

We have a large table (tasks) that keep track of all the tasks that are
created and their statuses. Around 1.4 million tasks per day are created
every day (~15 inserts per second).

One of the columns is int `status` that can be one of (1 - Init, 2 -
InProgress, 3 - Success, 4 - Aborted, 5 - Failure) (Actually, there are
more statuses, but this would give the idea)

On average, a task completes in around a minute with some outliers that can
go as long as a few weeks. There is a periodic heartbeat that updates the
last updated time in the table.

At any moment, there are *around 1000-1500 tasks in pending statuses* (Init
+ InProgress) out of around 500 million tasks.

Now, we have a task monitoring query that will look for all pending tasks
that have not received any update in the last n minutes.

```
SELECT [columns list]
FROM tasks
WHERE status NOT IN (3,4,5) AND created > NOW() - INTERVAL '30 days' AND
updated < NOW() - interval '30 minutes'
```

Since we are only interested in the pending tasks, I created a partial index
`*"tasks_pending_status_created_type_idx" btree (status, created,
task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.

This worked great initially, however this started to get bloated very very
quickly because, every task starts in pending state, gets multiple updates
(and many of them are not HOT updates, working on optimizing fill factor
now), and eventually gets deleted from the index (as status changes to
success).

```

\d+ tasks

Table "public.tasks"
Column | Type | Collation |
Nullable | Default | Storage | Compression |
Stats target | Description
-------------------------------+----------------------------+-----------+----------+-----------------------------------+----------+-------------+--------------+-------------
id | bigint | |
not null | nextval('tasks_id_seq'::regclass) | plain | |
|
client_id | bigint | |
not null | | plain | |
|
status | integer | |
not null | | plain | |
|
description | character varying(128) | |
not null | | extended | |
|
current_count | bigint | |
not null | | plain | |
|
target_count | bigint | |
not null | | plain | |
|
status_msg | character varying(4096) | |
| | extended | |
|
blob_key | bigint | |
| | plain | |
|
created | timestamp with time zone | |
not null | | plain | |
|
updated | timestamp with time zone | |
not null | | plain | |
|
idle_time | integer | |
not null | 0 | plain | |
|
started | timestamp with time zone | |
| | plain | |
|
Indexes:
"tasks_pkey" PRIMARY KEY, btree (id)
"tasks_created_idx" btree (created)
"tasks_pending_status_created_idx" btree (status, created) WHERE status
<> ALL (ARRAY[3, 4, 5])

"tasks_client_id_status_created_idx" btree (client_id, status, created
DESC)
"tasks_status_idx" btree (status)
Access method: heap
Options: autovacuum_vacuum_scale_factor=0.02,
autovacuum_analyze_scale_factor=0.02, fillfactor=70
```

Immediately after REINDEX

```
SELECT relname,reltuples,relpages FROM pg_class WHERE relname like
'tasks%idx%';

relname | reltuples | relpages
------------------------------------+----------------+----------
tasks_pending_status_created_idx | 34175 | 171
tasks_created_idx | 5.3920026e+08 | 11288121
tasks_client_id_status_created_idx | 5.3920026e+08 | 7031615
tasks_status_idx | 5.3920026e+08 | 2215403
(9 rows)

```

A couple of days after manual full REINDEX.
```
SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts,
relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE (relname
like 'tasks%idx%' OR relname='tasks');
relname | relpages | reltuples |
relallvisible | relkind | relnatts | relhassubclass | reloptions |
pg_table_size
------------------------------------+----------+----------------+---------------+---------+----------+----------------+------------+---------------
tasks_pending_status_created_idx | 79664 | 201831 |
0 | i | 3 | f | | 652771328
tasks_created_idx | 11384992 | 5.42238e+08 |
0 | i | 1 | f | | 93481443328
tasks_client_id_status_created_idx | 7167147 | 5.42274e+08 |
0 | i | 5 | f | | 58727710720
tasks_status_idx | 2258820 | 5.4223546e+08 |
0 | i | 1 | f | | 18508734464
tasks | 71805187 | 5.171037e+08 |
71740571 | r | 30 | f | | 613282308096
```

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Peter Geoghegan 2023-08-29 01:49:13 Re: Index bloat and REINDEX/VACUUM optimization for partial index
Previous Message Les 2023-08-28 11:47:22 Re: Slow query, possibly not using index