Re: Index bloat and REINDEX/VACUUM optimization for partial index

From: jayaprabhakar k <jayaprabhakar(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: psql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat and REINDEX/VACUUM optimization for partial index
Date: 2023-08-31 00:42:58
Message-ID: CA+t=Si+15FrugfFBineVp4Kb02Qx1kJvfK2n1m2+-J969-=sjg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, Aug 29, 2023, 12:43 PM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Mon, Aug 28, 2023 at 8:33 PM jayaprabhakar k <jayaprabhakar(at)gmail(dot)com>
> wrote:
>
>> Hi,
>>
>> TL;DR:
>> Observations:
>>
>> 1. REINDEX requires a full table scan
>> - Roughly create a new index, rename index, drop old index.
>> - REINDEX is not incremental. running reindex frequently does not
>> reduce the future reindex time.
>> 2. REINDEX does not use the index itself
>> 3. VACUUM does not clean up the indices. (relpages >> reltuples) I
>> understand, vacuum is supposed to remove pages only if there are no live
>> tuples in the page, but somehow, even immediately after vacuum, I see
>> relpages significantly greater than reltuples. I would have assumed,
>> relpages <= reltuples
>> 4. Query Planner does not consider index bloat, so uses highly
>> bloated partial index that is terribly slow over other index
>>
>> Your points 3 and 4 are not correct. empty index pages are put on a
> freelist for future reuse, they are not physically removed from the
> underlying index files. Maybe they are not actually getting put on the
> freelist or not being reused from the freelist for some reason, but that
> would be a different issue. Use the extension pgstattuple to see what its
> function pgstatindex says about the index.
>
> The planner does take index bloat into consideration, but its effect size
> is low. Which it should be, as empty or irrelevant pages should be
> efficiently skipped during the course of most index operations. To figure
> out what is going with your queries, you should do an EXPLAIN (ANALYZE,
> BUFFERS) of them, but with it being slow and with it being fast.
>
>
>> Question: Is there a way to optimize postgres vacuum/reindex when using
>> partial indexes?
>>
>
> Without knowing what is actually going wrong, I can only offer
> generalities. Make sure you don't have long-lived transactions which
> prevent efficient clean up. Increase the frequency on which vacuum runs on
> the table. It can't reduce the size of an already bloated index, but by
> keeping the freelist stocked it should be able prevent it from getting
> bloated in the first place. Also, it can remove empty pages from being
> linked into the index tree structure, which means they won't need to be
> scanned even though they are still in the file. It can also free up space
> inside non-empty pages for future reuse within that same page, and so that
> index tuples don't need to be chased down in the table only to be found to
> be not visible.
>
>
>> ```
>> SELECT [columns list]
>> FROM tasks
>> WHERE status NOT IN (3,4,5) AND created > NOW() - INTERVAL '30 days'
>> AND updated < NOW() - interval '30 minutes'
>> ```
>>
>> Since we are only interested in the pending tasks, I created a partial
>> index
>> `*"tasks_pending_status_created_type_idx" btree (status, created,
>> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>>
>
> This looks like a poorly designed index. Since the status condition
> exactly matches the index where clause, there is no residual point in
> having "status" be the first column in the index, it can only get in the
> way (for this particular query). Move it to the end, or remove it
> altogether.
>
Interesting. I don't understand why it will get in the way. Unfortunately
we have a few other cases where status is used in filter. That said, I will
consider how to get this to work.
Would removing status from the index column, improve HOT updates %? For
example, changing status from 1->2, doesn't change anything on the index
(assuming other criteria for HOT updates are met), but I am not sure how
the implementation is.

> Within the tuples which pass the status check, which inequality is more
> selective, the "created" one or "updated" one?
>
Obviously updated time is more selective (after status), and the created
time is included only to exclude some bugs in our system that had left some
old tasks stuck in progress (and for sorting). We do try to clean
up occasionally, but not each time.
However we cannot add an index on `updated` column because that timestamp
gets updated over 10x on average for each task. Since if a single index use
a column, then the update will not be HOT, and every index needs to be
updated. That will clearly add a bloat to every index. Did I miss something?

>
> Cheers,
>
> Jeff
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Christian Beikov 2023-08-31 08:19:10 Join order optimization
Previous Message David Rowley 2023-08-30 21:43:00 Re: Queries containing ORDER BY and LIMIT started to work slowly