Understanding partial index selection

From: Owen Nelson <onelson(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Understanding partial index selection
Date: 2023-11-28 22:23:17
Message-ID: CAG-u7ziq87b2spN_sPeVktM=GFLGNFsB2dVqSY745A6cJCmr8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi!

I've got a query running periodically which has been degrading in
performance as time goes on. I'm hoping to better understand what the
contributing factors are.

Given a table with:
```
postgres=# \d message
Table "public.message"
Column | Type | Collation | Nullable |
Default
------------+--------------------------+-----------+----------+-----------------------------
id | character varying | C | not null |
created_at | timestamp with time zone | | not null |
org_id | character varying | C | not null |
app_id | character varying | C | not null |
event_type | character varying | | not null |
uid | character varying | | |
payload | jsonb | | |
channels | jsonb | | |
expiration | timestamp with time zone | | not null | now() + '90
days'::interval
Indexes:
"pk_message" PRIMARY KEY, btree (id)
"ix_message_per_app" btree (app_id, id DESC)
"ix_message_uid_unique_app_cond" UNIQUE, btree (app_id, uid) WHERE uid
IS NOT NULL
"message_payload_not_null_pidx" btree (expiration) WHERE payload IS NOT
NULL
Foreign-key constraints:
"fk_message_app_id_application" FOREIGN KEY (app_id) REFERENCES
application(id) ON DELETE CASCADE
Referenced by:
TABLE "messagedestination" CONSTRAINT
"fk_messagedestination_msg_id_message" FOREIGN KEY (msg_id) REFERENCES
message(id) ON DELETE CASCADE
```

I periodically run a query like this:
```
UPDATE message SET payload = NULL WHERE id IN (
SELECT id FROM message
WHERE
payload IS NOT NULL
AND expiration <= now()
LIMIT 5000
FOR UPDATE SKIP LOCKED
);
```

The aim is to cull payload values after the retention period has expired.
The hope is the sub-select would leverage the index
"message_payload_not_null_pidx" but when I `EXPLAIN ANALYZE` the query, I
see a seq scan instead.

```
Update on message (cost=1773.41..44611.36 rows=5000 width=283) (actual
time=20913.192..20913.194 rows=0 loops=1)
-> Nested Loop (cost=1773.41..44611.36 rows=5000 width=283) (actual
time=20881.320..20886.541 rows=51 loops=1)
-> HashAggregate (cost=1772.85..1822.85 rows=5000 width=88)
(actual time=20881.286..20882.052 rows=51 loops=1)
Group Key: ("ANY_subquery".id)::text
-> Subquery Scan on "ANY_subquery" (cost=0.00..1760.35
rows=5000 width=88) (actual time=8425.022..20881.244 rows=51 loops=1)
-> Limit (cost=0.00..1710.35 rows=5000 width=38)
(actual time=8425.017..20881.219 rows=51 loops=1)
-> LockRows (cost=0.00..2112304.92 rows=6175068
width=38) (actual time=8425.016..20881.212 rows=51 loops=1)
-> Seq Scan on message message_1
(cost=0.00..2050554.24 rows=6175068 width=38) (actual
time=8424.977..20880.945 rows=65 loops=1)
Filter: ((payload IS NOT NULL) AND
(expiration <= now()))
Rows Removed by Filter: 37772897
-> Index Scan using pk_message on message (cost=0.56..8.56 rows=1
width=191) (actual time=0.073..0.073 rows=1 loops=51)
Index Cond: ((id)::text = ("ANY_subquery".id)::text)
Planning Time: 0.237 ms
Execution Time: 20913.310 ms
```

I have read that the planner's index preference is driven by statistics. I
looked at the correlation of both the expiration and payload columns, ~0.8
and ~0.4 respectively. I understand the planner prefers indices where the
correlation is farther from 0 (closer to the min/max of the -1, 1 range),
but I'm not sure where the threshold might be. Still, 0.4-ish feels low to
me.

I wonder if the fact the index is partial using `WHERE payload IS NOT NULL`
combined with the fact we're changing the value to NULL is sort of
"punching holes" in the index, reducing correlation over time. I'm not sure
how to test or prove this is the case.

I also wonder if the index type (btree) is inappropriate for some reason,
or what other factors I should be thinking about here.

Separately, I saw this earlier thread which describes a somewhat similar
problem, though the details are somewhat different.
https://www.postgresql.org/message-id/20230119090425.GA8608%40arp.lijzij.de

Perhaps all I'd need is to `CREATE STATISTICS` on these two columns to make
the index viable again.

I hope to just get a better understanding of what's going on. There's an
upcoming change to the schema which will likely sidestep this situation,
but there's still a lesson to learn here.

Thanks,
Owen

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2023-11-28 23:07:57 Re: Understanding partial index selection
Previous Message Erik Wienhold 2023-11-28 17:53:17 Re: Get back the number of columns of a result-set prior to JSON aggregation