How to deal with analyze gathering irrelevant stats

From: Rémi Chatenay <remi(dot)chatenay(at)doctolib(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: How to deal with analyze gathering irrelevant stats
Date: 2021-01-11 15:50:12
Message-ID: CAA==dd=gp5c4oQe8ffOu7iPUrgikA508robTnUV=AXx+o5EJfQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

Thanks in advance for your help. I'm putting as much context and details as
possible, but let me know if you have any questions.

What?

We are experiencing some slow queries due to the query planner using an
incorrect index. It is using an unoptimized index because the stats are
computed during the night when the data is not the same as during the day.

Context

We have a table conversations like that

|id|status|user_id|

and 2 indexes:

CREATE INDEX index_conversations_on_user_id_and_status ON
public.conversations USING btree (user_id, status);

CREATE INDEX index_conversations_on_status ON public.conversations USING
btree (status)

The slow query is the following:

SELECT id FROM conversations WHERE status = 'in_progress' AND user_id = 123

We expect the query planner to use the
index_conversations_on_user_id_and_status but it sometimes uses the other
one.

What's happening ?

There are hundreds of conversations with a status 'in_progress' at a given
time during the day but virtually none during the night.

So when the analyze is run during the night, PG then thinks that using the
index_conversations_on_status will return almost no rows and so it uses
this index instead of the combined one.

When the analyze is run during the day, PG correctly uses the right index
(index_conversations_on_user_id_and_status)

[With an analyze run during the day]

Limit (cost=0.43..8.45 rows=1 width=8) (actual time=1.666..1.666 rows=0
loops=1)

-> Index Scan using index_conversations_on_user_id_and_status on
conversations (cost=0.43..8.45 rows=1 width=8) (actual_time=1.665..1.665
rows:0 loops:1)

Index Cond: ((user_id = 123) AND ((status)::text = 'in_progress'::text))

Filter: (id <> 1)

Planning Time: 8.642 ms

Execution Time: 1.693 ms

[With an analyze run during the night]

Limit (cost=0.43..8.46 rows=1 width=8) (actual time=272.812..272.812 rows=0
loops=1)

-> Index Scan using index_conversations_on_status on conversations
(cost=0.43..8.46 rows=1 width=8) (actual_time=272.812..272.812 rows:0
loops:1)

Index Cond: ((status)::text = 'in_progress'::text))

Filter: (id <> 1) AND (user_id = 123)

Rows Removed by Filter: 559

Planning Time: 0.133 ms

Execution Time: 272.886 ms

The question

We currently run a manual weekly vacuum analyze during the night. I'm
wondering what are our possible solutions. One is to manually run the
analyze during the day but is there a way to tell PG to run the auto
analyze at a given time of the day for example ? I guess we are not the
first ones to have data patterns that differ between when the analyze is
run and the query is run.

Config
Postgres version: 11Table Metadata

SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts,
relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE
relname='conversations';

relname | relpages | reltuples | relallvisible | relkind | relnatts |
relhassubclass | reloptions | pg_table_size

-------------+----------+------------+---------------+---------+----------+----------------+------------+---------------

conversations | 930265 | 7.3366e+06 | 902732 | r | 16
| f | | 7622991872
Maintenance Setup

We have manual vacuum analyze every week during the night.

GUC Settings
Unsure what's necessary...

"autovacuum_analyze_threshold" = "50"
"autovacuum_max_workers" = "3",
"autovacuum_naptime" = "60"
"autovacuum_vacuum_threshold" = "50"

Statistics: n_distinct, MCV, histogram

SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV,
tablename, attname, inherited, null_frac, n_distinct,
array_length(most_common_vals,1) n_mcv, array_length(histogram_bounds,1)
n_hist, correlation FROM pg_stats WHERE attname='status' AND
tablename='conversations' ORDER BY 1 DESC;

frac_mcv | tablename | attname | inherited | null_frac | n_distinct |
n_mcv | n_hist | correlation

----------+-------------+---------+-----------+-----------+------------+-------+--------+-------------

0.999967 | conversations | status | f | 0 | 6 |
5 | | 0.967121

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Justin Pryzby 2021-01-11 16:29:04 Re: How to deal with analyze gathering irrelevant stats
Previous Message Don Seiler 2021-01-11 15:23:19 Re: High COMMIT times