Re: Bad query plan decision when using multiple column index - postgresql uses only first column then filters

From: Cosmin Prund <cprund(at)gmail(dot)com>
To: Pgsql Performance <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Re: Bad query plan decision when using multiple column index - postgresql uses only first column then filters
Date: 2020-01-16 19:52:15
Message-ID: CAGU4dz9YM7Ldgws=Sq1nxwuXT612rDxjXjg-9OJhxJz83Ardpg@mail.gmail.com
Lists: pgsql-performance

Before the ANALYZE, most_common_vals had 22 entries; after the ANALYZE it has
23, including an entry for "92" with a frequency of 0.0441333 (which is about
right).
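
For reference, this is roughly how I'm looking at the stats (the table name
below is a placeholder for the real one):

SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'mytable'   -- placeholder
  AND attname = 'Ver';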

The stats target for the "Ver" column is already at 10000. I'm going to have
to bring the stats target back down on everything, but I'm not sure about
this. The life-cycle of this table is a bit special. Once in a while a new
"Version" is created: 1 to 3 million records are inserted at once, all with
the same Version and with sequential Ids (restarting from 1 with each
version). The unfortunate side effect is that I get huge clusters of records
with the same "Ver". I wrote a script that calculates the correct "n_distinct"
value for the column and repeatedly runs ANALYZE, increasing the stats target
by 5% on each loop, until the reported "n_distinct" is greater than 75% of the
correct number. I thought this would help me find a good value for the stats
target, but it invariably pushes the stats target all the way up to 10000.
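
Each loop of the script boils down to something like this (names and the
concrete target value are illustrative):

-- bump the per-column target by ~5% (the hard upper limit is 10000)
ALTER TABLE mytable ALTER COLUMN "Ver" SET STATISTICS 1050;
ANALYZE mytable ("Ver");
-- compare the estimate against the true count; stop once it is above 75%
SELECT n_distinct FROM pg_stats
WHERE tablename = 'mytable' AND attname = 'Ver';
SELECT count(DISTINCT "Ver") FROM mytable;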

Finally, I have one last option: take the stats into my own hands. Since
inserting anything into those tables is such a big (but rare and well-defined)
event, I could simply set the stats target to ZERO and compute correct values
on my own after pushing a new version. The issue here is that I don't
understand the system well enough to make this work.
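
As far as I can tell, the supported pieces for that would be a per-column
n_distinct override plus a manual ANALYZE after each load, along these lines
(names and the value are placeholders); anything beyond that would mean
writing pg_statistic by hand, which is what I'd rather avoid:

-- tell ANALYZE to use a hand-computed distinct count instead of estimating it
ALTER TABLE mytable ALTER COLUMN "Ver" SET (n_distinct = 25);
-- run this right after pushing a new version
ANALYZE mytable ("Ver");
-- note: SET STATISTICS 0 makes ANALYZE skip the column entirely (no MCV list
-- at all), so the override only helps while stats are still being collected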

Hopefully I'll be able to reproduce this on a backup of the database so I can
safely experiment. Until I manage to reproduce it, I don't think I can make
any more progress, so thank you, everyone, for the help.

On Thu, 16 Jan 2020 at 20:55, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Cosmin Prund <cprund(at)gmail(dot)com> writes:
> > Running the same query with a different "Ver" produces a proper plan.
>
> Oh, that *is* interesting.
>
> After studying the code a bit more I see why this is possible when I
> originally thought not. The case that you are interested in is one that
> has special handling -- it's a "lower-order ScalarArrayOpExpr" in the
> terms of the code. This means that get_index_paths will actually produce
> two index paths, one with the IN clause as an indexqual and one without,
> because it expects that they have different sort behaviors [1]. So then
> we do have a chance for a cost-based choice, making it possible for the
> estimated selectivity of the higher-order clause to affect the outcome.
>
> I'm still a bit surprised that it wouldn't choose the alternative with
> the IN ... but if the estimated number of rows matching just the first
> column is small enough, it might see the paths as having indistinguishable
> costs, and then it's down to luck which it chooses.
>
> > There are 25 valid values for "Ver" in this database. I ran the query for
> > all of them. The only one misbehaving is "92". I ran the query with
> > random (invalid) values for Ver; the query plan always attempts to use
> > the index with both values.
> > I looked into "most_common_vals" in pg_stats; this value (92) is not in
> > that list.
>
> Are the other 24 all in the list?
>
> > Finally I ran "ANALYZE" again and now the problem went away. Running the
> > query with Ver=92 uses the proper plan. I'm not happy with this - I know I
> > haven't solved the problem (I've run ANALYZE multiple times before).
>
> Maybe increasing the stats target for the "Ver" column would help. It
> sounds like you want to get to a point where all the valid values are
> given in the MCV list, so that the estimates for them will be accurate.
>
> regards, tom lane
>
> [1] Right at the moment, it seems like that's wrong and we could just
> generate one path. Need to study this.
>
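
For anyone reading this later: the query in question is roughly of the shape
below (table, column names and values are placeholders), assuming an index on
("Ver", "Id") so that the IN list is the "lower-order" clause Tom describes;
running it for a well-behaved Ver and then for 92 is how I've been comparing
the two plans:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM mytable
WHERE "Ver" = 92
  AND "Id" IN (111, 222, 333);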
