Re: BUG #17652: Performance degradation after migrating from v13.4 to v14.4

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Michael Guissine <mguissine(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17652: Performance degradation after migrating from v13.4 to v14.4
Date: 2022-10-20 05:03:00
Message-ID: CAMkU=1xApkQP_apew7=OkUq+RAWgAZqUg7ayJ5LqE1J0aoB=_w@mail.gmail.com
Lists: pgsql-bugs

On Wed, Oct 19, 2022 at 3:44 PM Michael Guissine <mguissine(at)gmail(dot)com>
wrote:

> paritioned_table_hash_p3_partition_key_text_idx is in fact pg_trgm index,
> you are right!
> interesting, why would it use that index at all on some partitions but not
> on others? the query is a simple match on 3 columns and we have other
> indexes to cover it
>

The two index choices are (falsely) similar in cost, so even small
differences in the selectivity estimates are enough to drive the choice one
way or the other from partition to partition.

There are (at least) two planner deficiencies at work here.

For the actually faster index, the planner thinks it will need to re-descend
the index once for every value in the IN-list on the leading column, and
each of those re-descents is going to end up on a different leaf page,
charging about one random_page_cost apiece. It counts a re-descent for
every item in the IN-list, for every partition, but it only needs to
re-descend for the IN-list items which also happen to match the remainder
for that hash partition. This incorrectly inflates the cost of the faster
index. (You could argue that if the executor really does do the
re-descents, then it is correct for the planner to account for them. But I
suspect that many of the hopeless re-descents would end up on leaf pages
already in memory. Anyway, fixing the planner and fixing the executor would
probably go hand in hand.)
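
To make that concrete, here is a hedged sketch of the query shape involved
(none of these names or values are from the actual schema):

```sql
-- Suppose the IN-list has 20 items and the table has 8 hash partitions.
-- On average only ~20/8 = 2.5 items can hash to any one partition's
-- remainder, yet the planner charges ~20 leaf-page re-descents (about
-- one random_page_cost each) for *every* partition:
SELECT *
FROM paritioned_table            -- hash-partitioned on "key" (hypothetical)
WHERE key IN ('k01', 'k02', /* ... */ 'k20')
  AND start_time >= '2022-10-01'
  AND text = 'needle';
```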

For the slower index, the GIN cost-estimation code doesn't estimate how
many rechecks there will be, only how many end results there will be. But
the rechecks are where most of the true cost is, so this one ends up with
an incorrectly low cost. The overestimation of one and the underestimation
of the other bring the two estimates close enough together that random
variations in the stats can drive the choice between them. Now you could
ask, "all the row estimates are 1, so how could there be any variation?"
But notice that for the bitmap heap scans, there is a filter step which
removes around 1000 rows from each partition. We know how many rows were
actually removed from each partition, but we are not told how many rows the
planner thought it would remove. Variations in this unseen estimated
filter count would lead to variations in the cost estimate.
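
For what it's worth, EXPLAIN only exposes the actual side of that filter,
not the estimated side. A hypothetical query of the same shape:

```sql
-- EXPLAIN ANALYZE will show the *actual* "Rows Removed by Filter" on
-- the Bitmap Heap Scan nodes, but the "rows=" estimate on each node is
-- the post-filter estimate only; the planner's estimate of how many
-- rows the filter removes is never printed.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM paritioned_table            -- names are hypothetical
WHERE key IN ('k01', 'k02', 'k03')
  AND start_time >= '2022-10-01'
  AND text = 'needle';
```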

While fixing the planner would be nice, I think the best current solution
for you is to change the order of columns in the 3-column index to be
(text, start_time, key). Since the condition on "text" is extremely
selective, the planner should know that by the time it gets to the key
column it can check it with a single hash-table probe rather than a bunch
of costly re-descents. This should be estimated to be faster than either
of the existing two indexes, and also should actually be faster.
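
In DDL terms, with a made-up index name (the real table and column names
may differ), that reordering looks like:

```sql
-- Hypothetical name; on a partitioned table this cascades to one index
-- per partition. Column order puts the most selective condition first,
-- so the trailing columns are checked within an already-tiny set of
-- index entries instead of driving many leaf-page re-descents.
CREATE INDEX paritioned_table_text_start_time_key_idx
    ON paritioned_table (text, start_time, key);
```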

Cheers,

Jeff
