Re: Planning performance problem (67626.278ms)

From: Jeremy Schneider <schnjere(at)amazon(dot)com>
To: "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Planning performance problem (67626.278ms)
Date: 2021-04-20 21:00:33
Message-ID: 3f5e5673-06aa-bf7c-cdb7-0e053c45ba26@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Two years later, I still remember this. And today I just confirmed
someone hitting this on open source PG13.

On the incident I saw today, the tuple reads during planning were
leading to excessive mxid waits - I suspect because memory pressure
probably caused the mxid offsets/member files to get flushed and start
reading from disk. It was the mxid waits that actually made the system
go sideways - I was just surprised that we saw most of the system CPU in
planning time (rather than execution).

I really think there's something here in the planner that's still
causing headaches for people; I've this seen a few times now. Looks to
me like the common theme is:

try_mergejoin_path ->
initial_cost_mergejoin ->
mergejoinscansel ->
scalarineqsel ->
ineq_histogram_selectivity ->
get_actual_variable_range

And from here it starts calling index_getnext() which can go on for a
very long time and the system seems to fall over if it begins to involve
much physical I/O.

I'll continue to keep an eye out for this, and keep this thread updated
if I find anything else that might move the understanding forward.

Thanks,
Jeremy Schneider

On 4/8/19 16:26, Andres Freund wrote:
> Hi,
>
> On 2019-04-08 16:10:17 -0700, Jeremy Schneider wrote:
>>
>> For what it's worth, I have seen a similar issue on Aurora PG 9.6 where
>> query planning took a very long time (multiple minutes). In this
>> particular case, there wasn't anything Aurora-specific about the call to
>> get_actual_variable_range. We weren't able to distinctly identify the
>> root cause or build a reproducible test case -- but we suspect that an
>> inefficiency might exist in community PostgreSQL code.
>>
>> For debugging, a few ideas:
>>
>> 1) capture a stack with pstack or perf record --call-graph
>> 2) capture the execution plan of the SQL w slow planning
>> 3) capture detailed stats for all relations and objects involved
>> 4) capture the usual info for bug reporting (preface section in docs)
>>
>> A reproducible test case is the gold standard; I'm keeping my eyes open
>> for another case too.
>>
>> For the slow planning case that I saw, the slow process was almost
>> entirely in this call stack (captured with perf record --call-graph):
>> ...
>> index_fetch_heap
>> index_getnext
>> get_actual_variable_range
>> ineq_histogram_selectivity
>> scalarineqsel
>> mergejoinscansel
>> initial_cost_mergejoin
>> try_mergejoin_path
>> add_paths_to_joinrel
>> make_join_rel
>> join_search_one_level
>> standard_join_search
>> make_one_rel
>> query_planner
>> ...
>
> I suspect some of this might be related to < 11 not having the following
> commit:
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd

--
Jeremy Schneider
Database Engineer
Amazon Web Services

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Pavan Pusuluri 2021-04-21 01:45:16 Re: OLEDB for PostgreSQL
Previous Message aditya desai 2021-04-19 13:10:29 Re: OLEDB for PostgreSQL