From: | Jeremy Schneider <schnjere(at)amazon(dot)com> |
---|---|
To: | "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org> |
Cc: | Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Planning performance problem (67626.278ms) |
Date: | 2021-04-20 21:00:33 |
Message-ID: | 3f5e5673-06aa-bf7c-cdb7-0e053c45ba26@amazon.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Two years later, I still remember this. And today I just confirmed
someone hitting this on open source PG13.
On the incident I saw today, the tuple reads during planning were
leading to excessive mxid waits - I suspect because memory pressure
probably caused the mxid offsets/member files to get flushed and start
reading from disk. It was the mxid waits that actually made the system
go sideways - I was just surprised that we saw most of the system CPU in
planning time (rather than execution).
I really think there's something here in the planner that's still
causing headaches for people; I've this seen a few times now. Looks to
me like the common theme is:
try_mergejoin_path ->
initial_cost_mergejoin ->
mergejoinscansel ->
scalarineqsel ->
ineq_histogram_selectivity ->
get_actual_variable_range
And from here it starts calling index_getnext() which can go on for a
very long time and the system seems to fall over if it begins to involve
much physical I/O.
I'll continue to keep an eye out for this, and keep this thread updated
if I find anything else that might move the understanding forward.
Thanks,
Jeremy Schneider
On 4/8/19 16:26, Andres Freund wrote:
> Hi,
>
> On 2019-04-08 16:10:17 -0700, Jeremy Schneider wrote:
>>
>> For what it's worth, I have seen a similar issue on Aurora PG 9.6 where
>> query planning took a very long time (multiple minutes). In this
>> particular case, there wasn't anything Aurora-specific about the call to
>> get_actual_variable_range. We weren't able to distinctly identify the
>> root cause or build a reproducible test case -- but we suspect that an
>> inefficiency might exist in community PostgreSQL code.
>>
>> For debugging, a few ideas:
>>
>> 1) capture a stack with pstack or perf record --call-graph
>> 2) capture the execution plan of the SQL w slow planning
>> 3) capture detailed stats for all relations and objects involved
>> 4) capture the usual info for bug reporting (preface section in docs)
>>
>> A reproducible test case is the gold standard; I'm keeping my eyes open
>> for another case too.
>>
>> For the slow planning case that I saw, the slow process was almost
>> entirely in this call stack (captured with perf record --call-graph):
>> ...
>> index_fetch_heap
>> index_getnext
>> get_actual_variable_range
>> ineq_histogram_selectivity
>> scalarineqsel
>> mergejoinscansel
>> initial_cost_mergejoin
>> try_mergejoin_path
>> add_paths_to_joinrel
>> make_join_rel
>> join_search_one_level
>> standard_join_search
>> make_one_rel
>> query_planner
>> ...
>
> I suspect some of this might be related to < 11 not having the following
> commit:
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd
--
Jeremy Schneider
Database Engineer
Amazon Web Services
From | Date | Subject | |
---|---|---|---|
Next Message | Pavan Pusuluri | 2021-04-21 01:45:16 | Re: OLEDB for PostgreSQL |
Previous Message | aditya desai | 2021-04-19 13:10:29 | Re: OLEDB for PostgreSQL |