Re: Slow query and wrong row estimates for CTE

From: Dane Foster <studdugie(at)gmail(dot)com>
To: Yoan SULTAN <yeyeyo(at)gmail(dot)com>
Cc: psql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Slow query and wrong row estimates for CTE
Date: 2021-02-17 20:32:23
Message-ID: CA+Wxin+xUTwvX9f2Z73a-9ZqGcb8=wbdrw=vC=H5mQZQrAqP6g@mail.gmail.com
Lists: pgsql-performance

On Wed, Feb 17, 2021 at 1:37 PM Yoan SULTAN <yeyeyo(at)gmail(dot)com> wrote:

> Hi all,
>
> This is my first post on this mailing list; I really enjoy it.
> I wanted to add some details and answers to this discussion.
>
I'm happy you've decided to join the conversation, and that you've opened
up an entirely new avenue for me to investigate and learn from. I feel like
I'm about to level up my SQL-fu! 😊

> 17 févr. 2021 à 17:52, Dane Foster <studdugie(at)gmail(dot)com> a écrit :
>
>>
>> A small update (see below/inline).
>>
>>
>> On Tue, Feb 16, 2021 at 2:11 PM Dane Foster <studdugie(at)gmail(dot)com> wrote:
>>
>>> Short conclusion:
>>> Switching from CTEs to temporary tables and analyzing reduced the
>>> runtime from 15 minutes to about 1.5 minutes.
>>>
> The attempt_scores table is pretty big and it is referenced 3 times; try to
> rewrite the query to reduce the number of references to this table. For example:
>
>
> EXPLAIN (ANALYZE, BUFFERS)
> WITH reports AS (
>   SELECT student_id,
>          assignment_id,
>          max(score_value) FILTER (WHERE score_name = 'student_performance_index'),
>          max(attempt_report_id) maxid,
>          max(score_value) spi
>   FROM attempt_scores
>   GROUP BY student_id, assignment_id
>   HAVING max(score_value) > 0
>      AND max(score_value) FILTER (WHERE score_name = 'student_performance_index') = max(score_value)
> )
> SELECT avg(spi) spi,
>        avg(CASE score_name WHEN 'digital_clinical_experience' THEN score_value END) dce,
>        avg(CASE score_name WHEN 'tier1_subjective_data_collection' THEN score_value END) sdc
> FROM attempt_scores
> JOIN reports ON reports.maxid = attempt_scores.attempt_report_id;
>
Given:

    HAVING max(score_value) > 0
       AND max(score_value) FILTER (WHERE score_name = 'student_performance_index') = max(score_value)

why is there a FILTER clause on max(score_value) but no FILTER clause on
max(attempt_report_id)?

Some context for my question: I'm new to aggregate expressions, so I don't
have a strong mental model for what's happening. Let me tell you what I
*think* is happening and you can correct me.

The new HAVING clause you've added ensures that for each student/assignment
group we select the max spi value (i.e., the one where score_name =
'student_performance_index'). Therefore, isn't the FILTER clause in the
SELECT list redundant? And if it's *not* redundant, then why isn't it
necessary for max(attempt_report_id)?
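To make the question concrete, here is a hypothetical variant of the CTE
with the FILTER also applied to max(attempt_report_id) — a sketch only,
not tested against the real schema:

```sql
-- Hypothetical variant: FILTER applied to max(attempt_report_id) as well,
-- so maxid comes only from 'student_performance_index' rows. Without the
-- FILTER, maxid is the max attempt_report_id across ALL score rows in the
-- group, which may not be the row that carries the spi score.
WITH reports AS (
  SELECT student_id,
         assignment_id,
         max(attempt_report_id) FILTER (WHERE score_name = 'student_performance_index') AS maxid,
         max(score_value) AS spi
  FROM attempt_scores
  GROUP BY student_id, assignment_id
  HAVING max(score_value) > 0
     AND max(score_value) FILTER (WHERE score_name = 'student_performance_index') = max(score_value)
)
SELECT count(*) FROM reports;
```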

> Also, I would continue to increase work_mem up to 200MB until the external
> merge is no longer required.
> SET work_mem = '200MB'; -- changes it at the session level only
>
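For reference, the session- and transaction-scoped forms of that change can
be sketched as follows (standard PostgreSQL commands; the 200MB figure is
Yoan's suggestion, not a measured optimum):

```sql
-- Session scope: persists until the connection closes or RESET is run.
SET work_mem = '200MB';
SHOW work_mem;                 -- confirm the new value

-- Transaction scope: reverts automatically at COMMIT/ROLLBACK.
BEGIN;
SET LOCAL work_mem = '200MB';
-- ... run the query; re-check EXPLAIN (ANALYZE, BUFFERS) for "external merge" ...
COMMIT;

RESET work_mem;                -- back to the server default
</imports>
```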
>>
>>> Longer conclusion:
>>>
>>> @Justin Pryzby <pryzby(at)telsasoft(dot)com>
>>>
>>> - I experimented w/ materializing the CTEs and it helped at the
>>> margins but did not significantly contribute to a reduction in runtime.
>>> - No clustering was required because once I switched to temporary
>>> tables the new plan no longer used the for_upsert index.
>>>
>>> @Michael Lewis <mlewis(at)entrata(dot)com>
>>>
>>> - Increasing work_mem to 100MB (up from 4MB) helped at the margins
>>> (i.e., some 100's of millisecond improvement) but did not represent a
>>> significant reduction in the runtime.
>>> - It wasn't obvious to me which window function would be appropriate
>>> for the problem I was trying to solve therefore I didn't experiment w/ that
>>> approach.
>>>
>>> I want to update/correct this statement:
>>
>>>
>>> - The selectivity of score_name='student_performance_index' was not
>>> enough for the planner to choose an index over doing a FTS.
>>>
>>> I added a partial index (WHERE
>> score_name='student_performance_index'::citext) and that had a *dramatic*
>> impact. That part of the query went from ~12 seconds to ~1 second.
>>
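That partial index might look like the sketch below. Only the WHERE clause
comes from the message; the indexed columns and the index name are guesses,
since the full schema isn't shown in the thread:

```sql
-- Hypothetical partial index covering only 'student_performance_index' rows
-- (~1.2M of the 9.3M rows); the column list is illustrative.
CREATE INDEX attempt_scores_spi_idx
    ON attempt_scores (student_id, assignment_id, score_value)
 WHERE score_name = 'student_performance_index'::citext;
```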
>
> Another way to get performance gains on this query: CREATE HASH INDEX ON
> attempt_scores(score_name); -- since score_name doesn't seem to have high
> cardinality
>
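As a side note, PostgreSQL's actual syntax for a hash index uses USING hash
(there is no CREATE HASH INDEX form), and such an index helps only equality
predicates. A sketch, with a made-up index name:

```sql
-- Hash index on the low-cardinality score_name column. Useful only for
-- equality predicates, e.g. score_name = 'student_performance_index'.
-- Hash indexes are WAL-logged (crash-safe) only from PostgreSQL 10 onward.
CREATE INDEX attempt_scores_score_name_hash_idx
    ON attempt_scores USING hash (score_name);
```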
> Finally, thank you both for helping me bring this poor performing query to
>>> heel. Your insights were helpful and greatly appreciated.
>>>
>>> Sincerely,
>>>
>>> Dane
>>>
>>>
>>> On Tue, Feb 16, 2021 at 10:25 AM Dane Foster <studdugie(at)gmail(dot)com>
>>> wrote:
>>>
>>>>
>>>> On Tue, Feb 16, 2021 at 10:13 AM Michael Lewis <mlewis(at)entrata(dot)com>
>>>> wrote:
>>>>
>>>>> Sort Method:
>>>>>> external merge Disk: 30760kB
>>>>>> Worker 0: Sort
>>>>>> Method: external merge Disk: 30760kB
>>>>>> Worker 1: Sort
>>>>>> Method: external merge Disk: 30760kB
>>>>>>
>>>>>
>>>>> If you can increase work_mem, even setting it temporarily higher for
>>>>> the session or transaction, that may dramatically change the plan.
>>>>>
>>>> I will try increasing work_mem for the session later today.
>>>>
>>>>> The advice given by Justin particularly about row estimates would be
>>>>> wise to pursue.
>>>>>
>>>>
>>>>
>>>>> I'd wonder how selective that condition of score_name =
>>>>> 'student_performance_index' is in filtering out many of the 9.3
>>>>> million tuples in that table and if an index with that as the leading
>>>>> column, or just an index on that column would be helpful.
>>>>>
>>>> There are 1,206,355 rows where score_name = 'student_performance_index'.
>>>>
>>>>> You'd need to look at pg_stats for the table and see how many distinct
>>>>> values, and if student_performance_index is relatively high or low (or not
>>>>> present) in the MCVs list.
>>>>>
>>>> I will look into that.
>>>>
>>>>
>>>>> I am not sure if your query does what you want it to do as I admit I
>>>>> didn't follow your explanation of the desired behavior. My hunch is that
>>>>> you want to make use of a window function and get rid of one of the CTEs.
>>>>>
>>>> If you could tell me what part(s) are unclear I would appreciate it so
>>>> that I can write a better comment.
>>>>
>>>> Thank you sooo much for all the feedback. It is *greatly* appreciated!
>>>> Sincerely,
>>>>
>>>> Dane
>>>>
>>>>
>
> --
> Regards,
> Yo.
>
Again, thanks for joining the conversation. I look forward to hearing from
you.

Sincerely,

Dane

Browse pgsql-performance by date

  From Date Subject
Next Message Yoan SULTAN 2021-02-18 04:33:11 Re: Slow query and wrong row estimates for CTE
Previous Message Yoan SULTAN 2021-02-17 18:37:29 Re: Slow query and wrong row estimates for CTE