Re: Slow query and wrong row estimates for CTE

From: Yoan SULTAN <yeyeyo(at)gmail(dot)com>
To: Dane Foster <studdugie(at)gmail(dot)com>
Cc: psql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Slow query and wrong row estimates for CTE
Date: 2021-02-18 04:33:11
Message-ID: CAPJFe0Z5cm=Z=+ZJmTpvWvpfQf=6+DW=3WrkwWuARcVRJ5au7Q@mail.gmail.com
Lists: pgsql-performance

*You are totally right, the max(score_value) FILTER (WHERE score_name =
'student_performance_index') in the SELECT clause is redundant.*

On Wed, Feb 17, 2021 at 9:33 PM Dane Foster <studdugie(at)gmail(dot)com> wrote:

> On Wed, Feb 17, 2021 at 1:37 PM Yoan SULTAN <yeyeyo(at)gmail(dot)com> wrote:
>
>> *Hi all, *
>>
>> *This is my first post on this mailing list, I really enjoy it.*
>> *I wanted to add some details and answers to this discussion.*
>>
> I'm happy you've decided to join the conversation and about the fact that
> you've opened up an entirely new avenue for me to investigate and learn
> from. I feel like I'm about to level up my SQL-fu! 😊
>
>
>> On Wed, Feb 17, 2021 at 5:52 PM, Dane Foster <studdugie(at)gmail(dot)com> wrote:
>>
>>>
>>> A small update (see below/inline).
>>>
>>>
>>> On Tue, Feb 16, 2021 at 2:11 PM Dane Foster <studdugie(at)gmail(dot)com> wrote:
>>>
>>>> Short conclusion:
>>>> Switching from CTEs to temporary tables and analyzing reduced the
>>>> runtime from 15 minutes to about 1.5 minutes.
>>>>
>>>> *The attempt_scores table is pretty big and it is referenced 3 times; try
>> to rewrite the query to reduce the number of references to it. For example:*
>>
>> *EXPLAIN (ANALYZE, BUFFERS)
>> WITH reports AS (
>>   SELECT student_id,
>>          assignment_id,
>>          max(score_value) FILTER (WHERE score_name = 'student_performance_index'),
>>          max(attempt_report_id) maxid,
>>          max(score_value) spi
>>   FROM attempt_scores
>>   GROUP BY student_id, assignment_id
>>   HAVING max(score_value) > 0
>>      AND max(score_value) FILTER (WHERE score_name = 'student_performance_index') = max(score_value)
>> )
>> SELECT avg(spi) spi,
>>        avg(CASE score_name WHEN 'digital_clinical_experience' THEN score_value END) dce,
>>        avg(CASE score_name WHEN 'tier1_subjective_data_collection' THEN score_value END) sdc
>> FROM attempt_scores
>> JOIN reports ON reports.maxid = attempt_scores.attempt_report_id;*
>>
> Given: HAVING max(score_value) > 0 AND max(score_value) FILTER (WHERE
> score_name = 'student_performance_index') = max(score_value)
>
> Why: max(score_value) FILTER (WHERE score_name =
> 'student_performance_index') but no FILTER clause on:
> max(attempt_report_id)?
>

> Some context for my question. I'm new to aggregate expressions therefore I
> don't have a strong mental model for what's happening. So let me tell you
> what I *think* is happening and you can correct me.
>
> The new HAVING clause that you've added ensures that for each
> student/assignment pair/group that we are selecting the max spi value
> (i.e., score_name = 'student_performance_index'). Therefore, isn't the
> FILTER clause in the SELECT section redundant? And if it's *not*
> redundant then why isn't it necessary for: max(attempt_report_id)?
>

>
> *Also, I would keep increasing work_mem, up to 200MB, until the external
>> merge sort is no longer required.*
>> *SET work_mem = '200MB'; -- changes it at the session level only*
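As a sketch, the setting can also be scoped to a single transaction with SET LOCAL, so it reverts automatically and doesn't leak into the rest of the session (the SELECT here is a placeholder for the query under test):

```sql
BEGIN;
SET LOCAL work_mem = '200MB';  -- reverts automatically at COMMIT/ROLLBACK
-- Re-run the plan and check that "external merge" has become "quicksort"
EXPLAIN (ANALYZE, BUFFERS) SELECT 1;  -- placeholder for the real query
COMMIT;
```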
>>
>>>
>>>> Longer conclusion:
>>>>
>>>> @Justin Pryzby <pryzby(at)telsasoft(dot)com>
>>>>
>>>> - I experimented w/ materializing the CTEs and it helped at the
>>>> margins but did not significantly contribute to a reduction in runtime.
>>>> - No clustering was required because once I switched to temporary
>>>> tables the new plan no longer used the for_upsert index.
>>>>
>>>> @Michael Lewis <mlewis(at)entrata(dot)com>
>>>>
>>>> - Increasing work_mem to 100MB (up from 4MB) helped at the margins
>>>> (i.e., some 100's of millisecond improvement) but did not represent a
>>>> significant reduction in the runtime.
>>>> - It wasn't obvious to me which window function would be
>>>> appropriate for the problem I was trying to solve therefore I didn't
>>>> experiment w/ that approach.
>>>>
>>>> I want to update/correct this statement:
>>>
>>>>
>>>> - The selectivity of score_name='student_performance_index' was not
>>>> enough for the planner to choose an index over doing a FTS.
>>>>
>>>> I added a partial index (WHERE
>>> score_name='student_performance_index'::citext) and that had a
>>> *dramatic* impact. That part of the query went from ~12 seconds to ~1
>>> second.
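For reference, the partial index described above might look roughly like this (the index name and column list are illustrative guesses based on the query's GROUP BY, not the actual DDL used):

```sql
-- Only rows matching the WHERE predicate are indexed, so the index stays
-- small and the planner can use it whenever the query repeats that predicate.
CREATE INDEX attempt_scores_spi_idx
    ON attempt_scores (student_id, assignment_id)
    WHERE score_name = 'student_performance_index'::citext;
```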
>>>
>>
>> *Another way to gain performance on this query: CREATE INDEX ON
>> attempt_scores USING hash (score_name); -- since score_name doesn't seem to
>> have high cardinality*
>>
>> Finally, thank you both for helping me bring this poor performing query
>>>> to heel. Your insights were helpful and greatly appreciated.
>>>>
>>>> Sincerely,
>>>>
>>>> Dane
>>>>
>>>>
>>>> On Tue, Feb 16, 2021 at 10:25 AM Dane Foster <studdugie(at)gmail(dot)com>
>>>> wrote:
>>>>
>>>>>
>>>>> On Tue, Feb 16, 2021 at 10:13 AM Michael Lewis <mlewis(at)entrata(dot)com>
>>>>> wrote:
>>>>>
>>>>>>> Sort Method: external merge  Disk: 30760kB
>>>>>>> Worker 0:  Sort Method: external merge  Disk: 30760kB
>>>>>>> Worker 1:  Sort Method: external merge  Disk: 30760kB
>>>>>>>
>>>>>>
>>>>>> If you can increase work_mem, even setting it temporarily higher for
>>>>>> the session or transaction, that may dramatically change the plan.
>>>>>>
>>>>> I will try increasing work_mem for the session later today.
>>>>>
>>>>>> The advice given by Justin particularly about row estimates would be
>>>>>> wise to pursue.
>>>>>>
>>>>>
>>>>>
>>>>>> I'd wonder how selective that condition of score_name =
>>>>>> 'student_performance_index' is in filtering out many of the 9.3
>>>>>> million tuples in that table and if an index with that as the leading
>>>>>> column, or just an index on that column would be helpful.
>>>>>>
>>>>> There are 1,206,355 rows where score_name = 'student_performance_index'.
>>>>>
>>>>>> You'd need to look at pg_stats for the table and see how many
>>>>>> distinct values, and if student_performance_index is relatively high or low
>>>>>> (or not present) in the MCVs list.
>>>>>>
>>>>> I will look into that.
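A quick way to check is to query the standard pg_stats view directly (a sketch; n_distinct and the MCV columns are the ones of interest here):

```sql
-- n_distinct shows the estimated number of distinct score_name values;
-- most_common_vals/most_common_freqs show whether
-- 'student_performance_index' appears in the MCV list and how frequent it is.
SELECT n_distinct,
       most_common_vals,
       most_common_freqs
FROM pg_stats
WHERE tablename = 'attempt_scores'
  AND attname   = 'score_name';
```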
>>>>>
>>>>>
>>>>>> I am not sure if your query does what you want it to do as I admit I
>>>>>> didn't follow your explanation of the desired behavior. My hunch is that
>>>>>> you want to make use of a window function and get rid of one of the CTEs.
>>>>>>
>>>>> If you could tell me what part(s) are unclear I would appreciate it so
>>>>> that I can write a better comment.
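One possible window-function shape for this (an untested sketch; it assumes the goal is to keep only the top-scoring student_performance_index row per student/assignment pair, which would replace the max()/FILTER CTE):

```sql
WITH ranked AS (
  SELECT attempt_report_id,
         score_value,
         row_number() OVER (PARTITION BY student_id, assignment_id
                            ORDER BY score_value DESC) AS rn
  FROM attempt_scores
  WHERE score_name = 'student_performance_index'
)
SELECT avg(score_value) AS spi
FROM ranked
WHERE rn = 1;  -- keep only the top-scoring attempt per student/assignment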
>>>>>
>>>>> Thank you sooo much for all the feedback. It is *greatly* appreciated!
>>>>> Sincerely,
>>>>>
>>>>> Dane
>>>>>
>>>>>
>>
>> --
>> Regards,
>> Yo.
>>
> Again, thanks for joining the conversation. I look forward to hearing from
> you.
>
> Sincerely,
>
> Dane
>

--
Regards,
Yo.
