Re: Slow query and wrong row estimates for CTE

From: Yoan SULTAN <yeyeyo(at)gmail(dot)com>
To: Dane Foster <studdugie(at)gmail(dot)com>
Cc: psql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Slow query and wrong row estimates for CTE
Date: 2021-02-17 18:37:29
Message-ID: CAPJFe0ZqhsYPc=vdr3uKOUUT31oitFoChmsxgJOnJoYdyCUA_w@mail.gmail.com
Lists: pgsql-performance

Hi all,

This is my first post on this mailing list, I really enjoy it.
I wanted to add some details and answers to this discussion.

On Feb 17, 2021 at 17:52, Dane Foster <studdugie(at)gmail(dot)com> wrote:

>
> A small update (see below/inline).
>
>
> On Tue, Feb 16, 2021 at 2:11 PM Dane Foster <studdugie(at)gmail(dot)com> wrote:
>
>> Short conclusion:
>> Switching from CTEs to temporary tables and analyzing reduced the runtime
>> from 15 minutes to about 1.5 minutes.
>>

The attempt_scores table is pretty big and it is scanned three times; try to
rewrite the query to reduce the number of passes over this table. For example:

EXPLAIN (ANALYZE, BUFFERS)
WITH reports AS (
    SELECT student_id,
           assignment_id,
           max(score_value) FILTER (WHERE score_name = 'student_performance_index'),
           max(attempt_report_id) maxid,
           max(score_value) spi
    FROM attempt_scores
    GROUP BY student_id, assignment_id
    HAVING max(score_value) > 0
       AND max(score_value) FILTER (WHERE score_name = 'student_performance_index') = max(score_value)
)
SELECT avg(spi) spi,
       avg(CASE score_name WHEN 'digital_clinical_experience' THEN score_value END) dce,
       avg(CASE score_name WHEN 'tier1_subjective_data_collection' THEN score_value END) sdc
FROM attempt_scores
JOIN reports ON reports.maxid = attempt_scores.attempt_report_id;

Also, I would keep increasing work_mem (to 200MB, say) until the external
merge is no longer required:

SET work_mem = '200MB';  -- changes the setting only for the current session
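
To confirm the setting is large enough, re-run the plan after raising work_mem and check the sort method in the EXPLAIN output (a session-level change as above is assumed; the query shown is a placeholder for the one being tuned):

```sql
SET work_mem = '200MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;  -- the query being tuned

-- In the plan output, look for an in-memory sort:
--   Sort Method: quicksort  Memory: ...kB
-- instead of the spill to disk seen earlier in the thread:
--   Sort Method: external merge  Disk: 30760kB

RESET work_mem;  -- return to the server default when done
```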

>
>> Longer conclusion:
>>
>> @Justin Pryzby <pryzby(at)telsasoft(dot)com>
>>
>> - I experimented w/ materializing the CTEs and it helped at the
>> margins but did not significantly contribute to a reduction in runtime.
>> - No clustering was required because once I switched to temporary
>> tables the new plan no longer used the for_upsert index.
>>
>> @Michael Lewis <mlewis(at)entrata(dot)com>
>>
>> - Increasing work_mem to 100MB (up from 4MB) helped at the margins
>> (i.e., some 100's of millisecond improvement) but did not represent a
>> significant reduction in the runtime.
>> - It wasn't obvious to me which window function would be appropriate
>> for the problem I was trying to solve therefore I didn't experiment w/ that
>> approach.
>>
>> I want to update/correct this statement:
>
>>
>> - The selectivity of score_name='student_performance_index' was not
>> enough for the planner to choose an index over doing a FTS.
>>
>> I added a partial index (WHERE
> score_name='student_performance_index'::citext) and that had a *dramatic*
> impact. That part of the query went from ~12 seconds to ~1 second.
>
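
For reference, a partial index like the one described would look something like
this (the table and column names come from the thread, but the index name and
the choice of indexed column are illustrative, since the exact definition was
not posted):

```sql
-- Only rows matching the WHERE clause are stored in the index, so it
-- stays small (~1.2M of 9.3M rows here) and the planner can use it
-- whenever the query repeats the same predicate.
CREATE INDEX attempt_scores_spi_idx
    ON attempt_scores (attempt_report_id)
    WHERE score_name = 'student_performance_index'::citext;
```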

Another way to generate perf. gains on this query is a hash index on
score_name, since that column doesn't seem to have a high cardinality:

CREATE INDEX ON attempt_scores USING HASH (score_name);

Note that a hash index only supports equality lookups (score_name = '...'),
which is exactly the kind of predicate used here.

>> Finally, thank you both for helping me bring this poor performing query to
>> heel. Your insights were helpful and greatly appreciated.
>>
>> Sincerely,
>>
>> Dane
>>
>>
>> On Tue, Feb 16, 2021 at 10:25 AM Dane Foster <studdugie(at)gmail(dot)com> wrote:
>>
>>>
>>> On Tue, Feb 16, 2021 at 10:13 AM Michael Lewis <mlewis(at)entrata(dot)com>
>>> wrote:
>>>
>>>> Sort Method:
>>>>> external merge Disk: 30760kB
>>>>> Worker 0: Sort
>>>>> Method: external merge Disk: 30760kB
>>>>> Worker 1: Sort
>>>>> Method: external merge Disk: 30760kB
>>>>>
>>>>
>>>> If you can increase work_mem, even setting it temporarily higher for
>>>> the session or transaction, that may dramatically change the plan.
>>>>
>>> I will try increasing work_mem for the session later today.
>>>
>>>> The advice given by Justin particularly about row estimates would be
>>>> wise to pursue.
>>>>
>>>
>>>
>>>> I'd wonder how selective that condition of score_name =
>>>> 'student_performance_index' is in filtering out many of the 9.3
>>>> million tuples in that table and if an index with that as the leading
>>>> column, or just an index on that column would be helpful.
>>>>
>>> There are 1,206,355 rows where score_name='student_performance_index'.
>>>
>>>> You'd need to look at pg_stats for the table and see how many distinct
>>>> values, and if student_performance_index is relatively high or low (or not
>>>> present) in the MCVs list.
>>>>
>>> I will look into that.
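
The column statistics Michael mentions can be read directly from pg_stats
(the numbers are planner estimates based on the last ANALYZE, not exact
counts):

```sql
-- n_distinct: estimated number of distinct score_name values
-- most_common_vals / most_common_freqs: the MCV list and the estimated
-- fraction of the table each value accounts for
SELECT n_distinct,
       most_common_vals,
       most_common_freqs
FROM pg_stats
WHERE tablename = 'attempt_scores'
  AND attname   = 'score_name';
```

If 'student_performance_index' appears in most_common_vals with a large
frequency, the planner will correctly see the predicate as unselective; if it
is absent, the estimate falls back on the remaining distinct values.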
>>>
>>>
>>>> I am not sure if your query does what you want it to do as I admit I
>>>> didn't follow your explanation of the desired behavior. My hunch is that
>>>> you want to make use of a window function and get rid of one of the CTEs.
>>>>
>>> If you could tell me what part(s) are unclear I would appreciate it so
>>> that I can write a better comment.
>>>
>>> Thank you sooo much for all the feedback. It is *greatly* appreciated!
>>> Sincerely,
>>>
>>> Dane
>>>
>>>

--
Regards,
Yo.
