Re: Weird CASE WHEN behaviour causing query to be suddenly very slow

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Kevin Viraud <kevin(dot)viraud(at)rocket-internet(dot)de>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Weird CASE WHEN behaviour causing query to be suddenly very slow
Date: 2015-03-31 09:41:48
Message-ID: CAFj8pRDHVVCArELTxkobiok0oFaWLq26_rqgM_dBd1bJJQyqKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

2015-03-31 11:19 GMT+02:00 Kevin Viraud <kevin(dot)viraud(at)rocket-internet(dot)de>:

> Hi Pavel,
>
>
>
> Thanks for your answer.
>
>
>
> Yes sure, I could do that, but like I wrote the purpose is not to find a
> way to rewrite it. But to understand why at a certain point it is totally
> going off. I’m aware that the longer my case when will be the longest the
> query will run. But 10x slower for adding one condition, something feels
> wrong here.
>

It is slow due lot of expressions evaluation. It is CPU expensive.
PostgreSQL uses interpreted expression evaluation - and if you have lot of
expressions, then you have problem.

Regards

Pavel

>
>
> Plus, the case when is part of a function so basically I use it this way :
>
> SELECT col1, col2, get_channel(company_id, source_id, …)
>
> FROM mytable;
>
>
>
> Get_channel is coming from another app. And even though I have, I need to
> assume that I don’t have the control over this one and that I’m using it as
> if.
>
>
>
> This is only my debugging query.
>
>
>
> Best regards,
>
>
>
> Kevin
>
>
>
> *From:* Pavel Stehule [mailto:pavel(dot)stehule(at)gmail(dot)com]
> *Sent:* Dienstag, 31. März 2015 11:09
> *To:* Kevin Viraud
> *Cc:* pgsql-performance(at)postgresql(dot)org
> *Subject:* Re: [PERFORM] Weird CASE WHEN behaviour causing query to be
> suddenly very slow
>
>
>
> Hi
>
> long CASE can be problem. Why you don't use a dictionary table and join?
>
> Regards
>
> Pavel
>
>
>
> 2015-03-31 10:53 GMT+02:00 Kevin Viraud <kevin(dot)viraud(at)rocket-internet(dot)de>:
>
> Hi,
>
>
>
> I have an issue with a rather large CASE WHEN and I cannot figure out why
> it is so slow...
>
>
>
> First, here is my test query :
>
>
>
> SELECT CASE WHEN dwh_company_id = 1
>
>
> THEN CASE
>
>
>
>
> WHEN wv.source ~ '^$' THEN 'Not tracked'
>
>
>
> WHEN wv.source ~ '^1$' THEN 'Not tracked1'
>
>
> WHEN wv.source ~ '^2$' THEN 'Not tracked2'
>
>
> WHEN wv.source ~ '^3$' THEN 'Not tracked3'
>
>
> WHEN wv.source ~ '^4$' THEN 'Not tracked4'
>
>
> WHEN wv.source ~ '^5$' THEN 'Not tracked5'
>
>
> WHEN wv.source ~ '^6$' THEN 'Not tracked6'
>
>
> WHEN wv.source ~ '^7$' THEN 'Not tracked7'
>
>
> WHEN wv.source ~ '^8$' THEN 'Not tracked8'
>
>
> WHEN wv.source ~ '^9$' THEN 'Not tracked9'
>
>
> WHEN wv.source ~ '^10$' THEN 'Not tracked10'
>
>
> WHEN wv.source ~ '^11$' THEN 'Not tracked11'
>
>
> WHEN wv.source ~ '^12$' THEN 'Not tracked12'
>
>
> WHEN wv.source ~ '^13$' THEN 'Not tracked13'
>
>
> WHEN wv.source ~ '^14$' THEN 'Not tracked14'
>
>
> WHEN wv.source ~ '^15$' THEN 'Not tracked15'
>
>
> WHEN wv.source ~ '^16$' THEN 'Not tracked16'
>
>
> WHEN wv.source ~ '^17$' THEN 'Not tracked17'
>
>
> WHEN wv.source ~ '^18$' THEN 'Not tracked18'
>
>
> WHEN wv.source ~ '^19$' THEN 'Not tracked19'
>
>
> WHEN wv.source ~ '^20$' THEN 'Not tracked20'
>
>
> WHEN wv.source ~ '^21$' THEN 'Not tracked21'
>
>
> WHEN wv.source ~ '^22$' THEN 'Not tracked22'
>
>
> WHEN wv.source ~ '^23$' THEN 'Not tracked23'
>
>
> WHEN wv.source ~ '^24$' THEN 'Not tracked24'
>
>
> WHEN wv.source ~ '^25$' THEN 'Not tracked25'
>
>
> WHEN wv.source ~ '^26$' THEN 'Not tracked26'
>
>
> WHEN wv.source ~ '^27$' THEN 'Not tracked27'
>
>
> WHEN wv.source ~ '^28$' THEN 'Not tracked28'
>
>
> --WHEN wv.source ~ '^29$' THEN 'Not tracked29'
>
>
> WHEN wv.source ~ '^30$' THEN 'Not tracked30'
>
>
> WHEN wv.source ~ '^31$' THEN 'Not tracked31'
>
>
> WHEN wv.source ~ '^32$' THEN 'Not tracked32'
>
>
> END
>
> ELSE
>
> 'Others'
>
> END as channel
>
> FROM (
>
> SELECT wv.id,
>
> wv.ga_id,
>
> split_part(wv.ga_source_medium, ' /
> ', 1) as source,
>
> ga.dwh_source_id,
>
> s.dwh_company_id
>
> FROM marketing.web_visits wv
>
> INNER JOIN dwh_metadata.google_analytics ga
> ON ga.ga_id = wv.ga_id
>
> INNER JOIN dwh_manager.sources s ON
> ga.dwh_source_id =s.dwh_source_id
>
> --WHERE s.dwh_company_id = 1
>
> LIMIT 100000
>
> ) wv
>
>
>
>
>
> This is a pretty simple case, my subquery (or CTE when using WITH
> statement) should return 5 fields with more or less this structure :
>
> Id : character(32)
>
> Ga_id : bigint
>
> Source : character(32)
>
> Medium : character(32)
>
> dwh_company_id : bigint
>
>
>
> On top of which I apply a case when statement…
>
>
>
> Now the weird thing is, using this query I notice a significant drop in
> performance as the “case when” is getting bigger. If I run the query as if,
> I get the following exec plain and execution time:
>
> Subquery Scan on wv (cost=6.00..29098.17 rows=100000 width=36) (actual
> time=0.828..22476.917 rows=100000 loops=1)
>
> Buffers: shared hit=3136
>
> -> Limit (cost=6.00..11598.17 rows=100000 width=58) (actual
> time=0.209..133.429 rows=100000 loops=1)
>
> Buffers: shared hit=3136
>
> -> Hash Join (cost=6.00..1069811.24 rows=9228690 width=58)
> (actual time=0.208..119.297 rows=100000 loops=1)
>
> Hash Cond: (wv_1.ga_id = ga.ga_id)
>
> Buffers: shared hit=3136
>
> -> Seq Scan on web_visits wv_1 (cost=0.00..877005.78
> rows=20587078 width=50) (actual time=0.004..18.412 rows=100000 loops=1)
>
> Buffers: shared hit=3133
>
> -> Hash (cost=5.50..5.50 rows=40 width=12) (actual
> time=0.184..0.184 rows=111 loops=1)
>
> Buckets: 1024 Batches: 1 Memory Usage: 5kB
>
> Buffers: shared hit=3
>
> -> Hash Join (cost=1.88..5.50 rows=40 width=12)
> (actual time=0.056..0.148 rows=111 loops=1)
>
> Hash Cond: (ga.dwh_source_id = s.dwh_source_id)
>
> Buffers: shared hit=3
>
> -> Seq Scan on google_analytics ga
> (cost=0.00..2.89 rows=89 width=8) (actual time=0.007..0.028 rows=111
> loops=1)
>
> Buffers: shared hit=2
>
> -> Hash (cost=1.39..1.39 rows=39 width=8)
> (actual time=0.042..0.042 rows=56 loops=1)
>
> Buckets: 1024 Batches: 1 Memory Usage:
> 3kB
>
> Buffers: shared hit=1
>
> -> Seq Scan on sources s
> (cost=0.00..1.39 rows=39 width=8) (actual time=0.005..0.020 rows=56
> loops=1)
>
> Buffers: shared hit=1
>
> Planning time: 0.599 ms
>
> Execution time: 22486.216 ms
>
>
>
> Then try commenting out only one line in the case when and the query run
> 10x faster :
>
>
>
> Subquery Scan on wv (cost=6.00..28598.17 rows=100000 width=36) (actual
> time=0.839..2460.002 rows=100000 loops=1)
>
> Buffers: shared hit=3136
>
> -> Limit (cost=6.00..11598.17 rows=100000 width=58) (actual
> time=0.210..112.043 rows=100000 loops=1)
>
> Buffers: shared hit=3136
>
> -> Hash Join (cost=6.00..1069811.24 rows=9228690 width=58)
> (actual time=0.209..99.513 rows=100000 loops=1)
>
> Hash Cond: (wv_1.ga_id = ga.ga_id)
>
> Buffers: shared hit=3136
>
> -> Seq Scan on web_visits wv_1 (cost=0.00..877005.78
> rows=20587078 width=50) (actual time=0.004..14.048 rows=100000 loops=1)
>
> Buffers: shared hit=3133
>
> -> Hash (cost=5.50..5.50 rows=40 width=12) (actual
> time=0.184..0.184 rows=111 loops=1)
>
> Buckets: 1024 Batches: 1 Memory Usage: 5kB
>
> Buffers: shared hit=3
>
> -> Hash Join (cost=1.88..5.50 rows=40 width=12)
> (actual time=0.058..0.146 rows=111 loops=1)
>
> Hash Cond: (ga.dwh_source_id = s.dwh_source_id)
>
> Buffers: shared hit=3
>
> -> Seq Scan on google_analytics ga
> (cost=0.00..2.89 rows=89 width=8) (actual time=0.007..0.025 rows=111
> loops=1)
>
> Buffers: shared hit=2
>
> -> Hash (cost=1.39..1.39 rows=39 width=8)
> (actual time=0.042..0.042 rows=56 loops=1)
>
> Buckets: 1024 Batches: 1 Memory Usage:
> 3kB
>
> Buffers: shared hit=1
>
> -> Seq Scan on sources s
> (cost=0.00..1.39 rows=39 width=8) (actual time=0.006..0.021 rows=56
> loops=1)
>
> Buffers: shared hit=1
>
> Planning time: 0.583 ms
>
> Execution time: 2467.484 ms
>
>
>
> Why this drop in performance for only one (in this simple example)
> condition ? I do not really understand it. If I add more conditions to the
> query (let say 1 or 2) it is also getting slower. And it’s not a few ms, it
> is around 5 sec or so. (which is huge considering I only take in my example
> 1/500 of my data with LIMIT.
>
>
>
> Before we deviate from the problem I have (which is why the sudden drop of
> performance) let me clarify a few things about this query :
>
> - The purpose is not to rewrite it, with a join or whatever, the
> case when actually comes from a function which is auto-generated by another
> app we have
>
> - My example is pretty simple and regex expressions could be
> replaced by equals, the real case when query contains way more complicated
> regex
>
> - This is subset of my CASE WHEN, it is much bigger, I cut it at
> the “bottleneck” point for this post.
>
>
>
> Thanks a lot.
>
>
>
> Best Regards,
>
>
>
> Kevin
>
>
>
>
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2015-03-31 13:58:30 Re: Weird CASE WHEN behaviour causing query to be suddenly very slow
Previous Message Kevin Viraud 2015-03-31 09:19:18 Re: Weird CASE WHEN behaviour causing query to be suddenly very slow