Re: Query performance

From: Joe Van Dyk <joe(at)tanga(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query performance
Date: 2015-01-25 07:20:59
Message-ID: CACfv+pJAMmk7kTFnCU6gVCZY4Q5Fz71T+o_CYw7Zha4v-pSO+w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Sat, Jan 24, 2015 at 11:14 PM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
wrote:

>
>
> 2015-01-25 7:38 GMT+01:00 Joe Van Dyk <joe(at)tanga(dot)com>:
>
>>
>>
>> On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
>> wrote:
>>
>>> Hi
>>>
>>> this plan looks well
>>>
>>> Regards
>>>
>>> Pavel
>>>
>>
>> Here's one that's not quite as well: http://explain.depesz.com/s/SgT
>>
>
> I see a possible issue
>
> (product_id <> '81716'::citext) .. this operation is CPU expensive and
> maybe nonsense
>
> product_id should be integer -- and if it isn't - it should not be on 4M
> rows extremly fast - mainly on citext
>
> try to force a opposite cast - you will safe a case insensitive text
> comparation
>
> product_id::int <> 81716
>

It might not always be an integer, just happens to be so here. Should I try
text instead? I don't have to have the case-insensitive matching.

Joe

>
> Regards
>
> Pavel
>
>
>
>
>>
>> Joe
>>
>>
>>>
>>> 2015-01-25 6:45 GMT+01:00 Joe Van Dyk <joe(at)tanga(dot)com>:
>>>
>>>> Oops, didn't run vacuum analyze after deleting the events. Here is
>>>> another 'explain analyze': http://explain.depesz.com/s/AviN
>>>>
>>>> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk <joe(at)tanga(dot)com> wrote:
>>>>
>>>>> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk <joe(at)tanga(dot)com> wrote:
>>>>>
>>>>>> I have an events table that records page views and purchases (type =
>>>>>> 'viewed' or type='purchased'). I have a query that figures out "people who
>>>>>> bought/viewed this also bought/viewed that".
>>>>>>
>>>>>> It worked fine, taking about 0.1 seconds to complete, until a few
>>>>>> hours ago when it started taking hours to complete. Vacuum/analyze didn't
>>>>>> help. Turned out there was one session_id that had 400k rows in the
>>>>>> system. Deleting that made the query performant again.
>>>>>>
>>>>>> Is there anything I can do to make the query work better in cases
>>>>>> like that? Missing index, or better query?
>>>>>>
>>>>>> This is on 9.3.5.
>>>>>>
>>>>>> The below is reproduced at the following URL if it's not formatted
>>>>>> correctly in the email.
>>>>>> https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
>>>>>>
>>>>>> explain select
>>>>>> e1.product_id,
>>>>>> e2.site_id,
>>>>>> e2.product_id,
>>>>>> count(nullif(e2.type='viewed', false)) view_count,
>>>>>> count(nullif(e2.type='purchased', false)) purchase_count
>>>>>> from events e1
>>>>>> join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
>>>>>> where
>>>>>> e1.product_id = '82503' and
>>>>>> e1.product_id != e2.product_id
>>>>>> group by e1.product_id, e2.product_id, e2.site_id;
>>>>>> QUERY PLAN
>>>>>> ----------------------------------------------------------------------------------------------------------------------------
>>>>>> GroupAggregate (cost=828395.67..945838.90 rows=22110 width=19)
>>>>>> -> Sort (cost=828395.67..840117.89 rows=4688885 width=19)
>>>>>> Sort Key: e1.product_id, e2.product_id, e2.site_id
>>>>>> -> Nested Loop (cost=11.85..20371.14 rows=4688885 width=19)
>>>>>> -> Bitmap Heap Scan on events e1 (cost=11.29..1404.31 rows=369 width=49)
>>>>>> Recheck Cond: (product_id = '82503'::citext)
>>>>>> -> Bitmap Index Scan on events_product_id_site_id_idx (cost=0.00..11.20 rows=369 width=0)
>>>>>> Index Cond: (product_id = '82503'::citext)
>>>>>> -> Index Scan using events_session_id_type_product_id_idx on events e2 (cost=0.56..51.28 rows=12 width=51)
>>>>>> Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
>>>>>> Filter: (e1.product_id <> product_id)
>>>>>> (11 rows)
>>>>>>
>>>>>> recommender_production=> \d events
>>>>>> Table "public.events"
>>>>>> Column | Type | Modifiers
>>>>>> -------------+--------------------------+-----------------------------------------------------
>>>>>> id | bigint | not null default nextval('events_id_seq'::regclass)
>>>>>> user_id | citext |
>>>>>> session_id | citext | not null
>>>>>> product_id | citext | not null
>>>>>> site_id | citext | not null
>>>>>> type | text | not null
>>>>>> happened_at | timestamp with time zone | not null
>>>>>> created_at | timestamp with time zone | not null
>>>>>> Indexes:
>>>>>> "events_pkey" PRIMARY KEY, btree (id)
>>>>>> "events_product_id_site_id_idx" btree (product_id, site_id)
>>>>>> "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
>>>>>> Check constraints:
>>>>>> "events_session_id_check" CHECK (length(session_id::text) < 255)
>>>>>> "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
>>>>>> "events_user_id_check" CHECK (length(user_id::text) < 255)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> After removing the session with 400k events, I was able to do an
>>>>> explain analyze, here is one of them:
>>>>> http://explain.depesz.com/s/PFNk
>>>>>
>>>>
>>>>
>>>
>>
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Pavel Stehule 2015-01-25 08:03:27 Re: Query performance
Previous Message Pavel Stehule 2015-01-25 07:14:08 Re: Query performance