Query performance

From: Joe Van Dyk <joe(at)tanga(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Query performance
Date: 2015-01-25 05:41:13
Message-ID: CACfv+pKyJWGcU9DjmL3QAkUcb2cUHb-0sqODmjFt-hC9=k2+Kg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have an events table that records page views and purchases (type =
'viewed' or type='purchased'). I have a query that figures out "people who
bought/viewed this also bought/viewed that".

It worked fine, taking about 0.1 seconds to complete, until a few hours ago
when it started taking hours to complete. Vacuum/analyze didn't help.
Turned out there was one session_id that had 400k rows in the system.
Deleting that made the query performant again.

Is there anything I can do to make the query work better in cases like
that? Missing index, or better query?

This is on 9.3.5.

The below is reproduced at the following URL if it's not formatted
correctly in the email.
https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

explain select
e1.product_id,
e2.site_id,
e2.product_id,
count(nullif(e2.type='viewed', false)) view_count,
count(nullif(e2.type='purchased', false)) purchase_count
from events e1
join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
where
e1.product_id = '82503' and
e1.product_id != e2.product_id
group by e1.product_id, e2.product_id, e2.site_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=828395.67..945838.90 rows=22110 width=19)
-> Sort (cost=828395.67..840117.89 rows=4688885 width=19)
Sort Key: e1.product_id, e2.product_id, e2.site_id
-> Nested Loop (cost=11.85..20371.14 rows=4688885 width=19)
-> Bitmap Heap Scan on events e1 (cost=11.29..1404.31
rows=369 width=49)
Recheck Cond: (product_id = '82503'::citext)
-> Bitmap Index Scan on
events_product_id_site_id_idx (cost=0.00..11.20 rows=369 width=0)
Index Cond: (product_id = '82503'::citext)
-> Index Scan using
events_session_id_type_product_id_idx on events e2 (cost=0.56..51.28
rows=12 width=51)
Index Cond: ((session_id = e1.session_id) AND
(type = e1.type))
Filter: (e1.product_id <> product_id)
(11 rows)

recommender_production=> \d events
Table "public.events"
Column | Type | Modifiers
-------------+--------------------------+-----------------------------------------------------
id | bigint | not null default
nextval('events_id_seq'::regclass)
user_id | citext |
session_id | citext | not null
product_id | citext | not null
site_id | citext | not null
type | text | not null
happened_at | timestamp with time zone | not null
created_at | timestamp with time zone | not null
Indexes:
"events_pkey" PRIMARY KEY, btree (id)
"events_product_id_site_id_idx" btree (product_id, site_id)
"events_session_id_type_product_id_idx" btree (session_id, type, product_id)
Check constraints:
"events_session_id_check" CHECK (length(session_id::text) < 255)
"events_type_check" CHECK (type = ANY (ARRAY['purchased'::text,
'viewed'::text]))
"events_user_id_check" CHECK (length(user_id::text) < 255)

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Joe Van Dyk 2015-01-25 05:43:17 Re: Query performance
Previous Message Tom Lane 2015-01-25 05:14:41 Re: How to tell ANALYZE to collect statistics from the whole table?