separating improperly grouped page views

From: Jeff Frost <jeff(at)frostconsultingllc(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: separating improperly grouped page views
Date: 2007-06-18 05:16:36
Message-ID: Pine.LNX.4.64.0706172157410.14506@discord.home.frostconsultingllc.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

It seems a user visit cookie's expiration has been improperly set and so the
tables which represent a user's clickstream through the website are grouped
improperly. I'd like to run a set-wise query to regroup them based on periods
of inactivity greather than 1 hour. So, what we want to do is adjust the
visit_id's in the page_view table to point to separate visits whenever the
user has been idle for more than 1 hour. The query I've come up with looks
like this:

CREATE TABLE visit_clean AS
SELECT id
,
CASE
WHEN stamp - last_stamp > INTERVAL '1 hour'
OR last_stamp IS NULL THEN nextval('tracking_cleanup_seq')
ELSE currval('tracking_cleanup_seq')
END AS visit_id
, visit_id AS old_visit_id
, uri
, params
, stamp
, cindex
, tindex
, method
, source_address
, server_name
FROM (
SELECT id
, visit_id
, uri
, params
, stamp
, cindex
, tindex
, method
, source_address
, (
SELECT max(pv2.stamp)
FROM page_view pv2
WHERE pv2.visit_id = pv1.visit_id
AND pv2.stamp < pv1.stamp
) AS last_stamp
, server_name
FROM page_view pv1
) x;

It works reasonably well on a small data set. But, the plan for this is
horrible on the real data.

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Seq Scan on page_view pv1 (cost=0.00..2634763281.70 rows=3588374 width=239)
SubPlan
-> Result (cost=367.09..367.10 rows=1 width=0)
InitPlan
-> Limit (cost=0.00..367.09 rows=1 width=8)
-> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431 width=8)
Index Cond: (stamp < $1)
Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
-> Result (cost=367.09..367.10 rows=1 width=0)
InitPlan
-> Limit (cost=0.00..367.09 rows=1 width=8)
-> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431 width=8)
Index Cond: (stamp < $1)
Filter: ((stamp IS NOT NULL) AND (visit_id = $0))

The page_view table is 829MB in size and has 3,590,185 rows.

I let the query run on my test server for about 26hrs without finishing, so I
can't provide explain analyze output yet.

Does anyone have a better method of separating this data out?

---
Jeff Frost, Owner <jeff(at)frostconsultingllc(dot)com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Tom Lane 2007-06-18 05:54:53 Re: separating improperly grouped page views
Previous Message Andrew Sullivan 2007-06-17 15:54:41 Re: joining a table whose name is stored in the primary record