From: | Jeff Frost <jeff(at)frostconsultingllc(dot)com> |
---|---|
To: | pgsql-sql(at)postgresql(dot)org |
Subject: | Re: separating improperly grouped page views |
Date: | 2007-07-05 17:20:06 |
Message-ID: | Pine.LNX.4.64.0707051007410.31414@discord.home.frostconsultingllc.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-sql |
On Sun, 17 Jun 2007, Jeff Frost wrote:
> On Mon, 18 Jun 2007, Tom Lane wrote:
>
>> Jeff Frost <jeff(at)frostconsultingllc(dot)com> writes:
>>> [ umpteen million iterations of: ]
>>> -> Limit (cost=0.00..367.09 rows=1 width=8)
>>> -> Index Scan Backward using page_view_stamp_idx on
>>> page_view pv2 (cost=0.00..158215.86 rows=431 width=8)
>>> Index Cond: (stamp < $1)
>>> Filter: ((stamp IS NOT NULL) AND (visit_id =
>>> $0))
>>
>> Perhaps an index on (visit_id, stamp) would help. This one is doing the
>> best it can, but if the visit_id's you want are thinly scattered, it'll
>> still suck...
>
> Good idea Tom! In fact the planner seems to like that much better:
>
> Seq Scan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239)
> vs
> Seq Scan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237)
Now that this is working in a reasonable amount of time to process the entire
data set, I need to work out how to process the new information that comes in
every so often and still assign it a correct visit_id until the fix for this
can get through QA. The cleanup query looks like this:
CREATE TEMP TABLE tmpmaxpvid AS SELECT MAX(id) AS id FROM
reporting.page_view_clean;
INSERT INTO reporting.page_view_clean
SELECT id
,
CASE
WHEN stamp - last_stamp > INTERVAL '1 hour'
OR last_stamp IS NULL THEN nextval('reporting.tracking_cleanup_seq')
ELSE currval('reporting.tracking_cleanup_seq')
END AS visit_id
, uri
, params
, stamp
, visit_id AS old_visit_id
FROM (
SELECT id
, visit_id
, uri
, params
, stamp
, (
SELECT MAX(pv2.stamp)
FROM page_view pv2
WHERE pv2.visit_id = pv1.visit_id
AND pv2.stamp < pv1.stamp
) AS last_stamp
FROM page_view pv1
WHERE pv1.stamp < now() - INTERVAL '1 hour'
AND pv1.id > (
SELECT CASE
WHEN id IS NULL THEN 0
ELSE id
END AS id
FROM tmpmaxpvid
)
ORDER BY pv1.visit_id
, pv1.stamp
) x;
The problem is the page_views that straddle the time before and after the
query get an artificially different visit_id.
This case statement is what kills me:
CASE
WHEN stamp - last_stamp > INTERVAL '1 hour'
OR last_stamp IS NULL THEN nextval('reporting.tracking_cleanup_seq')
ELSE currval('reporting.tracking_cleanup_seq')
END AS visit_id
If I change it to this:
CASE
WHEN last_stamp IS NULL
THEN visit_id
WHEN stamp - last_stamp > INTERVAL '1 hour'
THEN nextval('reporting.tracking_cleanup_seq')
ELSE currval('reporting.tracking_cleanup_seq')
END AS visit_id
Then it gives the first page_view in the sequence the correct visit_id, but
then continues on with a different visit_id for the rest.
I've also tried changing how I select which data to act on like so:
WHERE pv1.id > (
SELECT CASE
WHEN id IS NULL THEN 0
ELSE id
END AS id
FROM tmpmaxpvid
)
AND pv1.visit_id IN (
SELECT visit_id
FROM page_view pv3
GROUP BY visit_id
HAVING max(stamp) < now() - INTERVAL '1 hour'
)
ORDER BY pv1.visit_id
, pv1.stamp
) x;
But that leaves me skipping some page views because they haven't reached their
conclusion and because of the id > clause, I'll never go back to see them.
Anyone have any suggestions?
--
Jeff Frost, Owner <jeff(at)frostconsultingllc(dot)com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
From | Date | Subject | |
---|---|---|---|
Next Message | Ranieri Mazili | 2007-07-05 19:00:40 | Working with dates |
Previous Message | Michael Holzman | 2007-07-05 08:36:42 | Re: 'comp.databases.theory' |