From:       Sean Davis <sdavis2(at)mail(dot)nih(dot)gov>
To:         Dracula 007 <dracula007(at)atlas(dot)cz>
Cc:         PostgreSQL SQL <pgsql-sql(at)postgresql(dot)org>
Subject:    Re: how to speed up these queries ?
Date:       2005-03-03 13:08:42
Message-ID: 2d042e6545ef2506cdced66867172c57@mail.nih.gov
Lists:      pgsql-sql
On Mar 3, 2005, at 6:05 AM, Dracula 007 wrote:
> Hello,
>
> I have two "large" tables - "sessions" (about 1,500,000 rows) and
> "actions" (about 4,000,000 rows). The "actions" table references
> "sessions" via a foreign key. The simplified structure of these
> tables is
>
> sessions (
> session_id int4,
> visitor_id int4,
> session_ip inet,
> session_date timestamp
> )
>
> actions (
> action_id int4,
> session_id int4, -- foreign key, references sessions(session_id)
> action_date timestamp,
> action_year int2,
> action_month int2,
> action_day int2
> )
>
> I run SQL queries like
>
> SELECT
> COUNT(actions.session_id) AS sessions_count,
> COUNT(DISTINCT visitor_id) AS visitors_count,
> COUNT(DISTINCT session_ip) AS ips_count
> FROM actions LEFT JOIN sessions USING (session_id)
> GROUP BY action_year, action_month, action_day
>
> but it's really really slow. I've tried to use different indexes on
> different columns, but no matter what I've tried I can't get it
> faster. The explain analyze of the query is
>
> --------------------------------------------------
> Aggregate  (cost=347276.05..347276.05 rows=1 width=23)
>   (actual time=210060.349..210060.350 rows=1 loops=1)
>   ->  Hash Left Join  (cost=59337.55..305075.27 rows=4220077 width=23)
>         (actual time=24202.338..119924.254 rows=4220077 loops=1)
>         Hash Cond: ("outer".session_id = "inner".session_id)
>         ->  Seq Scan on actions  (cost=0.00..114714.77 rows=4220077 width=8)
>               (actual time=7539.653..44585.023 rows=4220077 loops=1)
>         ->  Hash  (cost=47650.64..47650.64 rows=1484764 width=19)
>               (actual time=16628.790..16628.790 rows=0 loops=1)
>               ->  Seq Scan on sessions  (cost=0.00..47650.64 rows=1484764 width=19)
>                     (actual time=0.041..13378.667 rows=1484764 loops=1)
It looks like you are always going to get sequential scans on these
tables, since the query touches every row in both of them. How often do
you run the query relative to the load on the database? If you run it
often, could you keep a table of counts, something like:
create table summarize_use (
    action_date    date,
    sessions_count int,
    visitors_count int,
    ips_count      int
);
and then use triggers on the sessions and actions tables to increment
the various counts in summarize_use based on the action_date and
session_date date parts? The summarize_use table would then look like:
action_date   sessions_count   visitors_count   ips_count
-----------   --------------   --------------   ---------
3-2-2005                  15               12          12
Just a thought, and of course you would pay the price of triggers with
each insert to the sessions or actions table.
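To make that concrete, here is a minimal, untested sketch of such a
trigger in PL/pgSQL. The function and trigger names are made up for
illustration, and it only maintains sessions_count; the DISTINCT
visitor and IP counts can't be kept exact by simple increments and
would need extra per-day bookkeeping:

-- Hypothetical sketch: bump_summarize_use and actions_summarize_use
-- are illustrative names. Only sessions_count is maintained here.
CREATE OR REPLACE FUNCTION bump_summarize_use() RETURNS trigger AS $$
BEGIN
    -- Try to bump the row for this action's day...
    UPDATE summarize_use
       SET sessions_count = sessions_count + 1
     WHERE action_date = NEW.action_date::date;
    -- ...and create the row if this is the first action of the day.
    IF NOT FOUND THEN
        INSERT INTO summarize_use
            (action_date, sessions_count, visitors_count, ips_count)
        VALUES (NEW.action_date::date, 1, 0, 0);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER actions_summarize_use
    AFTER INSERT ON actions
    FOR EACH ROW EXECUTE PROCEDURE bump_summarize_use();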
If you don't like that, and you run this, say, every night at midnight,
you could set up a cron job that runs the query and selects the result
into a table for direct querying; call this table summarize_use again.
Then, to get the full result, you would simply do something like:
select * from summarize_use
union all
SELECT
    '2005-03-03'::date AS action_date,
    COUNT(actions.session_id) AS sessions_count,
    COUNT(DISTINCT visitor_id) AS visitors_count,
    COUNT(DISTINCT session_ip) AS ips_count
FROM actions LEFT JOIN sessions USING (session_id)
WHERE
    action_year = 2005 AND
    action_month = 3 AND
    action_day = 3;
This would then be fully up-to-date and would use indices on
action_year, action_month, action_day. DISCLAIMER--All of this is
untested....
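For the nightly job itself, a rough, equally untested sketch might look
like the following; the file path and database name are placeholders:

-- refresh_summarize_use.sql (hypothetical file): roll up yesterday's
-- actions into the summary table.
INSERT INTO summarize_use (action_date, sessions_count,
                           visitors_count, ips_count)
SELECT
    action_date::date,
    COUNT(actions.session_id),
    COUNT(DISTINCT visitor_id),
    COUNT(DISTINCT session_ip)
FROM actions LEFT JOIN sessions USING (session_id)
WHERE action_date::date = current_date - 1
GROUP BY action_date::date;

-- crontab entry (placeholder database name and path):
-- 0 0 * * * psql -d mydb -f /path/to/refresh_summarize_use.sql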
Sean.