From: | "Jerry Wintrode" <wintrojr(at)tripos(dot)com> |
---|---|
To: | <pgsql-sql(at)postgresql(dot)org> |
Subject: | Need another way to do this, my sql much to slow... |
Date: | 2003-11-20 14:26:18 |
Message-ID: | 4E676B0AAF74B443A18D7BC7AAB3CFFD1D44D6@s01-exch01.tripos.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-sql |
I am attempting to write a spam management add-on for my corporate
server, I have the database and the apps written but one part is WAY to
slow to be usefull.
The following view:
Column | Type | Modifiers
-----------------------+---------+-----------
env_sender_num | integer |
msg_from_claims_count | bigint |
is built on the follow table:
Column | Type | Modifiers
-----------------+-----------------------------+-----------
msg_sender_num | integer |
env_sender_num | integer |
msg_from_domain | character varying(255) |
env_from_domain | character varying(255) |
recorded_date | timestamp without time zone |
Indexes: record_of_claims_env_sender_num_idx btree (env_sender_num),
record_of_claims_msg_sender_num_idx btree (msg_sender_num)
With the following SQL:
SELECT record_of_claims.env_sender_num, count(DISTINCT
record_of_claims.msg_sender_num)
AS msg_from_claims_count
FROM record_of_claims
GROUP BY record_of_claims.env_sender_num;
A sample dataset follows:
msg_sender_num | env_sender_num | msg_from_domain |
env_from_domain | recorded_date
----------------+----------------+-----------------------+--------------
---------+---------------------
1 | 1 | yahoo.com | yahoo.com
| 2003-11-18 13:21:07
2 | 2 | mx128.optinvc13y.com |
mx128.optinvc13y.com | 2003-11-18 13:21:16
3 | 3 | yahoo.com | yahoo.com
| 2003-11-18 13:21:17
4 | 4 | yahoo.com | yahoo.com
| 2003-11-18 13:21:21
5 | 5 | biomarketgroup.com |
biomarketgroup.com | 2003-11-18 13:21:24
6 | 6 | sohu.com | sohu.com
| 2003-11-18 13:21:28
7 | 7 | lycos.com | lycos.com
| 2003-11-18 13:21:38
8 | 8 | mail.expressrx.info |
mail.expressrx.info | 2003-11-18 13:21:41
9 | 9 | approveddeals.com |
approveddeals.com | 2003-11-18 13:21:41
10 | 10 | conceptholidays.co.uk |
conceptholidays.co.uk | 2003-11-18 13:21:48
The msg_sender_num and env_sender_num come from another table of unique
names of senders. What I am attempting to do is see how many times
msg_sender_num 1 claims to be a different env_sender_num. So I have to
find all the entries in msg_sender_num equal to 1 and build a count of
the distinct numbers in env_sender_num. This number is then used later
to say that if a msg_sender_num claims to be more then 2
env_sender_num's then the sender is a spammer and gets added to a list.
Everything is working fine except the SQL above. It takes WAY to long to
process on a 500000+ record database. Hell it takes 12 seconds or so on
a 50000 record database. I have included the query plan to show that the
indexes are being used.
Query Plan:
Aggregate (cost=0.00..166.16 rows=264 width=8) (actual
time=0.98..7768.19 rows=62911 loops=1)
-> Group (cost=0.00..159.57 rows=2635 width=8) (actual
time=0.56..3179.14 rows=80466 loops=1)
-> Index Scan using record_of_claims_env_sender_num_idx on
record_of_claims (cost=0.00..152.99 rows=2635 width=8) (actual
time=0.55..2240.15 rows=80466 loops=1)
Total runtime: 7931.63 msec
Is there a better, read "Faster", way to achieve this?
Jerry Wintrode
Very Newbie Postgres User
From | Date | Subject | |
---|---|---|---|
Next Message | Christoph Haller | 2003-11-20 15:38:14 | Re: Need another way to do this, my sql much to slow... |
Previous Message | Erik Thiele | 2003-11-20 06:31:31 | current_date timezone documentation suggestion |