Efficient "filter query" with positive and/or negative conditions

From: Anders Steinlein <anders(at)steinlein(dot)no>
To: pgsql-general(at)postgresql(dot)org
Subject: Efficient "filter query" with positive and/or negative conditions
Date: 2010-03-15 15:14:29
Message-ID: 7DCE73D5-B4D6-4AAB-86F7-900E73254B4C@steinlein.no
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

We have a database with contacts within lists, which can have any
number of tags associated with them. I would like to segment the
contacts based on any number of tag-based conditions, all of which
must be fulfilled (i.e. an ALL rule/filter). For instance, I want to
extract all contacts which have the tags "customer" AND "newsletter"
but NOT the "invited" tag. However, segments need not contain
"positive" predicates, so a segment defined as only NOT "invited" tag
should include ALL contacts except those with the "invited" tag. I
have found a working query using CTE, but it's way too slow when tens
of thousands of contacts are involved. Suggestions for an alternative
query and/or optimizations to this one?

Server:
-------
PostgreSQL 8.4.2 on a shared server running Apache/PHP/MySQL/PostgreSQL
shared_buffers: 128MB
work_mem: 8MB
effective_cache_size: 256MB

Current query:
--------------
WITH segments_contacts_tags AS (
SELECT segmentid, tagname, tagtype, email
FROM segments_tags st
LEFT JOIN contacts_tags ct USING (tagname)
WHERE segmentid = 93
), positive_segment_matches AS (
SELECT segmentid, email, COUNT(email) AS tags
FROM segments_contacts_tags
WHERE tagtype = 1
GROUP BY 1, 2
), positive_predicates_count AS (
SELECT segmentid, COUNT(*) AS count
FROM segments_tags
WHERE tagtype = 1
GROUP BY 1
)
SELECT s.segmentid, cl.email
FROM segments s
INNER JOIN contacts_lists cl USING (lid)
LEFT JOIN positive_segment_matches psm USING (segmentid, email)
LEFT JOIN positive_predicates_count ppc USING (segmentid)
WHERE segmentid = 93
AND ((psm.email IS NULL AND ppc.count IS NULL) OR psm.tags =
ppc.count)
EXCEPT
SELECT segmentid, email
FROM segments_contacts_tags st
WHERE st.tagtype = 0;

Explain analyze:
----------------
HashSetOp Except (cost=1817.12..5822.48 rows=200 width=36) (actual
time=5827.885..6051.771 rows=64672 loops=1)
CTE segments_contacts_tags
-> Nested Loop Left Join (cost=0.00..1323.86 rows=21819
width=61) (actual time=0.143..696.589 rows=64672 loops=1)
-> Seq Scan on segments_tags st (cost=0.00..1.05 rows=1
width=29) (actual time=0.014..0.025 rows=1 loops=1)
Filter: (segmentid = 93)
-> Index Scan using contacts_tags_tagname on
contacts_tags ct (cost=0.00..1050.07 rows=21819 width=49) (actual
time=0.110..265.453 rows=64672 loops=1)
Index Cond: (st.tagname = ct.tagname)
CTE positive_segment_matches
-> HashAggregate (cost=491.75..491.88 rows=11 width=36)
(actual time=1733.143..1995.489 rows=64672 loops=1)
-> CTE Scan on segments_contacts_tags (cost=0.00..490.93
rows=109 width=36) (actual time=0.161..1245.836 rows=64672 loops=1)
Filter: (tagtype = 1)
CTE positive_predicates_count
-> HashAggregate (cost=1.07..1.12 rows=4 width=4) (actual
time=0.069..0.088 rows=4 loops=1)
-> Seq Scan on segments_tags (cost=0.00..1.05 rows=4
width=4) (actual time=0.015..0.032 rows=4 loops=1)
Filter: (tagtype = 1)
-> Append (cost=0.26..3681.56 rows=64811 width=36) (actual
time=2734.790..5363.556 rows=64672 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.26..3189.55
rows=64702 width=36) (actual time=2734.782..4923.849 rows=64672 loops=1)
-> Hash Left Join (cost=0.26..2542.53 rows=64702
width=36) (actual time=2734.773..4483.547 rows=64672 loops=1)
Hash Cond: ((s.segmentid = psm.segmentid) AND
((cl.email)::text = (psm.email)::text))
Filter: (((psm.email IS NULL) AND (ppc.count IS
NULL)) OR (psm.tags = ppc.count))
-> Nested Loop (cost=0.00..2052.95 rows=64702
width=44) (actual time=0.169..687.204 rows=64702 loops=1)
Join Filter: (s.lid = cl.lid)
-> Nested Loop Left Join
(cost=0.00..1.15 rows=1 width=16) (actual time=0.130..0.189 rows=1
loops=1)
Join Filter: (s.segmentid =
ppc.segmentid)
-> Seq Scan on segments s
(cost=0.00..1.05 rows=1 width=8) (actual time=0.025..0.032 rows=1
loops=1)
Filter: (segmentid = 93)
-> CTE Scan on
positive_predicates_count ppc (cost=0.00..0.09 rows=1 width=12)
(actual time=0.086..0.129 rows=1 loops=1)
Filter: (ppc.segmentid = 93)
-> Seq Scan on contacts_lists cl
(cost=0.00..1243.02 rows=64702 width=36) (actual time=0.014..231.331
rows=64702 loops=1)
-> Hash (cost=0.25..0.25 rows=1 width=44)
(actual time=2734.535..2734.535 rows=64672 loops=1)
-> CTE Scan on positive_segment_matches
psm (cost=0.00..0.25 rows=1 width=44) (actual time=1733.169..2481.359
rows=64672 loops=1)
Filter: (segmentid = 93)
-> Subquery Scan "*SELECT* 2" (cost=0.00..492.02 rows=109
width=36) (actual time=23.767..23.767 rows=0 loops=1)
-> CTE Scan on segments_contacts_tags st
(cost=0.00..490.93 rows=109 width=36) (actual time=23.758..23.758
rows=0 loops=1)
Filter: (tagtype = 0)
Total runtime: 6278.849 ms

Involved tables:
----------------

Table "public.contacts_lists"
Column | Type | Modifiers
----------------+-----------------------------
+------------------------------
email | email | not null
lid | integer | not null
lstatus | character(1) | not null default
'a'::bpchar
ladded | timestamp without time zone | not null default now()
lstatuschanged | timestamp without time zone | not null default now()
Indexes:
"contacts_lists_pkey" PRIMARY KEY, btree (email, lid)
"contacts_lists_statchanged" btree (lstatuschanged)
"contacts_lists_status" btree (lstatus)
Check constraints:
"status_check" CHECK (lstatus = 'a'::bpchar OR lstatus =
'b'::bpchar OR lstatus = 'u'::bpchar)
Foreign-key constraints:
"contacts_lists_email_fkey" FOREIGN KEY (email) REFERENCES
contacts(email) ON UPDATE CASCADE ON DELETE CASCADE
"contacts_lists_lid_fkey" FOREIGN KEY (lid) REFERENCES lists(lid)
ON UPDATE RESTRICT ON DELETE CASCADE

Table "public.contacts_tags"
Column | Type | Modifiers
---------+-------+-----------
email | email | not null
tagname | text | not null
Indexes:
"contacts_tags_pkey" PRIMARY KEY, btree (email, tagname)
"contacts_tags_tagname" btree (tagname)
Foreign-key constraints:
"contacts_tags_email_fkey" FOREIGN KEY (email) REFERENCES
contacts(email) ON UPDATE CASCADE ON DELETE CASCADE
"contacts_tags_tagname_fkey" FOREIGN KEY (tagname) REFERENCES
tags(tagname) ON UPDATE CASCADE ON DELETE CASCADE

Table "public.tags"
Column | Type | Modifiers
-----------+-----------------------------+------------------------
tagname | text | not null
createdat | timestamp without time zone | not null default now()
Indexes:
"tags_pkey" PRIMARY KEY, btree (tagname)
Referenced by:
TABLE "contacts_tags" CONSTRAINT "contacts_tags_tagname_fkey"
FOREIGN KEY (tagname) REFERENCES tags(tagname) ON UPDATE CASCADE ON
DELETE CASCADE
TABLE "segments_tags" CONSTRAINT "segments_tags_tagname_fkey"
FOREIGN KEY (tagname) REFERENCES tags(tagname) ON UPDATE CASCADE ON
DELETE CASCADE

Table "public.segments"
Column | Type |
Modifiers
-------------+-----------------------------
+--------------------------------------------------------------
segmentid | integer | not null default
nextval('segments_segmentid_seq'::regclass)
lid | integer | not null
segmentname | text | not null
createdat | timestamp without time zone | not null default now()
Indexes:
"segments_pkey" PRIMARY KEY, btree (segmentid)
Foreign-key constraints:
"segments_lid_fkey" FOREIGN KEY (lid) REFERENCES lists(lid) ON
UPDATE RESTRICT ON DELETE CASCADE
Referenced by:
TABLE "segments_tags" CONSTRAINT "segments_tags_segmentid_fkey"
FOREIGN KEY (segmentid) REFERENCES segments(segmentid) ON UPDATE
RESTRICT ON DELETE CASCADE

Table "public.segments_tags"
Column | Type | Modifiers
-----------+---------+-----------
segmentid | integer | not null
tagname | text | not null
tagtype | integer | not null
Indexes:
"segments_tags_pkey" PRIMARY KEY, btree (segmentid, tagname)
Foreign-key constraints:
"segments_tags_segmentid_fkey" FOREIGN KEY (segmentid) REFERENCES
segments(segmentid) ON UPDATE RESTRICT ON DELETE CASCADE
"segments_tags_tagname_fkey" FOREIGN KEY (tagname) REFERENCES
tags(tagname) ON UPDATE CASCADE ON DELETE CASCADE

-- anders

Browse pgsql-general by date

  From Date Subject
Next Message Frank jansen 2010-03-15 15:22:33 Move to core xml from contrib/xml2
Previous Message Vick Khera 2010-03-15 15:09:17 Re: libpq: compatibility with server versions