From: | John Cheng <jlcheng(at)ymail(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Problem search on text arrays, using the overlaps (&&) operator |
Date: | 2009-07-02 00:07:15 |
Message-ID: | 534564.57504.qm@web43406.mail.sp1.yahoo.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
We use text[] on one of our tables. This text[] column allows us to
search for records that matches a keyword in a set of keywords. For
example, if we want to find records that has a keyword of "foo" or
"bar", we can use the condition:
keywords && '{foo, bar}'::text[]
Another wau is to do this:
(keywords && '{foo}::text[]' OR keywords && '{bar}::text[]')
I am noticing a big difference between the two ways. I'm trying to
find out if we need to re-write our queries to speed them up, or
perhaps I am just missing something about how to use text[].
To set up a simple test case, use:
CREATE TEMP TABLE foo (
keywords text[]
);
CREATE INDEX foo_idx ON foo USING gin (keywords);
INSERT INTO foo VALUES ('{ford}'::text[]);
INSERT INTO foo VALUES ('{toyota}'::text[]);
INSERT INTO foo VALUES ('{volkswagen}'::text[]);
INSERT INTO foo VALUES ('{dodge}'::text[]);
INSERT INTO foo VALUES ('{saturn}'::text[]);
INSERT INTO foo VALUES ('{honda}'::text[]);
INSERT INTO foo VALUES ('{porsche}'::text[]);
INSERT INTO foo VALUES ('{porsche, audi, chrysler}'::text[]);
INSERT INTO foo VALUES ('{honda, hummer, ferrari}'::text[]);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
-- Query of form 'arr && {foo, bar}'
EXPLAIN ANALYZE SELECT
count(*)
FROM foo
WHERE keywords && '{ford, toyota}'::text[];
------ result ------
QUERY PLAN:
Aggregate (cost=9870.50..9870.51 rows=1 width=0) (actual time=449.937..449.938 rows=1 loops=1)
-> Bitmap Heap Scan on foo (cost=104.88..9853.56 rows=6778 width=0) (actual time=61.197..308.724 rows=262144 loops=1)
Recheck Cond: (keywords && '{ford,toyota}'::text[])
-> Bitmap Index Scan on foo_idx (cost=0.00..103.19 rows=6778 width=0) (actual time=58.816..58.816 rows=262144 loops=1)
Index Cond: (keywords && '{ford,toyota}'::text[])
Total runtime: 450.121 ms
(6 rows)
-- Query of form 'arr && {foo} OR arr && bar'
EXPLAIN ANALYZE SELECT count(*) FROM foo WHERE
(
keywords && '{ford}'::text[]
OR keywords && '{toyota}'::text[]
)
------ result ------
QUERY PLAN:
Aggregate (cost=11351.85..11351.86 rows=1 width=0) (actual time=424.389..424.389 rows=1 loops=1)
-> Bitmap Heap Scan on foo (cost=213.13..11318.04 rows=13522 width=0) (actual time=43.728..273.913 rows=262144 loops=1)
Recheck Cond: ((keywords && '{ford}'::text[]) OR (keywords && '{toyota}'::text[]))
-> BitmapOr (cost=213.13..213.13 rows=13556 width=0) (actual time=41.386..41.386 rows=0 loops=1)
-> Bitmap Index Scan on foo_idx (cost=0.00..103.19 rows=6778 width=0) (actual time=21.216..21.216 rows=131072 loops=1)
Index Cond: (keywords && '{ford}'::text[])
-> Bitmap Index Scan on foo_idx (cost=0.00..103.19 rows=6778 width=0) (actual time=20.167..20.167 rows=131072 loops=1)
Index Cond: (keywords && '{toyota}'::text[])
Total runtime: 424.431 ms
(9 rows)
The difference is very little here. However, in our application I am
seeing a much bigger difference. The affected query is a lot more
complicated:
First, a query of the form "keywords && '{foo, bar}'::text[]"
EXPLAIN ANALYZE SELECT
count(*)
FROM mb_lead ml
INNER JOIN lead_reporting_data lrd ON lrd.lead_id = ml.lead_id
WHERE lrd.typeflags && '{autobytel.volume, automotive}'::text[];
------ result ------
QUERY PLAN:
Aggregate (cost=71602.10..71602.11 rows=1 width=0) (actual time=13196.895..13196.896 rows=1 loops=1)
-> Hash Join (cost=60278.37..71598.14 rows=1582 width=0) (actual time=1924.076..13170.602 rows=29567 loops=1)
Hash Cond: (ml.lead_id = lrd.lead_id)
-> Seq Scan on mb_lead ml (cost=0.00..6557.98 rows=316398 width=8) (actual time=0.014..293.214 rows=316398 loops=1)
-> Hash (cost=60022.57..60022.57 rows=20464 width=8) (actual time=1922.050..1922.050 rows=473743 loops=1)
-> Bitmap Heap Scan on lead_reporting_data lrd (cost=808.14..60022.57 rows=20464 width=8) (actual time=424.841..1276.990 rows=473743 loops=1)
Recheck Cond: (typeflags && '{autobytel.volume,automotive}'::text[])
-> Bitmap Index Scan on lead_reporting_data_typeflags_idx (cost=0.00..803.02 rows=20464 width=0) (actual time=308.941..308.941 rows=483587 loops=1)
Index Cond: (typeflags && '{autobytel.volume,automotive}'::text[])
Total runtime: 13197.015 ms
Second, a query of the form:
"keywords && '{foo}'::text[] OR keywords && '{bar}'::text[]"
EXPLAIN ANALYZE SELECT
count(*)
FROM mb_lead ml
INNER JOIN lead_reporting_data lrd ON lrd.lead_id = ml.lead_id
WHERE (lrd.typeflags && '{autobytel.volume}'::text[]
OR lrd.typeflags && '{automotive}'::text[])
------ result ------
QUERY PLAN:
Aggregate (cost=112761.86..112761.87 rows=1 width=0) (actual time=7768.672..7768.673 rows=1 loops=1)
-> Hash Join (cost=101418.46..112753.97 rows=3156 width=0) (actual time=1850.560..7743.651 rows=29567 loops=1)
Hash Cond: (ml.lead_id = lrd.lead_id)
-> Seq Scan on mb_lead ml (cost=0.00..6557.98 rows=316398 width=8) (actual time=0.013..274.131 rows=316398 loops=1)
-> Hash (cost=100908.13..100908.13 rows=40826 width=8) (actual time=1849.519..1849.519 rows=473743 loops=1)
-> Bitmap Heap Scan on lead_reporting_data lrd (cost=1626.46..100908.13 rows=40826 width=8) (actual time=357.613..1211.535 rows=473743 loops=1)
Recheck Cond: ((typeflags && '{autobytel.volume}'::text[]) OR (typeflags && '{automotive}'::text[]))
-> BitmapOr (cost=1626.46..1626.46 rows=40928 width=0) (actual time=240.921..240.921 rows=0 loops=1)
-> Bitmap Index Scan on lead_reporting_data_typeflags_idx (cost=0.00..803.02 rows=20464 width=0) (actual time=87.368..87.368 rows=161264 loops=1)
Index Cond: (typeflags && '{autobytel.volume}'::text[])
-> Bitmap Index Scan on lead_reporting_data_typeflags_idx (cost=0.00..803.02 rows=20464 width=0) (actual time=153.546..153.546 rows=322317 loops=1)
Index Cond: (typeflags && '{automotive}'::text[])
Total runtime: 7768.788 ms
-----------
For some reason, I am seeing a big difference in our real database. I
don't want to just rewrite all of our queries yet. I'm guessing the
data makes a big difference. What would be a good way to examine the
data to figure out what's the best way to write our queries? Is there
any features in PostgreSQL that can help me improve the performance?
Any advice would be greatly appreciated!
From | Date | Subject | |
---|---|---|---|
Next Message | Greg Stark | 2009-07-02 00:32:40 | Re: 1 Sequence per Row i.e. each customer's first order starts at 1 |
Previous Message | Merrick | 2009-07-02 00:04:33 | 1 Sequence per Row i.e. each customer's first order starts at 1 |