Re: Problem search on text arrays, using the overlaps (&&) operator

From: nha <lyondif02(at)free(dot)fr>
To: John Cheng <jlcheng(at)ymail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Problem search on text arrays, using the overlaps (&&) operator
Date: 2009-07-08 11:13:50
Message-ID: 4A547F6E.4030705@free.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,

Le 8/07/09 0:52, John Cheng a écrit :
> I don't mean to be pesky. I was just wondering if there is anything
> else I should try?
>
> Should I simply rewrite all queries, change the form
>
> WHERE textarr && '{foo, bar}'::text[]
>
> To
>
> WHERE (textarr && '{foo}'::text[]
> OR textarr && '{bar}'::text[])
>
> ?
>

While still puzzled by the big runtime difference you report between the
2 condition forms, I went on assessing these runtimes on my side from
the new case statements that are assumed to reflect more the real world.

Here are some measure results I got: (sorry for this long table)

seq style runtime
--- ----- -------
(db=slf)
N01 OR-EA 6 237
N02 CC-EA 5 250
N03 OR+EA 12 628
N04 CC+EA 12 700
N05 OR+EA 15 679
N06 CC+EA 11 510
N07 CC-EA 7 712
N08 OR-EA 8 741
N09 CC-EA 4 963
N10 OR-EA 6 499
(db=stg)
N11 CC+EA 15 978
N12 OR+EA 15 350
N13 CC-EA 8 102
N14 OR-EA 9 428
N15 OR-EA 5 267
N16 CC-EA 5 017
N17 OR-EA 6 119
N18 CC-EA 4 955
N19 OR+EA 11 722
N20 CC+EA 11 532
N21 OR-EA 7 303
N22 CC-EA 5 438
N23 CC-EA 5 519
N24 OR-EA 5 373
N25 OR-EA 5 422
N26 CC-EA 5 064
(db=stg)
N27 CC-EA 8 314
(db=slf)
N28 OR-EA 6 656
(db=stg)
N29 OR-EA 6 760
(db=slf)
N30 CC-EA 6 753
(db=stg)
N31 CC-EA 5 500
(db=slf)
N32 OR-EA 5 907
(db=stg)
N33 OR-EA 5 391
(db=slf)
N34 CC-EA 5 517
--- ----- ----------

Legend
------
seq: sequence order.
style: condition style of query.
CC: style "arr&&{f,b}" (one clause with multi-value text table).
OR: style "arr&&{f} or arr&&{b}" (many clauses with 1-value text table).
OR2: same style as style OR, with explicit JOIN in query expression.
+EA: performed with EXPLAIN ANALYZE on query. Slower.
-EA: performed without EXPLAIN ANALYZE on query. Faster.
runtime: run time in milliseconds.
(db=?): indicates that the following sequences have been performed after
a drop-and-create process for all the tables and indexes.
------

Results from 2 selected EXPLAIN ANALYZE sequences:

-- seq 03 (OR+EA)
Aggregate (cost=37630.52..37630.53 rows=1 width=0) (actual
time=12628.182..12628.184 rows=1 loops=1)
-> Hash Join (cost=25989.12..37601.04 rows=11792 width=0) (actual
time=8796.002..12231.422 rows=300000 loops=1)
Hash Cond: ((bar.id)::numeric = foo.bar_id)
-> Seq Scan on bar (cost=0.00..4328.00 rows=300000 width=4)
(actual time=0.025..402.458 rows=300000 loops=1)
-> Hash (cost=24636.81..24636.81 rows=82425 width=8) (actual
time=8795.027..8795.027 rows=2097152 loops=1)
-> Bitmap Heap Scan on foo (cost=1565.44..24636.81
rows=82425 width=8) (actual time=670.248..5098.109 rows=2097152 loops=1)
Recheck Cond: ((keywords && '{ford}'::text[]) OR
(keywords && '{toyota}'::text[]) OR (keywords && '{volkswagen}'::text[])
OR (keywords && '{saturn}'::text[]) OR (keywords && '{honda}'::text[])
OR (keywords && '{porsche}'::text[]) OR (keywords && '{hummer}'::text[])
OR (keywords && '{ferrari}'::text[]))
-> BitmapOr (cost=1565.44..1565.44 rows=83879
width=0) (actual time=665.516..665.516 rows=0 loops=1)
-> Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=114.013..114.013
rows=262144 loops=1)
Index Cond: (keywords && '{ford}'::text[])
-> Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=72.398..72.398
rows=262144 loops=1)
Index Cond: (keywords && '{toyota}'::text[])
-> Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=74.118..74.118
rows=262144 loops=1)
Index Cond: (keywords &&
'{volkswagen}'::text[])
-> Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=58.486..58.486
rows=262144 loops=1)
Index Cond: (keywords && '{saturn}'::text[])
-> Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=114.671..114.671
rows=524288 loops=1)
Index Cond: (keywords && '{honda}'::text[])
-> Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=115.290..115.290
rows=524288 loops=1)
Index Cond: (keywords &&
'{porsche}'::text[])
-> Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=58.184..58.184
rows=262144 loops=1)
Index Cond: (keywords && '{hummer}'::text[])
-> Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=58.336..58.336
rows=262144 loops=1)
Index Cond: (keywords &&
'{ferrari}'::text[])
Total runtime: 12628.401 ms
-- seq 03 (OR+EA)

-- seq 04 (CC+EA)
Aggregate (cost=26726.37..26726.38 rows=1 width=0) (actual
time=12700.620..12700.621 rows=1 loops=1)
-> Hash Join (cost=17879.62..26722.62 rows=1500 width=0) (actual
time=8482.572..12272.449 rows=300000 loops=1)
Hash Cond: ((bar.id)::numeric = foo.bar_id)
-> Seq Scan on bar (cost=0.00..4328.00 rows=300000 width=4)
(actual time=0.029..412.984 rows=300000 loops=1)
-> Hash (cost=17748.56..17748.56 rows=10485 width=8) (actual
time=8481.524..8481.524 rows=2097152 loops=1)
-> Bitmap Heap Scan on foo (cost=177.69..17748.56
rows=10485 width=8) (actual time=978.464..4679.954 rows=2097152 loops=1)
Recheck Cond: (keywords &&
'{ford,toyota,volkswagen,saturn,honda,porsche,hummer,ferrari}'::text[])
-> Bitmap Index Scan on foo_idx (cost=0.00..175.07
rows=10485 width=0) (actual time=973.569..973.569 rows=2097152 loops=1)
Index Cond: (keywords &&
'{ford,toyota,volkswagen,saturn,honda,porsche,hummer,ferrari}'::text[])
Total runtime: 12700.838 ms
-- seq 04 (CC+EA)

As far as I tried, the runtime difference between the 2 forms varies up
to 10% and always under 1 second--ie. far away from the 100% and (15-8=)
7 seconds reported. Moreover the form "arr&&{f,b}" (let us call it: form
1) often shows better than the other (let us call it: form 2)--already
noticed this fact by previous measures without joins.

Exception occurs in a special case: for the first query submitted
immediately after populating tables. It seems as if indexes were not
optimized at this time. But all the following queries, whatever form is
considered, run in lesser and similar time.

When observing the runtime of intermediate steps (bitmap heap scan on
table foo [step 1], sequential scan on table bar [step 2], hash join
[step 3]), it can be noted that all 3 steps run in a comparable time
whatever form used:
- no relevant difference for steps 2 and 3, but an attended long time
for retrieving all the resulting rows because of their high number;
- for step 1: form 1 (arr&&{f,b}) needs more time to extract the 1st row
of the resultset but is faster to collect the last rows. So no
significant difference may be also revealed here.

As I mentioned, I noticed that the runtime is longer for the 1st query
performed after populating tables and that, since the 2nd query
performed, runtimes clearly decrease towards a mean (about 6 seconds on
my side). Maybe the runtimes reported (15 and 8 seconds respectively)
occur in such a case (just after populating tables) with a performing
order starting with form 1. As much as I saw, the sequence order acts
upon the runtime reduction. Moreover successive queries tend to mean and
equal runtimes with the 2 forms.

About the user complaint, I would be not so astonished because of the
duration of one query (more than 5 seconds in the 2 cases). There are so
many rows to select that it seems not really possible to get a lesser
runtime.

What I may conclude: query plans show difference in performing the
queries: each Where clause is assessed before a commun heap scan of
table foo. The Where clause of the form 1 (multi-value text table) is
longer to run because of multi-value to test for each key; the Where
clause of the form 2 (1-value text table) is faster but the addition of
the corresponding 1-value text table Where clauses runs in a comparable
time. This observation resulted from measures on my sides and may not
correctly reflect some proper configurations of the real production
environment.

Opting to form 2 (arr&&{f} or arr&&{b}) will surely not make runtime
worst. So it would not be a bad choice to implement it if convinced.

Another way could concern the hash join. It has been shown that this
step costs a lot with respect to the overall runtime. Depending on
available storage space and DBMS load, a kind of materialized view may
be handled in order to cut off the overloading join. Here are some
suggested statements to create this helper table:

-- Creates a table jfoobar wrt. tables foo and bar
-- (max 10 seconds elapsed on my platform)
CREATE OR REPLACE TABLE jfoobar AS
SELECT
foo.bar_id, foo.keywords
FROM foo
INNER JOIN bar ON foo.bar_id = bar.id;
-- Creates an index for jfoobar
-- (max 3 seconds elapsed on my platform)
-- this way...
CREATE INDEX jfoobar_idx ON jfoobar(bar_id, keywords);
-- or this one...
DROP INDEX jfoobar_idx;
CREATE INDEX jfoobar_idx ON jfoobar USING gin(keywords);
--

This table may be updated or replaced on a regular time base or by
triggered event on updating foo or bar tables.

Finally, any query of the form 1 or 2 will run in less than 300 ms:
-- form 1: unique text table
SELECT count(*) FROM jfoobar
WHERE
keywords && '{ford, toyota, volkswagen, saturn, honda, porsche,
hummer, ferrari}'::text[];

-- form 2: developped text table
SELECT count(*) FROM jfoobar
WHERE
(
keywords && '{ford}'::text[]
OR keywords && '{toyota}'::text[]
OR keywords && '{volkswagen}'::text[]
OR keywords && '{saturn}'::text[]
OR keywords && '{honda}'::text[]
OR keywords && '{porsche}'::text[]
OR keywords && '{hummer}'::text[]
OR keywords && '{ferrari}'::text[]
);
--

The runtime would only depend on the update strategy for the join table.

Regards.

--
nha / Lyon / France.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Abbas 2009-07-08 11:18:52 Re: Password?
Previous Message Sebastien FLAESCH 2009-07-08 11:05:27 Re: Normalize INTERVAL ouput format in a db driver