Re: Problem search on text arrays, using the overlaps (&&) operator

From: John Cheng <jlcheng(at)ymail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Problem search on text arrays, using the overlaps (&&) operator
Date: 2009-07-06 23:17:16
Message-ID: 220686.33449.qm@web43404.mail.sp1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

----- "nha" <lyondif02(at)free(dot)fr> wrote:

> From: "nha" <lyondif02(at)free(dot)fr>
> To: "John Cheng" <jlcheng(at)ymail(dot)com>
> Cc: pgsql-general(at)postgresql(dot)org
> Sent: Monday, July 6, 2009 9:12:22 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [GENERAL] Problem search on text arrays, using the overlaps (&&) operator
>
> Hello,
>
> With your exhaustive example statements based on table foo and cars, I
>
> performed some measures on my side (PostgreSQL 8.3.1 server). Here are
>
> some statistical results:
>
[ ... snipped ... ]
>
> In my opinion, analysis and optimization may be deepen over table
> indexes used for join planning. As your reported query plans show, the
>
> Where clauses are performed independantly from the table ml_lead; the
>
> reason is that all the attributes of the clauses belong to the table
> lead_reporting_data. Time may be reduced on join condition
> achievements.
>
> Hoping this observation will contribute a little to your opinion.
>
> Without any claim, I attached a document to this email for details on
>
> the measures I took with the overlap operator -- OpenDocument
> Spreadsheet (ODS) v2 formatted file, 24 kiB. The 3rd sheet "various"
> presents the detailed measures related to the data reported in this
> email.
>
> Regards.
>
> --
> nha / Lyon / France.

Hi nha,

I had not expected anyone to go to such lengths to evaluate my
situation, thank you so much!

After looking at your analysis, I realized that the test case I
created isn't close enough to the queries running in our prod
environment. For one, table 'foo' does not join to another table; The
other thing is that the amount of data isn't the same; Finally, these
tables have been ANALYZED.

So I took some time to update the test case. On our server, running
8.3.6, I was able to reproduce the difference between the two styles:
"arr&&{f,b}" and "arr&&{f} or arr&&{b}".

First, the setup:

-- Begin test case
-- Sets up 'bar'
SELECT id INTO TEMP TABLE bar FROM (SELECT generate_series(1,300000) as id) AS bar;
CREATE INDEX bar_idx ON bar (id);
ANALYZE bar;
-- Sets up 'foo'
CREATE TEMP SEQUENCE foo_bar_id_seq;
CREATE TEMP TABLE foo (
bar_id numeric DEFAULT NEXTVAL('foo_bar_id_seq'),
keywords text[]
);
CREATE INDEX foo_idx ON foo USING gin (keywords);
INSERT INTO foo (keywords) VALUES ('{ford}'::text[]);
INSERT INTO foo (keywords) VALUES ('{toyota}'::text[]);
INSERT INTO foo (keywords) VALUES ('{volkswagen}'::text[]);
INSERT INTO foo (keywords) VALUES ('{saturn}'::text[]);
INSERT INTO foo (keywords) VALUES ('{honda}'::text[]);
INSERT INTO foo (keywords) VALUES ('{porsche}'::text[]);
INSERT INTO foo (keywords) VALUES ('{porsche, audi, chrysler}'::text[]);
INSERT INTO foo (keywords) VALUES ('{honda, hummer, ferrari}'::text[]);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
ANALYZE foo;
-- End test case

Query for the form "arr&&{f,b}"
SELECT
count(*)
FROM foo
INNER JOIN bar ON foo.bar_id = bar.id
WHERE
foo.keywords && '{ford, toyota, volkswagen, saturn, honda, porsche, hummer, ferrari}'::text[];

Query for the form "arr&&{f} or arr&&{b}":
SELECT
count(*)
FROM foo, bar
WHERE
foo.bar_id = bar.id
AND
(
keywords && '{ford}'::text[]
OR keywords && '{toyota}'::text[]
OR keywords && '{volkswagen}'::text[]
OR keywords && '{saturn}'::text[]
OR keywords && '{honda}'::text[]
OR keywords && '{porsche}'::text[]
OR keywords && '{hummer}'::text[]
OR keywords && '{ferrari}'::text[]
);

For the first form, "arr&&{f,b}", the query takes about 15
seconds. For the second form "arr&&{f} or arr&&{b}", we get about 8
seconds. The difference is around 90-100%, which is what I am seeing on
our real life queries.

The query plans also become similar to the real life query plans. But
I am having a hard time learning from them. The only interesting I see
is that the estimated cost seems to be different than the actual run
time. The second form has a higher estimated cost than the first form,
but has a lower run time.

In this test case, the query filters by any of 8 keywords. Note that
with just 2 keywords, the difference is only about 5 seconds. The
specific report that our users complained about involved 16
"keywords", where the difference is about 100%.

When you said "analysis and optimization may be deepen over table
indexes used for join planning", I'm not sure what you mean. Can you
clarify?

Browse pgsql-general by date

  From Date Subject
Next Message Ivan Sergio Borgonovo 2009-07-07 00:22:35 Re: combine multiple row values in to one row
Previous Message Lee Harr 2009-07-06 21:29:35 combine multiple row values in to one row