Quick Links

Re: looking for a faster way to do that

From:	hamann(dot)w(at)t-online(dot)de
To:	pgsql-general(at)postgresql(dot)org
Subject:	Re: looking for a faster way to do that
Date:	2011-09-23 07:45:51
Message-ID:	4E7C392F.mailGNP11KAXQ@amadeus3.local
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Alban Hertroys <haramrae(at)gmail(dot)com> wrote:

>> What is the output of explain?
>>
>> You say 'the other table', so presumably we're dealing with a foreign key
>> here. Is there an index on that column?

Albe Laurenz wrote:

>> Is the index used for "where code ~ '^ABC3563'"?
>>
>> If not, then the result is fast only because the table is scanned only once,
>> and it's just the factor of 3000 that's killing you.
>>
>> The second query (where code ~ wantcode) can never use an index because
>> the pattern "wantcode" is unknown at query planning time.
>>
>> Yours,
>> Laurenz Albe

Here I created a subset (just number and code matching a certain prefix)

select count(*) from items;
count
-------
9614

A single anchored query
select * from items where code ~ '^ABC';
does indeed use the index to retrieve data.

Next I copied a file of wanted codes

create temp table n (wantcode text);
\copy n from /tmp/rmartin.tmp

the file contains plain names, i.e. unanchored matches

explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;
Nested Loop (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
Join Filter: (("outer".code)::text ~ "inner".wantcode)
-> Seq Scan on items (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
-> Materialize (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
-> Seq Scan on n (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
Total runtime: 336286.692 ms

An exact match "where items.code = n.wantcode" on the same data completes in 40 ms

BTW: indexing the second table does not affect the query plan or the runtime, it just shows
actual row count rather than estimate.

This is, of course, bad; an anchored match could be faster and also is more appropriate
to the scenario. So I change the contents of the second table

update n set wantcode = textcat('^', wantcode);

and try again, with similar results
Nested Loop (cost=14.15..176478.01 rows=39178 width=36) (actual time=125.114..308831.697 rows=2871 loops=1)
Join Filter: (("outer".code)::text ~ "inner".wantcode)
-> Seq Scan on items (cost=0.00..167.14 rows=9614 width=42) (actual time=0.061..2034.572 rows=9614 loops=1)
-> Materialize (cost=14.15..22.30 rows=815 width=32) (actual time=0.001..1.095 rows=815 loops=9614)
-> Seq Scan on n (cost=0.00..14.15 rows=815 width=32) (actual time=0.114..1.893 rows=815 loops=1)
Total runtime: 308837.746 ms

I am aware that this is unlikely to work fast (the planner would perhaps need a hint in the query
rather than in the data column to choose an anchored match algorithm (in case there is
such an algo, of course)

So I wonder whether there might be a different approach to this problem rather than
pattern matching.
I recall I had a similar problem before with a "contacts" column possibly containing one or more
email addresses. Here searches would also be number of people times number of requests
performance. I finally ended up with a @@ match (contrib/tsquery) and a supporting GIST index,
but that only supports exact match, not prefix

Regards
Wolfgang Hamann

Responses

Re: looking for a faster way to do that at 2011-09-23 10:23:28 from Eduardo Morras
Re: looking for a faster way to do that at 2011-09-23 10:39:33 from Alban Hertroys

Browse pgsql-general by date

	From	Date	Subject
Next Message	Andrew Rose	2011-09-23 09:47:15	Relative performance of prefix and suffix string matching
Previous Message	Mike Christensen	2011-09-23 07:03:20	Re: Materialized views in Oracle