From: | "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com> |
---|---|
To: | Grega Bremec <grega(dot)bremec(at)noviforum(dot)si> |
Cc: | <pgsql-performance(at)postgresql(dot)org> |
Subject: | Re: Wildcard searches & performance question |
Date: | 2003-05-27 19:48:39 |
Message-ID: | Pine.LNX.4.33.0305271342210.12134-100000@css120.ihs.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
What you want is full text searching. To see it in action go here:
To download it go here:
http://openfts.sourceforge.net/
There is also the older, and slightly slower full text indexing engine,
included in the /contrib/fulltextindex directory. It's a little more
wrung out, but also not as likely to get maintenance in the future.
Basically, full text indexing does exactly what you're asking for by
indexing each row inserted by each word in it (that isn't a noise word
like "the" or "a") and then uses the indexes created for its searches.
On Tue, 27 May 2003, Grega Bremec wrote:
> Hello,
>
> I have a database with the following layout:
>
> searchterms=# \d+ searches_2002
> Table "public.searches_2002"
> Column | Type | Modifiers | Description
> -----------+------------------------+-----------+-------------
> srchdate | date | not null |
> srchtime | time without time zone | not null |
> client_ip | inet | not null |
> srchquery | character varying(50) | not null |
> fhvalue | smallint | |
> Indexes: searches_2002_client_ip btree (client_ip),
> searches_2002_srchdate btree (srchdate),
> searches_2002_srchdatetime btree (srchdate, srchtime),
> searches_2002_srchquery btree (srchquery),
> searches_2002_srchquery_lcase btree (lower(srchquery)),
> searches_2002_srchquery_withfh btree (srchquery, fhvalue),
> searches_2002_srchtime btree (srchtime)
>
> There are no uniqueness properties that would make it possible for this table
> to have a primary key, as it is a list of searches performed on a search
> engine and the users' behaviour is, well... umm, odd, to be mild. :)
>
> Also, do note that this is a test table, so nevermind the excessive amount of
> indexes - performance is not an issue here, I am still evaluating the need and
> benefits of having various indexes on those columns.
>
> The particular case I am interested is this: when executing queries involving
> pattern searches using various operators on srchquery, none of the indexes are
> used in certain cases, namely those LIKE and regexp filters that start with
> a wildcard.
>
> This makes perfect sense, because wildcard pattern searches that start with a
> wildcard, can not really benefit from an index scan, because a sequential scan
> is probably going to be faster: we are only going to benefit from scanning an
> index in those special cases where the wildcard evaluates to a zero-length
> string.
>
> One example of a query plan:
>
> searchterms=# explain select count(*)
> from searches_2002
> where srchquery like '%curriculum%';
> QUERY PLAN
> --------------------------------------------------------------------------
> Aggregate (cost=4583.26..4583.26 rows=1 width=0)
> -> Seq Scan on searches_2002 (cost=0.00..4583.26 rows=1 width=0)
> Filter: (srchquery ~~ '%curriculum%'::text)
>
> There is 211061 records in this test table, but the real-life tables would
> contain a much much larger amount of data, more like 50+ million rows.
>
> This promise of a hell on earth trying to optimize performance makes me wonder:
> would there be a sensible way/reason for avoiding sequential scans on queries
> that start with a wildcard, and would avoiding sequential scans even be
> feasible in such cases? Or in other words, can I somehow optimize LIKE and
> regexp queries that start with a wildcard?
>
> TIA,
>
From | Date | Subject | |
---|---|---|---|
Next Message | Josh Berkus | 2003-05-27 19:55:46 | Re: Wildcard searches & performance question |
Previous Message | Andrew Sullivan | 2003-05-27 19:37:00 | Re: Wildcard searches & performance question |