Re: LIKE search and performance

From: Richard Huxton <dev(at)archonet(dot)com>
To: mark(at)mark(dot)mielke(dot)cc
Cc: Mark Lewis <mark(dot)lewis(at)mir3(dot)com>, James Mansion <james(at)mansionfamily(dot)plus(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Alexander Staubo <alex(at)purefiction(dot)net>, Andy <frum(at)ar-sd(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject: Re: LIKE search and performance
Date: 2007-05-25 17:29:16
Message-ID: 46571CEC.7070202@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

mark(at)mark(dot)mielke(dot)cc wrote:
> On Fri, May 25, 2007 at 04:35:22PM +0100, Richard Huxton wrote:
>>> I notice you did not provide a real life example as requested. :-)
>> OK - any application that allows user-built queries: <choose column:
>> foo> <choose filter: contains> <choose target: "bar">
>> Want another? Any application that has a "search by name" box - users
>> can (and do) put one letter in and hit enter.
>> Unfortunately you don't always have control over the selectivity of
>> queries issued.
>
> The database has 10 million records. The user enters "bar" and it
> translates to "%bar%". You are suggesting that we expect bar to match
> 1 million+ records? :-)

I was saying that you don't know. At least, I don't know of any cheap
way of gathering full substring stats or doing a full substring
indexing. Even tsearch2 can't do that.

> I hope not. I would define this as bad process. I would also use "LIMIT"
> to something like "100".

Yes, but that's not the query we're talking about is it? If possible you
don't do '%bar%' searches at all. If you do, you try to restrict it
further or LIMIT the results. There's nothing to discuss in these cases.

>>> This seems like an ivory tower restriction. Not allowing best performance
>>> in a common situation vs not allowing worst performance in a not-so-common
>>> situation.
>> What best performance plan are you thinking of? I'm assuming we're
>> talking about trailing-wildcard matches here, rather than "contains"
>> style matches.
>
> "Trailing-wildcard" already uses B-Tree index, does it not?

Yes, it searches the btree and then checks the data for visibility. I
thought that was what you felt could be worked around. It appears I was
wrong.

> I am speaking of contains, as contains is the one that was said to
> require a seqscan. I am questioning why it requires a seqscan.

Well, you seemed to be suggesting you had something better in mind. At
least, that was my reading of your original post.

> The
> claim was made that with MVCC, the index is insufficient to check
> for visibility

True, for PG's implementation of MVCC. You *could* have visibility in
each index, but that obviously would take more space. For a table with
many indexes, that could be a *lot* more space. You also have to update
all that visibilty information too.

> and that the table would need to be accessed anyways,
> therefore a seqscan is required. I question whether a like '%bar%'
> should be considered a high selectivity query in the general case.
> I question whether a worst case should be assumed.

Well, the general rule-of-thumb is only about 10% for the changeover
between index & seq-scan. That is, once you are reading 10% of the rows
on disk (to check visibility) you might as well read them all (since
you'll be reading most of the blocks anyway if the rows are randomly
distributed). If you are doing SELECT * from that table then you'll want
all that data you read. If you are doing SELECT count(*) then you only
wanted the visibility :-(

Now you and I can look at a substring and probably make a good guess how
common it is (assuming we know the targets are British surnames or
Japanese towns). PG needs one number - or rather, it picks one number
for each length of search-string (afaik).

> Perhaps I question too much? :-)

Not sure it's possible to question too much :-)
However, you need to provide answers occasionally too - what numbers
would you pick? :-)

--
Richard Huxton
Archonet Ltd

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message PFC 2007-05-25 17:31:16 Re: LIKE search and performance
Previous Message Arnau 2007-05-25 17:16:05 Re: How PostgreSQL handles multiple DDBB instances?