Re: optimizing SELECT with LIKE

From: Steve Atkins <steve(at)blighty(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: optimizing SELECT with LIKE
Date: 2002-06-12 17:48:47
Message-ID: 20020612104847.A69991@blighty.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Jun 12, 2002 at 11:43:25AM -0400, Jorge Sarmiento wrote:
> I have 3000000 rows in a database where I have to make a:
>
> SELECT name FROM table WHERE name LIKE '%firstname%' AND name LIKE
> '%secondname%';
>
> to obtain the data I need.
>
> Due to the data nature, there is no other way to look for the data, it's old
> data that was registered in paper years ago, with no searchable index... just
> names, that sometimes are wrote like: JORGE SARMIENTO, other times like:
> SARMIENTO, JORGE and other times like JORGE LUIS SARMIENTO. and due to legal
> reasons, the data must be entered the exact way it was wroten.

> Is there any way to optimize postgres for doing this kind of SELECT?

Not in any trivial way. The wildcard at the beginning of the match pattern
makes it impossible to use indexes.

One of the full text search modules might help, but a simpler way
might be to create a second table just to search on.

So from an input table like this

UID | NAME
-----------------------
1 | SARMIENTO, JORGE
2 | JORGE LUIS SARMIENTO

generate a search table with a perl script to convert the existing data
and index it on word. I've used code like this (untested, not complete):

$sh = $dbh->prepare("SELECT uid, name FROM table");
$ih = $dbh->prepare("INSERT INTO searchtable(uid, word) VALUES (?,?)");

$sh->execute();

while(($uid, $name) = $sh->fetchrow_array()) {
@cleanname = split(' ', uc(join(' ', reverse(split(',', $name)))));
foreach $el (@cleanname) {
$ih->execute($uid, $el);
}
}

UID | WORD
-----------------------
1 | SARMENIO
1 | JORGE
2 | JORGE
2 | LUIS
2 | SARMIENTO

Then you should be able to do a much more efficient query on the
search table to get the UID of the result.

SELECT uid FROM searchtable WHERE word='firstname' AND
uid IN (SELECT uid FROM searchtable WHERE word='lastname')

or the more efficient equivalent I'm sure the SQL gurus on the
list will pop up with.

An even simpler way would be to do a single pass through the data to
parse the names and add firstname and lastname columns, if that's
possible given the format of your data.

UID | NAME | FIRSTNAME | LASTNAME
-----------------------------------------------------
1 | SARMIENTO, JORGE | JORGE | SARMIENTO
2 | JORGE LUIS SARMIENTO | JORGE | SARMIENTO

I've used code something like this snippet (untested) for that:

$sh = $dbh->prepare("SELECT uid, name FROM table");
$uh = $dbh->prepare("UPDATE table SET firstname=?, lastname=? where uid=?");

$sh->execute();

while(($uid, $name) = $sh->fetchrow_array()) {
@cleanname = split(' ', uc(join(' ', reverse(split(',', $name)))));
$firstname = $cleanname[0];
$lastname = $cleanname[$#cleanname];
$uh->execute($firstname, $lastname, $uid);
}

All of the name munging and index generation could be done on insert triggers,
making it transparent to insertions of new data, but I tend to handle that
at an application level rather than within the database.

Cheers,
Steve

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Doug Fields 2002-06-12 17:59:34 Re: "set nice" running a transaction in a lower
Previous Message Clark C . Evans 2002-06-12 17:33:16 "set nice" running a transaction in a lower priority?