| From: | Steve Atkins <steve(at)blighty(dot)com> | 
|---|---|
| To: | pgsql-general(at)postgresql(dot)org | 
| Subject: | Re: optimizing SELECT with LIKE | 
| Date: | 2002-06-12 17:48:47 | 
| Message-ID: | 20020612104847.A69991@blighty.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-general | 
On Wed, Jun 12, 2002 at 11:43:25AM -0400, Jorge Sarmiento wrote:
> I have 3000000 rows in a database where I have to make a:
> 
> SELECT name FROM table WHERE name LIKE '%firstname%' AND name LIKE 
> '%secondname%';
> 
> to obtain the data I need.
> 
> Due to the data nature, there is no other way to look for the data, it's old 
> data that was registered in paper years ago, with no searchable index... just 
> names, that sometimes are wrote like: JORGE SARMIENTO, other times like: 
> SARMIENTO, JORGE and other times like JORGE LUIS SARMIENTO. and due to legal 
> reasons, the data must be entered the exact way it was wroten.
> Is there any way to optimize postgres for doing this kind of SELECT?
Not in any trivial way. The wildcard at the beginning of the match pattern
makes it impossible to use indexes.
One of the full text search modules might help, but a simpler way
might be to create a second table just to search on.
So from an input table like this
UID  |  NAME
-----------------------
 1   | SARMIENTO, JORGE
 2   | JORGE LUIS SARMIENTO
generate a search table with a perl script to convert the existing data
and index it on word. I've used code like this (untested, not complete):
$sh = $dbh->prepare("SELECT uid, name FROM table");
$ih = $dbh->prepare("INSERT INTO searchtable(uid, word) VALUES (?,?)");
 
$sh->execute();
 
while(($uid, $name) = $sh->fetchrow_array()) {
  @cleanname = split(' ', uc(join(' ', reverse(split(',', $name)))));
  foreach $el (@cleanname) {
    $ih->execute($uid, $el);
  }
}
UID  | WORD
-----------------------
 1   | SARMENIO
 1   | JORGE
 2   | JORGE
 2   | LUIS
 2   | SARMIENTO
Then you should be able to do a much more efficient query on the
search table to get the UID of the result.
SELECT uid FROM searchtable WHERE word='firstname' AND
  uid IN (SELECT uid FROM searchtable WHERE word='lastname')
or the more efficient equivalent I'm sure the SQL gurus on the
list will pop up with.
An even simpler way would be to do a single pass through the data to
parse the names and add firstname and lastname columns, if that's
possible given the format of your data.
UID  |  NAME                | FIRSTNAME |   LASTNAME
-----------------------------------------------------
 1   | SARMIENTO, JORGE     | JORGE     | SARMIENTO
 2   | JORGE LUIS SARMIENTO | JORGE     | SARMIENTO
I've used code something like this snippet (untested) for that:
$sh = $dbh->prepare("SELECT uid, name FROM table");
$uh = $dbh->prepare("UPDATE table SET firstname=?, lastname=? where uid=?");
$sh->execute();
while(($uid, $name) = $sh->fetchrow_array()) {
  @cleanname = split(' ', uc(join(' ', reverse(split(',', $name)))));
  $firstname = $cleanname[0];
  $lastname = $cleanname[$#cleanname];
  $uh->execute($firstname, $lastname, $uid);
}
All of the name munging and index generation could be done on insert triggers,
making it transparent to insertions of new data, but I tend to handle that
at an application level rather than within the database.
Cheers,
  Steve
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Doug Fields | 2002-06-12 17:59:34 | Re: "set nice" running a transaction in a lower | 
| Previous Message | Clark C . Evans | 2002-06-12 17:33:16 | "set nice" running a transaction in a lower priority? |