From: | Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com> |
---|---|
To: | Leif Biberg Kristensen <leif(at)solumslekt(dot)org> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Fuzzy string matching of product names |
Date: | 2010-04-06 10:02:47 |
Message-ID: | y2kdb471ace1004060302qcaa5788fjae649dd93206858f@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
> I've got a similar problem in my 18th century research, when clerks usually
> took pride in being able to spell a name in any number of ways. I've landed on
> a solution where I'm sending search strings to SIMILAR TO. I usually get far
> too many hits, but it's much easier to browse through 100 hits than the entire
> dataset which is approaching 60,000 records.
>
> Optimizing the search strings is based upon a lot of experience.
That sounds like an interesting problem - mine sounds mundane in comparison.
I now seem to be getting reasonable results with pg_trgm coupled with
ILIKE. I ORDER BY description ILIKE '%%usr_str%%' DESC,
similarity(description, 'usr_str') DESC , prioritising products that
actually contain the user specified string. I have tweaked the "is
greater than similarity" of my queries, to include a bit of rubbish to
be on the safe side, but not too much. I've done this with what I
believe to be a representative dataset. You can create a GiST or GIN
index on text fields for these queries, which is nice.
Have you tried using pg_trgm with your dataset? You might have more
success. It isn't biased towards a particular natural language. Also,
I suggest you avoid SQL regular expressions (which are supported with
SIMILAR TO), and use the ~ operator instead, which gives you the more
powerful POSIX regular expressions, unless portability is a major
concern.
Regards,
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Björn Lindqvist | 2010-04-06 11:28:47 | Re: Completely wrong row estimates |
Previous Message | John R Pierce | 2010-04-06 07:42:07 | Re: desktop heap usage in windows |