Re: General guidance: Levenshtein distance versus other similarity algorithms

From: Rachel Owsley <Rachel(dot)Owsley(at)edointeractive(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: General guidance: Levenshtein distance versus other similarity algorithms
Date: 2012-07-25 20:15:33
Message-ID: 81F2AED71E996746829AC866496B2EA361B38FC6DE@MAIL-NASH01.edo.local
Lists: pgsql-general

Thanks, Merlin. I will give that one a try.

-----Original Message-----
From: Merlin Moncure [mailto:mmoncure(at)gmail(dot)com]
Sent: Wednesday, July 25, 2012 1:32 PM
To: Rachel Owsley
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] General guidance: Levenshtein distance versus other similarity algorithms

On Mon, Jul 23, 2012 at 11:55 AM, Rachel Owsley <Rachel(dot)Owsley(at)edointeractive(dot)com> wrote:
> Hi,
>
> I am hoping you can give me some guidance here. I'm using PostgreSQL 9.1.
>
> Basically, I'm trying to write a query against a table of businesses
> that returns all close matches to a given business name. This is a
> huge table, and there is a lot of variation in the names. A name can
> be up to 255 characters long. I've used regexes, but they always miss
> some variations of the name, so I decided to look at distance measures.
>
> Has anyone compared the fuzzystrmatch module to pg_similarity?
>
> Would the levenshtein function in PostgreSQL be the best way to go
> here? If so, should I use levenshtein from the contrib module or
> install the pg_similarity package? Has anyone tried both implementations?

Another option is the pg_trgm module
(http://www.postgresql.org/docs/9.1/static/pgtrgm.html). It works
very well on 9.1 and has the advantage of built-in GiST and GIN
index operator classes.
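
For anyone unfamiliar with trigram matching, here is a rough Python
sketch of the idea behind pg_trgm's similarity score. It is only an
approximation of the rules described in the module's documentation
(lowercase the text, split it into alphanumeric words, pad each word
with two leading spaces and one trailing space, then compare the sets
of three-character substrings). The function names trigrams() and
similarity() below are illustrative names for this sketch, not the
module's actual C implementation, and the word-splitting regex is an
assumption that ignores locale and non-ASCII handling.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text (approximation of pg_trgm's rules)."""
    grams = set()
    # Assumption: words are runs of ASCII letters and digits after lowercasing.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Each word gets two spaces prefixed and one space suffixed.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by total distinct trigrams, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Hypothetical business names, purely for illustration.
    print(similarity("Joe's Pizza", "Joes Pizza Inc"))
```

Unlike Levenshtein distance, this score depends on overlapping
fragments rather than on the number of single-character edits, so
reordered or extra words hurt it less. In the database itself you
would use the module's own similarity() function and % operator
rather than anything like the code above.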

Can't speak to pg_similarity; I haven't used it.

merlin
