Re: Merge rows based on Levenshtein distance

From: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Merge rows based on Levenshtein distance
Date: 2014-12-02 00:49:41
Message-ID: 1417481381890-5828847.post@n5.nabble.com
Lists: pgsql-general

mongoose wrote
> I am new to PostgreSQL and I have the following table:
>
> Name, City
> "Alex", "Washington"
> "Aleex1", "Washington"
> "Bob", "NYC"
> "Booob", "NYC"
>
> I want to "merge" similar rows based on levenshtein distance between names
> so that I have the following table:
>
> id, Name, City
> 1,"Alex", "Washington"
> 1,"Aleex1", "Washington"
> 2,"Bob", "NYC"
> 2,"Booob", "NYC"
>
> How could I do that on PostgreSQL? Is there an SQL command for this?
> Thanks

So you have a table of N names and you want to evaluate every pair of
distinct names (N*(N-1) ordered pairs with the cross join below) and then
use the output of the levenshtein calculation to group them together.

SELECT
  l_names.name_value,
  r_names.name_value,
  leven[...](l_names.name_value, r_names.name_value) AS pair_group
FROM table_of_names AS l_names
CROSS JOIN table_of_names AS r_names
WHERE l_names.name_value <> r_names.name_value
;
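
For a concrete version of the query above, here is a minimal sketch. It assumes the elided function is levenshtein() from the fuzzystrmatch contrib extension (the message does not spell the name out), and it uses an inline VALUES list holding the sample rows in place of a real table_of_names. The column names name_value and city and the output aliases are illustrative only.

```sql
-- Assumption: the fuzzystrmatch extension can be installed in this database;
-- it provides levenshtein(text, text).
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Stand-in for the real table, filled with the sample rows from the question.
WITH table_of_names (name_value, city) AS (
    VALUES ('Alex',   'Washington'),
           ('Aleex1', 'Washington'),
           ('Bob',    'NYC'),
           ('Booob',  'NYC')
)
SELECT l_names.name_value AS l_name,
       r_names.name_value AS r_name,
       levenshtein(l_names.name_value, r_names.name_value) AS distance
FROM table_of_names AS l_names
CROSS JOIN table_of_names AS r_names
-- "<" instead of "<>" keeps each unordered pair once (A,B but not B,A)
WHERE l_names.name_value < r_names.name_value
ORDER BY distance, l_name, r_name;
```

With the sample data, "Alex"/"Aleex1" and "Bob"/"Booob" each come out at distance 2, while the mixed pairs score higher, so a small cutoff on the distance would separate the two groups here. Real data will rarely be that clean.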

Feel free to add "group by city" or "WHERE substring(l_names.name_value, 1,
1) = substring(r_names.name_value, 1, 1)" since it seems you need more than
just a name-distance to generate the desired groups. Note that substring
positions start at 1 in PostgreSQL, so a start of 0 with a length of 1
yields an empty string. You'd likely want to add the same "substring" call
to the SELECT-list and "GROUP BY" clauses...
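
The extra filters suggested above could be combined with a distance cutoff roughly as follows. This is a sketch under assumptions: levenshtein() comes from the fuzzystrmatch extension, the sample rows stand in for the real table, and the cutoff of 2 is an arbitrary value picked to fit the sample data. The first-letter test uses substring(..., 1, 1) because substring positions are 1-based in PostgreSQL.

```sql
-- Assumption: fuzzystrmatch is available and provides levenshtein(text, text).
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Stand-in for the real table, filled with the sample rows from the question.
WITH table_of_names (name_value, city) AS (
    VALUES ('Alex',   'Washington'),
           ('Aleex1', 'Washington'),
           ('Bob',    'NYC'),
           ('Booob',  'NYC')
)
SELECT l_names.name_value AS l_name,
       r_names.name_value AS r_name,
       l_names.city,
       levenshtein(l_names.name_value, r_names.name_value) AS distance
FROM table_of_names AS l_names
JOIN table_of_names AS r_names
  ON l_names.city = r_names.city                 -- only compare within a city
 AND substring(l_names.name_value, 1, 1)
   = substring(r_names.name_value, 1, 1)         -- same first letter
 AND l_names.name_value < r_names.name_value     -- each unordered pair once
-- Hypothetical cutoff; tune it against your own data.
WHERE levenshtein(l_names.name_value, r_names.name_value) <= 2;
```

Restricting the join to the same city and first letter also cuts down the number of levenshtein() calls, which matters once the table grows, since the unrestricted cross join is quadratic in the row count. Turning the surviving pairs into a shared id per group is a separate step that this sketch does not attempt.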

David J.

