Re: Merge rows based on Levenshtein distance

From: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Merge rows based on Levenshtein distance
Date: 2014-12-03 18:44:43
Message-ID: CAKFQuwYR0bAcgaMc-UthUSQ9UTX_TH43Dsq-xz1ZGHhhruEPTA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Dec 3, 2014 at 9:14 AM, pinker [via PostgreSQL] <
ml-node+s1045698n5829111h65(at)n5(dot)nabble(dot)com> wrote:

> There is nice extension in postgres: fuzzystrmatch
> <http://www.postgresql.org/docs/9.3/static/fuzzystrmatch.html> I have
> used to calculate the distance. From documetation:
>
> SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
>
> You can use it then with your group by query.
>
>
​Something like this - replace the substring(...) comparison with
legenshtein_less_equal(...)​ or whatever comparison you find applicable.

In the case below new groups are started whenever the first letter of the
value changes.

The first group would be NULL so I add a COALESCE() call to make it 0 -
subsequent groups start with 1 and increment properly.

WITH src (val) AS (
VALUES ('A1'::varchar),('A2'),('B1'),('B2'),('B3'),('C1'),('D1')
)
, grp AS (
SELECT val
, CASE WHEN
substring(val,1,1) <> substring(lag(val) OVER (ORDER BY
val),1,1)
THEN 1
ELSE NULL
END AS changed
, ROW_NUMBER() OVER (ORDER BY val) AS val_idx
FROM src
)
SELECT val, COALESCE(sum(changed) OVER (ORDER BY val_idx), 0) AS group_id
FROM grp
​;

David J.​

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829143.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Michael Nolan 2014-12-03 18:51:46 Re: Merge rows based on Levenshtein distance
Previous Message David G Johnston 2014-12-03 18:12:58 Re: Merge rows based on Levenshtein distance