From: | David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Merge rows based on Levenshtein distance |
Date: | 2014-12-03 18:44:43 |
Message-ID: | CAKFQuwYR0bAcgaMc-UthUSQ9UTX_TH43Dsq-xz1ZGHhhruEPTA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Wed, Dec 3, 2014 at 9:14 AM, pinker [via PostgreSQL] <
ml-node+s1045698n5829111h65(at)n5(dot)nabble(dot)com> wrote:
> There is nice extension in postgres: fuzzystrmatch
> <http://www.postgresql.org/docs/9.3/static/fuzzystrmatch.html> I have
> used to calculate the distance. From documetation:
>
> SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
>
> You can use it then with your group by query.
>
>
Something like this - replace the substring(...) comparison with
legenshtein_less_equal(...) or whatever comparison you find applicable.
In the case below new groups are started whenever the first letter of the
value changes.
The first group would be NULL so I add a COALESCE() call to make it 0 -
subsequent groups start with 1 and increment properly.
WITH src (val) AS (
VALUES ('A1'::varchar),('A2'),('B1'),('B2'),('B3'),('C1'),('D1')
)
, grp AS (
SELECT val
, CASE WHEN
substring(val,1,1) <> substring(lag(val) OVER (ORDER BY
val),1,1)
THEN 1
ELSE NULL
END AS changed
, ROW_NUMBER() OVER (ORDER BY val) AS val_idx
FROM src
)
SELECT val, COALESCE(sum(changed) OVER (ORDER BY val_idx), 0) AS group_id
FROM grp
;
David J.
--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829143.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Nolan | 2014-12-03 18:51:46 | Re: Merge rows based on Levenshtein distance |
Previous Message | David G Johnston | 2014-12-03 18:12:58 | Re: Merge rows based on Levenshtein distance |