| From: | Volkan YAZICI <yazicivo(at)ttmail(dot)com> |
|---|---|
| To: | Andreas <maps(dot)on(at)gmx(dot)net> |
| Cc: | pgsql-sql(at)postgresql(dot)org |
| Subject: | Re: How to find double entries |
| Date: | 2008-04-16 06:15:42 |
| Message-ID: | 87zlru72b5.fsf@alamut.mobiliz.com.tr |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-sql |
On Wed, 16 Apr 2008, Andreas <maps(dot)on(at)gmx(dot)net> writes:
> how can I find double entries in varchar columns where the content is
> not 100% identical because of a spelling error or the person
> considered it "looked nicer" that way?
>
> I'd like to identify and then merge records of e.g. 'google',
> gogle', 'guugle'
>
> Then I want to match abbrevations like 'A-Company Ltd.', 'a company
> ltd.', 'A-Company Limited'
>
> Is there a way to do this?
> It would be OK just to list candidats up to be manually checked
> afterwards.
You can try something similar to below example. (levenshtein(text, text)
function is supplied by fuzzystrmatch module.)
SELECT T1.col, T2.col
FROM tbl AS T1,
INNER JOIN tbl AS T2
ON T1.col <> T2.col AND
levenshtein(T1.col, T2.col) < (length(T1.col) * 0.5)
Regards.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Sumaya | 2008-04-16 06:59:46 | Multiple databases |
| Previous Message | Tena Sakai | 2008-04-16 05:05:58 | Re: How to find double entries |