From: | fork <forkandwait(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Generalized edit function? |
Date: | 2011-02-26 20:30:19 |
Message-ID: | loom.20110226T211748-290@post.gmane.org |
Lists: | pgsql-hackers |
Hi hackers,
I am interested in extending Postgres with a "generalized edit function" like
SAS's "compged"[1], which is basically Levenshtein distance with transpositions
(ab <-> ba) and LOTS of different weights for certain operations (like inserting
a blank versus deleting from the end versus inserting a regular character).
Compged seems to work really well for us when trying to match addresses (MUCH
better than pure Levenshtein), and it would be a great tool for data miners.
I have a number of questions:
1. Does anybody else care? I would love to see this in contrib, but if the
chances are slim, then I would like to know that too.
2. Has anybody else done something like this and can share ideas or source? It
seems to me that the code will have to be a mess of pointers and indexes, but if
there is some theory that simplifies it, I haven't heard about it. (Levenshtein
without transpositions is theoretically clean, but I think that supporting
transpositions means we have to look ahead 2 chars and lose all the nice dynamic
programming stuff.)
3. I will probably implement this for ASCII characters only -- if anyone has any
thoughts on other encodings, please share.
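On question 2, one known simplification may apply: if transpositions are restricted to adjacent swaps (the variant usually called "optimal string alignment"), the usual dynamic programming table still works, with each cell also looking back two rows and two columns. Below is a minimal sketch of that idea with per-operation weights. The `Costs` class, the function name `weighted_osa`, and every weight value are hypothetical and chosen only for illustration; they are not the compged costs, and this covers only a few of the operation types compged distinguishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    # Hypothetical weights for illustration only; NOT the SAS compged values.
    insert: int = 100
    insert_blank: int = 10
    delete: int = 100
    delete_blank: int = 10
    replace: int = 100
    swap: int = 20


def weighted_osa(a: str, b: str, c: Costs = Costs()) -> int:
    """Cost of turning a into b with insert, delete, replace, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]

    # First column: delete every character of a.
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + (c.delete_blank if a[i - 1] == " " else c.delete)
    # First row: insert every character of b.
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + (c.insert_blank if b[j - 1] == " " else c.insert)

    for i in range(1, rows):
        for j in range(1, cols):
            x, y = a[i - 1], b[j - 1]
            best = min(
                d[i - 1][j] + (c.delete_blank if x == " " else c.delete),
                d[i][j - 1] + (c.insert_blank if y == " " else c.insert),
                d[i - 1][j - 1] + (0 if x == y else c.replace),
            )
            # Adjacent transposition: look back two cells instead of ahead.
            if i > 1 and j > 1 and x == b[j - 2] and a[i - 2] == y:
                best = min(best, d[i - 2][j - 2] + c.swap)
            d[i][j] = best
    return d[-1][-1]


if __name__ == "__main__":
    print(weighted_osa("ab", "ba"))
    print(weighted_osa("main st", "mainst"))
```

With these made-up weights, swapping two adjacent characters or dropping a blank is much cheaper than replacing a regular character, which is the kind of behavior that seems useful for address matching. Since Python strings are sequences of code points, the sketch is not limited to ASCII; a C port would presumably need to decode multibyte input into characters before filling the table.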
Thanks for everyone's time. I will try to implement a command-line version and
put it on pastebin for people to look at while I port it to the Postgres
environment.
[1] http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm