| From: | "Mark Woodward" <pgsql(at)mohawksoft(dot)com> |
|---|---|
| To: | "Mark Woodward" <pgsql(at)mohawksoft(dot)com> |
| Cc: | "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: String Similarity |
| Date: | 2006-05-20 13:12:21 |
| Message-ID: | 18626.24.91.171.78.1148130741.squirrel@mail.mohawksoft.com |
| Lists: | pgsql-hackers |
> What I was hoping someone had was a function that could find the substring
> runs in something less than a strlen1*strlen2 number of operations and a
> numerically sane way of representing the similarity or difference.
Actually, it is more like strlen1*strlen2*N, where N is the number of valid
runs.
Unless someone has a GREAT algorithm, I think it will always be at least
strlen1*strlen2; the amount of processing for N is the question. Is N *
(strlen1*strlen2) less than sorting an array of N elements, then scanning
through those elements and eliminating duplicate character matches?
Depending on the max value of N, I could save all the runs, sort by max
length, then exclude based on overlap, but it isn't clear that this is a
performance win unless the strings are long. Even then, I'm not completely
convinced, as N still has some strlen ramifications for removing
duplicates.
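For illustration, the approach described above (find all maximal common
runs in O(strlen1*strlen2), sort by length, then greedily exclude runs
that overlap already-matched characters) can be sketched roughly like
this. This is only a sketch of the idea under discussion, not the actual
implementation; the function name, the normalization 2*matched/(n+m), and
the choice to skip a run entirely on any overlap are all my assumptions:

```python
def run_similarity(a, b):
    """Hypothetical sketch: similarity from non-overlapping common runs."""
    n, m = len(a), len(b)
    # run_len[i][j] = length of the common run ending at a[i-1], b[j-1];
    # filling it is the strlen1*strlen2 part that seems hard to avoid.
    run_len = [[0] * (m + 1) for _ in range(n + 1)]
    runs = []  # (length, start in a, start in b) of each maximal run
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                run_len[i][j] = run_len[i - 1][j - 1] + 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            length = run_len[i][j]
            # a run is maximal when it cannot be extended to the right
            if length and (i == n or j == m or a[i] != b[j]):
                runs.append((length, i - length, j - length))
    # Sort by max length, then exclude based on overlap: a run is skipped
    # entirely if any of its characters were matched by a longer run
    # (a simplification; one could instead keep its unmatched remainder).
    runs.sort(reverse=True)
    used_a, used_b = [False] * n, [False] * m
    matched = 0
    for length, sa, sb in runs:
        if any(used_a[sa:sa + length]) or any(used_b[sb:sb + length]):
            continue
        for k in range(length):
            used_a[sa + k] = used_b[sb + k] = True
        matched += length
    # Normalize so identical strings score 1.0 and disjoint ones 0.0.
    return 2.0 * matched / (n + m) if n + m else 1.0
```

The sort makes the exclusion pass itself cheap, but building the runs
list is still quadratic in the string lengths, which is exactly the
trade-off questioned above.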