From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Andreas <maps(dot)on(at)gmx(dot)net>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Is there a similarity-function that minds national charsets?
Date: 2012-06-21 02:53:00
Message-ID: 4FE28C8C.9080300@ringerc.id.au
Lists: pgsql-sql
On 06/21/2012 12:30 AM, Andreas wrote:
> Hi,
>
> Is there a similarity-function that minds national charsets?
>
> Over here we've got some special cases that screw up the results on
> similarity().
>
> Our characters: ä, ö, ü, ß
> could as well be written as: ae, oe, ue, ss
>
> e.g.
>
> select similarity ( 'Müller', 'Mueller' )
> results to: 0.363636
>
> In normal cases everything below 0.5 would be too far apart to be
> considered a match.
That's not just charset-aware; that's looking for awareness of
language- and dialect-specific transliteration rules for representing
accented characters in 7-bit ASCII. My understanding is that these
rules and conventions vary and are specific to each language, or even
region.

tsearch2 has big language dictionaries to try to handle issues like
this (though I don't know about this issue specifically). It's possible
you could extend the tsearch2 dictionaries with synonyms, possibly
algorithmically generated.
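As a rough sketch of that synonym route (the dictionary name and file
are hypothetical; the file maps each transliterated form back to a
canonical spelling, one pair per line, e.g. "mueller müller"):

```sql
-- Hypothetical sketch: plug a synonym dictionary into text search.
-- Assumes a file german_translit.syn in the tsearch data directory,
-- possibly generated algorithmically from your word list.
CREATE TEXT SEARCH DICTIONARY german_translit (
    TEMPLATE = synonym,
    SYNONYMS = german_translit
);
```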
If you have what you consider to be an acceptable 1:1 translation rule,
you could build a functional index on it and test against that, eg:

CREATE INDEX blah ON thetable ( flatten_accent(target_column) );

SELECT similarity( flatten_accent('Müller'), flatten_accent(target_column) )
FROM thetable;
Note that the flatten_accent function must be IMMUTABLE: it can't access
or refer to data in other tables or columns, nor to settings (GUC
variables) that might change at runtime.
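For illustration only, a minimal flatten_accent covering just the four
German cases from the question (a real one would need the full
transliteration table for your locale):

```sql
-- Sketch: handles only ä/ö/ü/ß and their upper-case forms.
-- Declared IMMUTABLE and STRICT, so it's legal in an expression index.
CREATE OR REPLACE FUNCTION flatten_accent(txt text) RETURNS text AS $$
  SELECT replace(replace(replace(replace(replace(replace(replace(txt,
         'ä', 'ae'), 'ö', 'oe'), 'ü', 'ue'), 'ß', 'ss'),
         'Ä', 'Ae'), 'Ö', 'Oe'), 'Ü', 'Ue');
$$ LANGUAGE sql IMMUTABLE STRICT;

-- flatten_accent('Müller') => 'Mueller'
```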
--
Craig Ringer