From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
Cc: | Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: multibyte charater set in levenshtein function |
Date: | 2010-08-02 20:53:05 |
Message-ID: | AANLkTimZcAjwCJyRgwKaM3Ms_0ipO8NteVCnfgKCf7RL@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
2010/8/2 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
> The dump of the table with russian dictionary is in attachment.
>
> I use following tests:
> SELECT SUM(levenshtein(a, 'foo')) from words;
> SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
> SELECT SUM(levenshtein(a, 'ańs')) FROM words;
> SELECT SUM(levenshtein(a, 'foo')) from words2;
> SELECT SUM(levenshtein(a, 'дом')) FROM words2;
> SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;
>
> With your last version of patch results are:
> 63ms 94ms 61ms 100ms 121ms 217ms.
>
> But I found some other optimization. With this condition:
> if (x[x_char_len-1] == y[y_char_len-1] && x_char_len == y_char_len &&
> (x_char_len == 1 || char_same(x, y, x_char_len - 1)))
> results become:
> 58ms 96ms 63ms 102ms 107ms 171ms
>
> On single-byte characters results almost didn't change, but they come better
> with multi-byte characters. Generally, this improvement is because first
> bytes of different characters are frequently same (for example, almost all
> Cyrillic characters start from same byte, and I think that there is similar
> situation in other alphabets), but match of last bytes is just a
> coincidence.
Hey, that's really cool. It helps a lot here, too:
My previous patch, with your 5 test cases:
Time: 100.556 ms
Time: 205.254 ms
Time: 127.070 ms
Time: 77.734 ms
Time: 90.345 ms
Time: 166.954 ms
Your original patch, same 5 test cases:
Time: 131.489 ms
Time: 215.048 ms
Time: 125.471 ms
Time: 80.068 ms
Time: 87.110 ms
Time: 166.918 ms
My patch, with your proposed change to compare the last character
first, same 5 test cases:
Time: 96.489 ms
Time: 201.497 ms
Time: 121.876 ms
Time: 75.005 ms
Time: 80.219 ms
Time: 142.844 ms
Updated patch attached.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Attachment | Content-Type | Size |
---|---|---|
mb_levenshtein_rmh-v3.patch | application/octet-stream | 5.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Korotkov | 2010-08-02 21:07:31 | Re: multibyte charater set in levenshtein function |
Previous Message | Robert Haas | 2010-08-02 20:21:04 | Re: patch for check constraints using multiple inheritance |