From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
Cc: | Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: multibyte charater set in levenshtein function |
Date: | 2010-08-02 14:48:37 |
Message-ID: | AANLkTim0ZZs8zp6DM7qowV4jwyYhy1Dy+_fN6ApEWBf5@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
2010/8/2 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
> On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I reviewed this code in a fair amount of detail today and ended up
>> rewriting it. In general terms, it's best to avoid changing things
>> that are not relevant to the central purpose of the patch. This patch
>> randomly adds a whole bunch of whitespace and removes a whole bunch of
>> comments which I find useful and do not want to see removed. It took
>> me about an hour to reverse out all of those changes, which then let
>> me get down to the business of actually analyzing the code.
> I'm sorry. This is my first patch, I will be more accurate next time. But, I
> think there is no unified opinion about some changes like replacement "!m"
> with "m==0".
Sure; if that were the only such change, I wouldn't have mentioned it.
> I think this line is not correct:
> "if (m != s_bytes && n != t_bytes)"
Good catch, fixed in the attached version.
>> The attached patch takes the approach of using a fast-path for just
>> the innermost loop when neither string contains any multi-byte
>> characters (regardless of whether or not the encoding allows
>> multi-byte characters). It's still slower than CVS HEAD, but for
>> strings containing no multi-byte characters it's only marginally, if
>> at all, slower than previous releases, because 9.0 doesn't contain
>> your fix to avoid the extra string copy; and it's faster than your
>> implementation even when multi-byte characters ARE present. It is
>> slightly slower on single-byte encoding databases (I tested
>> SQL_ASCII), but still faster than previous releases. It's also quite
>> a bit less code duplication.
>
> Can I look at the test which shows that your implementation is faster than
> my when multi-byte characters are present? My tests show reverse. For
> example, with russian dictionary of openoffice:
Hmm, with the correction you mention above, I'm getting about the same
results with multi-byte characters (but faster with only single-byte
characters). I did this:
CREATE TABLE words (a text not null primary key);
COPY words from '/usr/share/dict/words';
SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;
With the updated patch attached below, these ran about 102 ms, 222 ms,
129 ms. With your patch, I get about 134 ms, 222 ms, 130 ms. Perhaps
this is because I only have multi-byte characters in one of the
inputs, not both. I don't have a dictionary handy with multi-byte
words in it. Can you send me a file?
> I think that the cause of slow down slowdown is memcmp function. This
> function is very fast for long parts of memory, but of few bytes inline
> function is faster, because compiler have more freedom for optimization.
> After replacing memcmp with inline function like following in your
> implementation:
Yeah, that does seem to be slightly better. I've done a version of
this in the attached patch.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Attachment | Content-Type | Size |
---|---|---|
mb_levenshtein_rmh-v2.patch | application/octet-stream | 5.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Marc Cousin | 2010-08-02 15:27:07 | Re: lock_timeout GUC patch - Review |
Previous Message | Yeb Havinga | 2010-08-02 14:47:29 | Re: patch for check constraints using multiple inheritance |