Re: multibyte charater set in levenshtein function

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte charater set in levenshtein function
Date: 2010-07-21 18:25:47
Message-ID: AANLkTi=Jhb5Ke4ST1HRQjYfsDPQypbtEPQfDNxtGvC4P@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> This patch still needs some work.  It includes a bunch of stylistic
>> changes that aren't relevant to the purpose of the patch.  There's no
>> reason that I can see to change the existing levenshtein_internal
>> function to take text arguments instead of char *, or to change !m to
>> m == 0 in existing code, or to change the whitespace in the comments
>> of that function.  All of those need to be reverted before we can
>> consider committing this.
>
> I changed arguments of function from char * to text * in order to avoid
> text_to_cstring call.

*scratches head* Aren't you just moving the same call to a different place?

> Same benefit can be achived by replacing char * with
> char * and length.
> I changed !m to m == 0 because Itagaki asked me to make it conforming coding
> style. Do you think there is no reason to fix coding style in existing
> code?

Yeah, we usually try to avoid changing that sort of thing in existing
code, unless there's a very good reason.

>> There is a huge amount of duplicated code here.  I think it makes
>> sense to have the multibyte version of the function be separate, but
>> perhaps we can merge the less-than-or-equal to bits  into the main
>> code, so that we only have two copies instead of four.  Perhaps we
>> can't just add a max_d argument max_distance to levenshtein_internal;
>> and if this value is >=0 then it represents the max allowable
>> distance, but if it is <0 then there is no limit.  Sure, that might
>> slow down the existing code a bit, but it might not be significant.
>> I'd at least like to see some numbers showing that it is significant
>> before we go to this much trouble.
>
> In these case we should add many checks of max_d in levenshtein_internal
> function which make code more complex.

When you say "many" checks, how many?

> Actually, we can merge all four functions into one function. But such
> function will have many checks about multibyte encoding and max_d. So, I see
> four cases here:
> 1) one function with checks for multibyte encoding and max_d
> 2) two functions with checks for multibyte encoding
> 3) two functions with checks for max_d
> 4) four separate functions
> If you prefer case number 3 you should argue your position little more.

I'm somewhat convinced that separating the multibyte case out has a
performance benefit both by intuition and because you posted some
numbers, but I haven't seen any argument for separating out the other
case, so I'm asking if you've checked and whether there is an effect
and whether it's significant. The default is always to try to avoid
maintaining multiple copies of substantially identical code, due to
the danger that a future patch might fail to update all of them and
thus introduce a bug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2010-07-21 18:28:05 Re: patch: to_string, to_array functions
Previous Message Pavel Stehule 2010-07-21 18:25:46 Re: patch: to_string, to_array functions