Re: multibyte character set in levenshtein function

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-21 18:47:21
Message-ID: AANLkTin8t2ZlF8AjAdQxa6-Wm9ZgiTinyfTEPkNwPgNt@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> *scratches head* Aren't you just moving the same call to a different
> place?
>
So, where can you find this different place? :) In this patch,
null-terminated strings are not used at all.
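
To illustrate the point (this is just a sketch, not the exact patch code;
the wrapper shape and the unit costs here are assumptions), the
SQL-callable wrapper can take the byte lengths straight from the varlena
headers, so the internal function never needs a terminating NUL:

Datum
levenshtein_less_equal(PG_FUNCTION_ARGS)
{
    text       *src = PG_GETARG_TEXT_PP(0);
    text       *dst = PG_GETARG_TEXT_PP(1);
    int         max_d = PG_GETARG_INT32(2);

    /*
     * VARDATA_ANY and VARSIZE_ANY_EXHDR give the payload pointer and its
     * byte length directly; no NUL-terminated copy is ever made.
     */
    PG_RETURN_INT32(levenshtein_less_equal_internal_mb(VARDATA_ANY(src),
                                                       VARDATA_ANY(dst),
                                                       VARSIZE_ANY_EXHDR(src),
                                                       VARSIZE_ANY_EXHDR(dst),
                                                       1, 1, 1, max_d));
}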

> Yeah, we usually try to avoid changing that sort of thing in existing
> code, unless there's a very good reason.
>
Ok.

> > In that case we would have to add many checks of max_d to the
> > levenshtein_internal function, which would make the code more complex.
>
> When you say "many" checks, how many?
>
> > Actually, we can merge all four functions into one function. But such a
> > function will have many checks for multibyte encoding and max_d. So, I
> > see four cases here:
> > 1) one function with checks for multibyte encoding and max_d
> > 2) two functions with checks for multibyte encoding
> > 3) two functions with checks for max_d
> > 4) four separate functions
> > If you prefer case number 3, you should argue your position a little
> > more.
>
> I'm somewhat convinced that separating the multibyte case out has a
> performance benefit both by intuition and because you posted some
> numbers, but I haven't seen any argument for separating out the other
> case, so I'm asking if you've checked and whether there is an effect
> and whether it's significant. The default is always to try to avoid
> maintaining multiple copies of substantially identical code, due to
> the danger that a future patch might fail to update all of them and
> thus introduce a bug.
>

I had tested it with a big value of max_d, and I thought it was evident
that a check for a negative value of max_d would not produce a significant
benefit. Anyway, I tried to add a check for negative max_d to the
levenshtein_less_equal_internal_mb function.

static int
levenshtein_less_equal_internal_mb(char *s, char *t, int s_len, int t_len,
                                   int ins_c, int del_c, int sub_c, int max_d)
{
    int         m,
                n;
    int        *prev;
    int        *curr;
    int         i,
                j;
    const char *x;
    const char *y;
    CharLengthAndOffset *lengths_and_offsets;
    int         y_char_len;
    int         curr_left,
                curr_right,
                prev_left,
                prev_right,
                d;
    int         delta,
                min_d;

    /*
     * For multibyte encodings the lengths must be counted in characters,
     * not bytes.
     */
    m = pg_mbstrlen_with_len(s, s_len);
    n = pg_mbstrlen_with_len(t, t_len);

    /*
     * We can transform an empty s into t with n insertions, or a non-empty
     * t into an empty s with m deletions.
     */
    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /*
     * The difference of the lengths gives a lower bound on the distance.
     */
    delta = m - n;
    if (delta > 0)
        min_d = delta * del_c;
    else if (delta < 0)
        min_d = -delta * ins_c;
    else
        min_d = 0;
    if (max_d >= 0 && min_d > max_d)
        return max_d + 1;

    /*
     * For security concerns, restrict excessive CPU+RAM usage. (This
     * implementation uses O(m) memory and has O(mn) complexity.)
     */
    if (m > MAX_LEVENSHTEIN_STRLEN ||
        n > MAX_LEVENSHTEIN_STRLEN)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("argument exceeds the maximum length of %d bytes",
                        MAX_LEVENSHTEIN_STRLEN)));

    /* One more cell for initialization column and row. */
    ++m;
    ++n;

    /*
     * Instead of building an (m+1)x(n+1) array, we'll use two different
     * arrays of size m+1 for storing accumulated values. At each step one
     * represents the "previous" row and one is the "current" row of the
     * notional large array.
     *
     * For multibyte encodings we also store the byte length and byte
     * offset of every character of the first string, in order to avoid a
     * great number of pg_mblen calls.
     */
    prev = (int *) palloc((2 * sizeof(int) + sizeof(CharLengthAndOffset)) * m);
    curr = prev + m;
    lengths_and_offsets = (CharLengthAndOffset *) (prev + 2 * m);
    lengths_and_offsets[0].offset = 0;
    for (i = 0, x = s; i < m - 1; i++)
    {
        lengths_and_offsets[i].length = pg_mblen(x);
        lengths_and_offsets[i + 1].offset = lengths_and_offsets[i].offset +
            lengths_and_offsets[i].length;
        x += lengths_and_offsets[i].length;
    }
    lengths_and_offsets[i].length = 0;

    /* Initialize the "previous" row to 0..cols */
    curr_left = 1;
    d = min_d;
    for (i = 0; i < delta; i++)
    {
        prev[i] = d;
    }
    curr_right = m;
    for (; i < m; i++)
    {
        prev[i] = d;
        d += (ins_c + del_c);
        if (max_d >= 0 && d > max_d)
        {
            curr_right = i;
            break;
        }
    }

    /*
     * The following optimizations are used:
     *
     * 1) The cells of the matrix store the minimal possible value of the
     * final distance (i.e., assuming all remaining characters match). A
     * movement away from the diagonal that contains the last cell
     * increases that value by ins_c + del_c; a movement back towards this
     * diagonal does not increase it.
     *
     * 2) [prev_left, prev_right] is the range of indexes of the previous
     * row whose values do not exceed max_d, so only the cells connected
     * with this range need to be calculated.
     *
     * For multibyte encodings we advance the x and y pointers by the
     * results of pg_mblen, and use char_cmp for character comparison.
     */
    for (y = t, j = 1; j < n; y += y_char_len, j++)
    {
        int        *temp;

        y_char_len = pg_mblen(y);

        if (max_d >= 0)
        {
            prev_left = curr_left;
            prev_right = curr_right;
            curr_left = -1;
        }
        else
        {
            prev_left = 1;
        }

        if (delta >= 0)
            curr[0] = min_d + j * (ins_c + del_c);
        else
        {
            if (j <= -delta)
                curr[0] = min_d;
            else
                curr[0] = min_d + (j + delta) * (ins_c + del_c);
        }

        for (i = prev_left, x = s + lengths_and_offsets[i - 1].offset;
             i < m;
             x += lengths_and_offsets[i - 1].length, i++)
        {
            if (max_d >= 0)
            {
                d = max_d + 1;

                if (i <= prev_right)
                    d = Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0), d);

                if (i == 1 || i > prev_left)
                {
                    d = Min(curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0), d);
                    d = Min(prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
                                                    y, y_char_len) ? 0 : sub_c), d);
                }

                curr[i] = d;

                if (curr_left == -1)
                    curr_left = i;
                curr_right = i;
                if (i > prev_right)
                    break;
            }
            else
            {
                d = Min(Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0),
                            curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0)),
                        prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
                                                y, y_char_len) ? 0 : sub_c));
                curr[i] = d;
            }
        }

        if (curr_left == -1)
            break;

        temp = curr;
        curr = prev;
        prev = temp;
    }

    /*
     * Because the final value was swapped from the previous row to the
     * current row, that's where we'll find it.
     */
    d = prev[m - 1];

    /*
     * If the last cell wasn't filled, return max_d + 1; otherwise return
     * the final value.
     */
    if (max_d >= 0 && (curr_left == -1 || curr_right < m - 1))
        return max_d + 1;
    else
        return d;
}
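
To make the cutoff idea easier to follow outside of the patch, here is a
deliberately simplified single-byte sketch with unit costs (all names and
the whole-row cutoff are mine, and max_d is assumed non-negative; the
patch instead tracks the narrower [prev_left, prev_right] band and
handles multibyte characters):

#include <stdlib.h>
#include <string.h>

static int
levenshtein_band_sketch(const char *s, int m, const char *t, int n, int max_d)
{
    int        *prev = malloc((m + 1) * sizeof(int));
    int        *curr = malloc((m + 1) * sizeof(int));
    int         i, j;
    int         result = max_d + 1;

    /* The length difference is a lower bound on the distance. */
    if (abs(m - n) <= max_d)
    {
        for (i = 0; i <= m; i++)
            prev[i] = i;

        for (j = 1; j <= n; j++)
        {
            int         row_min;

            curr[0] = j;
            row_min = curr[0];
            for (i = 1; i <= m; i++)
            {
                int         d = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);

                if (prev[i] + 1 < d)
                    d = prev[i] + 1;
                if (curr[i - 1] + 1 < d)
                    d = curr[i - 1] + 1;
                curr[i] = d;
                if (d < row_min)
                    row_min = d;
            }

            /*
             * Every cell of the next row is at least the minimum of the
             * current row, so once a whole row exceeds max_d the final
             * distance must exceed it too.
             */
            if (row_min > max_d)
                goto done;
            memcpy(prev, curr, (m + 1) * sizeof(int));
        }
        if (prev[m] <= max_d)
            result = prev[m];
    }
done:
    free(prev);
    free(curr);
    return result;
}

The principle is the same: as soon as no cell of a row can stay within
max_d, the answer is known to be greater than max_d and the remaining
rows can be skipped.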

I tested it with the american-english dictionary, which contains 98569 words.

test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
   sum
---------
 1074376
(1 row)

Time: 131,435 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer', 100)) from words;
   sum
---------
 1074376
(1 row)

Time: 221,078 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer', -1)) from words;
   sum
---------
 1074376
(1 row)

Time: 254,819 ms

The function with a negative value of max_d didn't turn out to be faster
than with just a big value of max_d.

----
With best regards,
Alexander Korotkov.
