Quick Links

Re: multibyte charater set in levenshtein function

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: multibyte charater set in levenshtein function
Date:	2010-06-06 20:00:08
Message-ID:	AANLkTinEIfNbX4cZx5GfH2iHbyRIAcMfCsx6hlm5QIj7@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hello Hackers!

I have extended my patch by introducing levenshtein_less_equal function.
This function have additional argument max_d and stops calculating when
distance exceeds max_d. With low values of max_d function works much faster
than original one.

The example of original levenshtein function usage:

test=# explain analyze select word, levenshtein(word, 'consistent') as dist
from words where levenshtein(word, 'consistent') <= 2 order by dist;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------
Sort (cost=2779.13..2830.38 rows=20502 width=8) (actual
time=203.652..203.658 rows=5 loops=1)
Sort Key: (levenshtein(word, 'consistent'::text))
Sort Method: quicksort Memory: 25kB
-> Seq Scan on words (cost=0.00..1310.83 rows=20502 width=8) (actual
time=19.019..203.601 rows=5 loops=1)
Filter: (levenshtein(word, 'consistent'::text) <= 2)
Total runtime: 203.723 ms
(6 rows)

Example of levenshtein_less_equal usage in this case:

test=# explain analyze select word, levenshtein_less_equal(word,
'consistent', 2) as dist from words where levenshtein_less_equal(word,
'consistent', 2) <= 2 order by dist;
QUERY PLAN

-------------------------------------------------------------------------------------------------------------
Sort (cost=2779.13..2830.38 rows=20502 width=8) (actual
time=42.198..42.203 rows=5 loops=1)
Sort Key: (levenshtein_less_equal(word, 'consistent'::text, 2))
Sort Method: quicksort Memory: 25kB
-> Seq Scan on words (cost=0.00..1310.83 rows=20502 width=8) (actual
time=5.391..42.143 rows=5 loops=1)
Filter: (levenshtein_less_equal(word, 'consistent'::text, 2) <= 2)
Total runtime: 42.292 ms
(6 rows)

In the example above levenshtein_less_equal works about 5 times faster.

With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
fuzzystrmatch-0.3.diff.gz	application/x-gzip	4.4 KB

In response to

Re: multibyte charater set in levenshtein function at 2010-05-12 20:13:58 from Alexander Korotkov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alexander Korotkov	2010-06-06 20:04:10	Using multidimensional indexes in ordinal queries
Previous Message	Alexander Korotkov	2010-06-06 14:24:03	Parameters of GiST indexes