From: | Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp> |
---|---|
To: | ZeugswetterA(at)spardat(dot)at |
Cc: | tgl(at)sss(dot)pgh(dot)pa(dot)us, pgman(at)candle(dot)pha(dot)pa(dot)us, phede-ml(at)islande(dot)org, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Unicode combining characters |
Date: | 2001-10-04 02:16:42 |
Message-ID: | 20011004111642R.t-ishii@sra.co.jp |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers pgsql-patches |
> Ok. I ran the modified test (now the iteration is reduced to 100000 in
> liketest()). As you can see, there's huge difference. MB seems up to
> ~8 times slower:-< There seems some problems existing in the
> implementation. Considering REGEX is not so slow, maybe we should
> employ the same design as REGEX. i.e. using wide charcters, not
> multibyte streams...
>
> MB+LIKE
> Total runtime: 1321.58 msec
> Total runtime: 1718.03 msec
> Total runtime: 2519.97 msec
> Total runtime: 4187.05 msec
> Total runtime: 7629.24 msec
> Total runtime: 14456.45 msec
> Total runtime: 17320.14 msec
> Total runtime: 17323.65 msec
> Total runtime: 17321.51 msec
>
> noMB+LIKE
> Total runtime: 964.90 msec
> Total runtime: 993.09 msec
> Total runtime: 1057.40 msec
> Total runtime: 1192.68 msec
> Total runtime: 1494.59 msec
> Total runtime: 2078.75 msec
> Total runtime: 2328.77 msec
> Total runtime: 2326.38 msec
> Total runtime: 2330.53 msec
I did some trials with wide characters implementation and saw
virtually no improvement. My guess is the logic employed in LIKE is
too simple to hide the overhead of the multibyte and wide character
conversion. The reason why REGEX with MB is not so slow would be the
complexity of its logic, I think. As you can see in my previous
postings, $1 ~ $2 operation (this is logically same as a LIKE '%a%')
is, for example, almost 80 times slower than LIKE (remember that
likest() loops over 10 times more than regextest()).
So I decided to use a completely different approach. Now like has two
matching engines, one for single byte encodings (MatchText()), the
other is for multibyte ones (MBMatchText()). MatchText() is identical
to the non MB version of it, and virtually no performance penalty for
single byte encodings. MBMatchText() is for multibyte encodings and is
identical the one used in 7.1.
Here is the MB case result with SQL_ASCII encoding.
Total runtime: 901.69 msec
Total runtime: 939.08 msec
Total runtime: 993.60 msec
Total runtime: 1148.18 msec
Total runtime: 1434.92 msec
Total runtime: 2024.59 msec
Total runtime: 2288.50 msec
Total runtime: 2290.53 msec
Total runtime: 2316.00 msec
To accomplish this, I moved MatchText etc. to a separate file and now
like.c includes it *twice* (similar technique used in regexec()). This
makes like.o a little bit larger, but I believe this is worth for the
optimization.
--
Tatsuo Ishii
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2001-10-04 02:47:28 | Re: BUG: text(varchar) truncates at 31 bytes |
Previous Message | Laurette Cisneros | 2001-10-04 00:02:59 | Timestamp, fractional seconds problem |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2001-10-04 03:05:16 | Re: Unicode combining characters |
Previous Message | Bruce Momjian | 2001-10-04 00:54:57 | Re: Trailing semicolons in psql patch |