From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | like/ilike improvements |
Date: | 2007-05-22 15:58:33 |
Message-ID: | 46531329.4000802@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers pgsql-patches |
Starting from a review of a patch from Itagaki Takahiro to improve LIKE
performance for UTF8-encoded databases, I have been working on improving
both efficiency of the LIKE/ILIKE code and the code quality.
The main efficiency improvement comes from some fairly tricky analysis
and discussion on -patches. Essentially there are two calls that we make
to advance the text and pattern cursors: NextByte and NextChar. In the
case of single byte charsets these are in fact the same thing, but in
multi byte charsets they are obviously not, and in that case NextChar is
a lot more expensive. It turns out (according to the analysis) that the
only time we actually need to use NextChar is when we are matching an
"_" in a like/ilike pattern. It also turns out that there are some
comparison tests that we can hoist out of a loop and thus avoid
repeating over and over. Also, some calls can be marked "inline" to
improve efficiency. Finally, the special case of computing lower(x) on
the fly for ILIKE comparisons on single byte charset strings turns out
to have the potential to call lower() O(n^2) times, so it has been
removed and we now treat foo ILIKE bar as lower(foo) LIKE lower(bar) for
all charsets uniformly. There will be cases where this approach wins and
cases where it loses, but the wins are potentially dramatic, whereas the
losses should be mild.
The current state of this work is at
http://archives.postgresql.org/pgsql-patches/2007-05/msg00385.php
I've been testing it using a set of 5m rows of random Latin1 data - each
row is between 100 and 400 chars long, and 20% of them (roughly) have
the string "foo" randomly located within them. The test platform is
gcc/fc6/AMD64.
I have loaded the data into both Latin1 and UTF8 encoded databases. (I'm
not sure if there are other multibyte charsets that are compatible with
Latin1 client encoding). My test is essentially:
select count(*) from footable where t like '%_foo%';
select count(*) from footable where t ilike '%_foo%';
select count(*) from footable where t like '%foo%';
select count(*) from footable where t ilike '%foo%';
Note that the "%_" case is probably the worst for these changes, since
it involves lots of calls to NextChar() (see above).
The multibyte results show significant improvement. The results are
about flat or a slight improvement for the singlebyte cases. I'll post
some numbers on this shortly.
But before I commit this I'd appreciate seeing some more testing, both
for correctness and performance.
cheers
andrew
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2007-05-22 16:12:51 | Re: like/ilike improvements |
Previous Message | Tom Lane | 2007-05-22 15:53:06 | Re: Inconsistant SQL results - Suspected error with query planing or query optimisation. |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2007-05-22 16:12:51 | Re: like/ilike improvements |
Previous Message | Alvaro Herrera | 2007-05-22 15:48:38 | Re: CREATE TABLE LIKE INCLUDING INDEXES support |