From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Speeding up text_position_next with multibyte encodings |
Date: | 2018-10-19 12:52:59 |
Message-ID: | 3173d989-bc1c-fc8a-3b69-f24246f73876@iki.fi |
Lists: | pgsql-hackers |
Attached is a patch to speed up text_position_setup/next(), in some
common cases with multibyte encodings.
text_position_next() uses the Boyer-Moore-Horspool search algorithm,
with a skip table. Currently, with a multi-byte encoding, we first
convert both input strings to arrays of wchars, and run the algorithm on
those arrays.
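For reference, here's a minimal, standalone sketch of Boyer-Moore-Horspool
over raw bytes, with the skip table. This is not the actual PostgreSQL code,
just an illustration of the algorithm being discussed (names made up):

#include <stddef.h>

/*
 * Boyer-Moore-Horspool search over raw bytes.  Returns the byte offset of
 * the first match, or -1 if there is none.
 */
static long
bmh_search(const unsigned char *haystack, size_t hlen,
           const unsigned char *needle, size_t nlen)
{
    size_t      skip[256];
    size_t      i;

    if (nlen == 0 || nlen > hlen)
        return -1;

    /* Skip table: how far we may shift the needle on a mismatch */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    for (i = 0; i < nlen - 1; i++)
        skip[needle[i]] = nlen - 1 - i;

    /* Align the needle's last byte at position i, compare backwards */
    i = nlen - 1;
    while (i < hlen)
    {
        size_t      j = nlen - 1;
        size_t      k = i;

        while (haystack[k] == needle[j])
        {
            if (j == 0)
                return (long) k;    /* match begins at byte offset k */
            j--;
            k--;
        }
        i += skip[haystack[i]];
    }
    return -1;                      /* no match */
}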
This patch removes the mb->wchar conversion, and instead runs the
single-byte version of the algorithm directly on the inputs, even with
multi-byte encodings. That avoids the mb->wchar conversion altogether
when there is no match. Even when there is a match, we don't need to
convert the whole input string to wchars: it's enough to count the
characters up to the match, using pg_mblen(). And when
text_position_setup/next() are used as part of the split_part() or
replace() functions, we're not even interested in the character
position, so we can skip that too.
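To make that counting step concrete, here's a sketch of how the byte
offset of a match can be turned into a character position. my_mblen()
below is a hand-rolled UTF-8 stand-in for pg_mblen(), just to keep the
example self-contained; it's not what the patch actually does:

#include <stddef.h>

/* Stand-in for pg_mblen(): byte length of the UTF-8 character at 's' */
static int
my_mblen(const unsigned char *s)
{
    if ((*s & 0x80) == 0)
        return 1;
    else if ((*s & 0xE0) == 0xC0)
        return 2;
    else if ((*s & 0xF0) == 0xE0)
        return 3;
    else if ((*s & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte, treat as one byte */
}

/*
 * Convert the byte offset of a match into a 1-based character position,
 * by walking the string one character at a time from the start.
 */
static size_t
char_position(const unsigned char *str, size_t match_offset)
{
    size_t      charpos = 0;
    size_t      bytepos = 0;

    while (bytepos < match_offset)
    {
        bytepos += my_mblen(str + bytepos);
        charpos++;
    }
    return charpos + 1;         /* position() is 1-based */
}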
Avoiding the large allocation is nice, too. That was actually why I
started to look into this: a customer complained that text_position
fails with strings larger than 256 MB.
Using the byte-oriented search on multibyte strings might return false
positives, though. To make that work, after finding a match, we verify
that the match doesn't fall in the middle of a multi-byte character.
However, as an important special case, that cannot happen with UTF-8,
because it has the property that the byte sequence of one character
never appears as part of another character. I think some other encodings
might have the same property, but I wasn't sure.
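For encodings without that property, the check can be done by walking
from the start of the haystack with pg_mblen() and seeing whether we
land exactly on the match offset. Roughly like this, reusing the
my_mblen() stand-in from the sketch above:

#include <stdbool.h>
#include <stddef.h>

static int  my_mblen(const unsigned char *s);   /* stand-in from above */

/*
 * Return true if 'match_offset' falls on a character boundary, i.e. the
 * byte-level match is genuine and not a false positive that starts in
 * the middle of a multi-byte character.
 */
static bool
match_on_char_boundary(const unsigned char *str, size_t match_offset)
{
    size_t      bytepos = 0;

    while (bytepos < match_offset)
        bytepos += my_mblen(str + bytepos);

    return bytepos == match_offset;
}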
This is a win in most cases. One case is slower: calling position() with
a large haystack input where the match is near the end of the string.
Counting the characters up to that point is slower than the mb->wchar
conversion. I think we could avoid that regression too, if we had a more
optimized function to count characters. Currently, the code calls
pg_mblen() in a loop, whereas pg_mb2wchar_with_len() does the equivalent
walk over all the characters in a single tight loop inside the function.
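Just to illustrate what I mean: for UTF-8, such a counting function
could be as simple as counting the lead bytes in a tight loop, with no
per-character function call. (Whether that's how we'd actually want to
structure it is another question.)

#include <stddef.h>

/*
 * Count UTF-8 characters in the first 'len' bytes of 'str', by counting
 * bytes that are not 10xxxxxx continuation bytes.
 */
static size_t
utf8_char_count(const unsigned char *str, size_t len)
{
    size_t      count = 0;
    size_t      i;

    for (i = 0; i < len; i++)
    {
        if ((str[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}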
Thoughts?
- Heikki
Attachment | Content-Type | Size |
---|---|---|
0001-Use-single-byte-Boyer-Moore-Horspool-search-even-wit.patch | text/x-patch | 24.6 KB |