From: | Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Fix quadratic performance of regexp match/split functions |
Date: | 2018-08-13 03:32:17 |
Message-ID: | 87pnyn55qh.fsf@news-spur.riddles.org.uk |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
While poking at that xml issue I tried to compare the memory usage of
the xml queries with vaguely comparable regexp_split_to_table queries,
and found that I could not do so; the regexp queries simply did not
complete in any sensible timeframe.
Investigation showed that in UTF8 (or other multibyte encodings), the
performance of regexp_matches and regexp_split_to_* is subject to
quadratic slowdowns thanks to the use of character offsets and calls to
text_substr, which must scan the string from the start each time to
count characters.
This patch changes that by keeping the wide-char copy of the string
available rather than freeing it, and converting the required substrings
back to the db encoding via a pre-allocated conversion buffer.
This gives noticable speedups on even very small strings (all timings
below are with -O2 --disable-cassert):
select count(*)
from (select 'aaaaa,aaaaa,aaaaa'::text as content
from generate_series(1,1000000) i offset 0) s,
regexp_split_to_table(content, ',') r;
-- ~10% faster with patch: 2.8 s -> 2.5 s
but on strings of even modest size (~1kb) the improvement is vast:
select count(*)
from (select repeat(repeat('a',10) || ',', 100+(i*0))::text
as content -- 1100 bytes, 101 matches
from generate_series(1,100000) i offset 0) s,
regexp_split_to_table(content, ',') r;
-- over 8 times faster: 51.8 sec -> 6.3 sec
and it only gets bigger:
select count(*)
from regexp_split_to_table(repeat('aaa,',10000), ',') r; -- 40KB
-- 270 times faster: 1628ms -> 6ms
This patch passes regression but I haven't yet tested it on complex
multibyte data (or non-UTF8 encodings but that shouldn't matter).
Comments?
--
Andrew (irc:RhodiumToad)
Attachment | Content-Type | Size |
---|---|---|
qregex.patch | text/x-patch | 8.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2018-08-13 03:52:20 | Re: Explain buffers wrong counter with parallel plans |
Previous Message | Robert Haas | 2018-08-13 03:03:31 | Re: Allowing printf("%m") only where it actually works |