From: | John Naylor <jcnaylor(at)gmail(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Speeding up text_position_next with multibyte encodings |
Date: | 2018-12-14 18:20:46 |
Message-ID: | CAJVSVGUVe9G6yUb3F-YrC_myrj_nqhagYrpmP3OT8sUjzvAuaA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 11/30/18, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> Unfortunately, patch doesn't compile anymore due:
>
> varlena.c: In function ‘text_position_next_internal’:
> varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this
> function)
> Assert(start_ptr >= haystack && start_ptr <= haystack_end);
>
> Could you please send an updated version? For now I'm moving it to the next
> CF.
I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.
On 10/19/18, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> This is a win in most cases. One case is slower: calling position() with
> a large haystack input, where the match is near the end of the string.
I wanted to test this case in detail, so I ran the attached script,
which runs position() in three scenarios:
short: 1 character needle at the end of a ~1000 character haystack,
repeated 1 million times
medium: 50 character needle at the end of a ~1 million character
haystack, repeated 1 million times
long: 250 character needle at the end of an ~80 million character
haystack (~230MB, comfortably below the 256MB limit Heikki reported),
done just one time
I took the average of 5 runs using Korean characters in both UTF-8
(3-byte) and EUC-KR (2-byte) encodings.
UTF-8
length master patch
short 2.26s 2.51s
med 2.28s 2.54s
long 3.07s 3.11s
EUC-KR
length master patch
short 2.29s 2.37s
med 2.29s 2.36s
long 1.75s 1.71s
With UTF-8, the patch is 11-12% slower on short and medium strings,
and about the same on long strings. With EUC-KR, the patch is about 3%
slower on short and medium strings, and 2-3% faster on long strings.
It seems the worst case is not that bad, and could be optimized, as
Heikki said.
-John Naylor
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Use-single-byte-Boyer-Moore-Horspool-search-even-.patch | text/x-patch | 24.6 KB |
single-byte-bmh.sql | application/sql | 569 bytes |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2018-12-14 18:25:29 | Re: Ryu floating point output patch |
Previous Message | Robert Haas | 2018-12-14 18:16:30 | Re: removal of dangling temp tables |