From: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING |
Date: | 2014-08-25 14:41:19 |
Message-ID: | 53FB4B0F.9080502@vmware.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 07/12/2014 05:16 AM, Jeff Davis wrote:
> On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
>> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
>>> Attached is a small patch to $SUBJECT.
>>> In master, only single-byte characters are allowed as an escape. Of
>>> course, with the patch it must still be a single character, but it may
>>> be multi-byte.
>>
>> I'm concerned about the performance cost of this patch. Have you done
>> any measurements about what kind of overhead you are putting on the
>> inner loop of similar_escape?
>
> I didn't consider this very performance critical, because this is
> looping through the pattern, which I wouldn't expect to be a long
> string. On my machine using en_US.UTF-8, the difference is imperceptible
> for a SIMILAR TO ... ESCAPE query.
>
> I was able to see about a 2% increase in runtime when using the
> similar_escape function directly. I made a 10M tuple table and did:
>
> explain analyze
> select
> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
>
> which was the worst reasonable case I could think of.
Actually, that gets optimized to a constant in the planner:
postgres=# explain verbose select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
from t;
QUERY PLAN
--------------------------------------------------------------------------------
----------
Seq Scan on public.t (cost=0.00..144247.85 rows=9999985 width=0)
Output:
'^(?:ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ
)$'::text
Planning time: 0.033 ms
(3 rows)
With a working test case:
create table t (pattern text);
insert into t select
'ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ' from
generate_series(1, 1000000);
vacuum t;
explain (analyze) select similar_escape(pattern,'#') from t;
your patch seems to be about 2x-3x as slow as unpatched master. So this
needs some optimization. A couple of ideas:
1. If the escape string is in fact a single-byte character, you can
proceed with the loop just as it is today, without the pg_mblen calls.
2. Since pg_mblen() will always return an integer between 1-6, it would
probably be faster to replace the memcpy() and memcmp() calls with
simple for-loops iterating byte-by-byte.
In very brief testing, with the 1. change above, the performance with
this patch is back to what it's without the patch. See attached.
- Heikki
Attachment | Content-Type | Size |
---|---|---|
similar-escape-2.patch | text/x-diff | 3.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Heikki Linnakangas | 2014-08-25 14:51:51 | Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING |
Previous Message | Petr Jelinek | 2014-08-25 14:15:07 | Re: Built-in binning functions |