Quick Links

Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Date:	2014-07-12 02:16:17
Message-ID:	1405131377.9081.244.camel@jeff-desktop
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > Attached is a small patch to $SUBJECT.
> > In master, only single-byte characters are allowed as an escape. Of
> > course, with the patch it must still be a single character, but it may
> > be multi-byte.
>
> I'm concerned about the performance cost of this patch. Have you done
> any measurements about what kind of overhead you are putting on the
> inner loop of similar_escape?

I didn't consider this very performance critical, because this is
looping through the pattern, which I wouldn't expect to be a long
string. On my machine using en_US.UTF-8, the difference is imperceptible
for a SIMILAR TO ... ESCAPE query.

I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:

explain analyze
select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;

which was the worst reasonable case I could think of. (It appears that
selecting from a table is faster than from generate_series. I'm curious
what you use when testing the performance of an individual function at
the SQL level.)

> At the very least, please don't call GetDatabaseEncoding() over again
> every single time through the inner loop. More generally, why do we
> need to use pg_encoding_verifymb? The input data is presumably validly
> encoded. ISTM the significantly cheaper pg_mblen() would be more
> appropriate.

Thank you. Using the non-verifying variants reduces the penalty in the
above test to 1%.

If needed, we could optimize further through code specialization, as
like_escape() does. Though I think like_escape() is specialized just
because MatchText() is specialized; and MatchText is specialized because
it operates on the actual data, not just the pattern. So I don't see a
reason to specialize similar_escape().

Regards,
Jeff Davis

Attachment	Content-Type	Size
similar-escape.patch	text/x-patch	3.6 KB

In response to

Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING at 2014-07-11 15:51:10 from Tom Lane

Responses

Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING at 2014-08-25 13:01:13 from Heikki Linnakangas
Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING at 2014-08-25 14:41:19 from Heikki Linnakangas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Noah Misch	2014-07-12 03:43:50	Re: Crash on backend exit w/ OpenLDAP [2.4.24, 2.4.31]
Previous Message	Robert Haas	2014-07-11 21:07:38	Re: proposal: rounding up time value less than its unit.