From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Notes about fixing regexes and UTF-8 (yet again) |
Date: | 2012-02-17 23:06:19 |
Message-ID: | 6555.1329519979@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Before going much further with this, we should probably do some timings
>> of 64K calls of iswupper and friends, just to see how bad a dumb
>> implementation will be.
> Can't hurt.
The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503)
running Fedora 16 in en_US.utf8 locale, is that 64K iterations of
pg_wc_isalpha or sibling functions requires a shade under 2ms.
So this definitely justifies caching the values to avoid computing
them more than once per session, but I'm not convinced there are
grounds for trying harder than that.
BTW, I am also a bit surprised to find out that this locale considers
48342 of those characters to satisfy isalpha(). Seems like a heck of
a lot. But anyway we can forget my idea of trying to save work by
incorporating a-priori assumptions about which Unicode codepoints are
which --- it'll be faster to just iterate through them all, at least
for that case. Maybe we should hard-wire some cases like digits, not
sure.
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2012-02-17 23:39:14 | Re: elog and MemoryContextSwitchTo |
Previous Message | Gaetano Mendola | 2012-02-17 22:45:02 | elog and MemoryContextSwitchTo |