From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Notes about fixing regexes and UTF-8 (yet again) |
Date: | 2012-02-19 04:16:53 |
Message-ID: | 12859.1329625013@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> In theory you can imagine a regular expression engine where these
> decisions can be postponed until we see the string we're matching
> against. IOW, your DFA ends up with state transitions for characters
> specifically named, plus a state transition for "anything else that's
> a letter", plus a state transition for "anything else not otherwise
> specified". Then you only need to test the letters that actually
> appear in the target string, rather than all of the ones that might
> appear there.
> But implementing that could be quite a lot of work.
Yeah, not to mention slow. The difficulty is overlapping sets of
characters. As a simple example, if your regex refers to 3, 7,
[[:digit:]], X, and [[:alnum:]], then you end up needing five distinct
"colors": 3, 7, X, all digits that aren't 3 or 7, all alphanumerics
that aren't any of the preceding. And state transitions for the digit
and alnum cases had better mention all and only the correct colors.
I've been tracing through the logic this evening, and it works pretty
simply given that all named character classes are immediately expanded
out to their component characters. If we are going to try to keep
the classes in some kind of symbolic form, it's a lot messier. In
particular, I think your sketch above would lead to having to test
every character against iswdigit and iswalnum at runtime, which would
be disastrous performancewise. I'd like to at least avoid that for the
shorter (and presumably more common) UTF8 codes.
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2012-02-19 04:17:51 | Re: Initial 9.2 pgbench write results |
Previous Message | Vik Reykja | 2012-02-19 04:11:42 | Re: Notes about fixing regexes and UTF-8 (yet again) |