From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Cc: | "Joel Jacobson" <joel(at)compiler(dot)org> |
Subject: | Bizarre behavior of \w in a regular expression bracket construct |
Date: | 2021-02-20 22:20:19 |
Message-ID: | 3220564.1613859619@sss.pgh.pa.us |
Lists: | pgsql-hackers |

Our documentation says specifically "A character class cannot be used
as an endpoint of a range." This should apply to the character class
shorthand escapes (\d and so on) too, and for the most part it does:

# select 'x' ~ '[\d-a]';
ERROR:  invalid regular expression: invalid character range

However, certain combinations involving \w don't throw any error:

# select 'x' ~ '[\w-a]';
 ?column?
----------
 t
(1 row)

while others do:

# select 'x' ~ '[\w-;]';
ERROR:  invalid regular expression: invalid character range

It turns out that what's happening here is that \w is being
macro-expanded into "[:alnum:]_" (see the brbackw[] constant
in regc_lex.c), so then we have

	select 'x' ~ '[[:alnum:]_-a]';

and that's valid as long as '_' is less than the trailing
range bound. The fact that we're using REG_ERANGE for both
"range syntax botch" and "range start is greater than range
end" helps to mask the fact that the wrong thing is happening,
i.e. my last example above is giving the right error string
for the wrong reason.

I thought of changing the expansion to "_[:alnum:]" but of
course that just moves the problem around: then some cases
with \w after a dash would be accepted when they shouldn't be.
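
To make the mechanism concrete, here is a toy Python model of the behavior described above. It is only a sketch and is not PostgreSQL's code: the names expand, tokenize, check and RangeError are made up for illustration, the \d and \s expansions are assumptions by analogy with the \w one quoted above, and the range check is simplified to a comparison of single characters. It does the textual substitution first and validates ranges afterwards, which is enough to reproduce both the accidental acceptance and the way the reordered expansion merely moves it.

```python
# Toy model only; NOT PostgreSQL's regex code.
EXPANSIONS = {
    r"\d": "[:digit:]",    # assumed, by analogy with \w
    r"\s": "[:space:]",    # assumed, by analogy with \w
    r"\w": "[:alnum:]_",   # the expansion quoted in the message
}


class RangeError(ValueError):
    """Stands in for REG_ERANGE ("invalid character range")."""


def expand(body, w_expansion="[:alnum:]_"):
    """Textually replace shorthand escapes, as a macro expansion would."""
    table = {**EXPANSIONS, r"\w": w_expansion}
    for esc, text in table.items():
        body = body.replace(esc, text)
    return body


def tokenize(body):
    """Split an expanded bracket body into ('class', name) and ('char', c)."""
    items = []
    i = 0
    while i < len(body):
        if body.startswith("[:", i):
            end = body.index(":]", i)
            items.append(("class", body[i + 2:end]))
            i = end + 2
        else:
            items.append(("char", body[i]))
            i += 1
    return items


def check(bracket, w_expansion="[:alnum:]_"):
    """Return the expanded body if the bracket is accepted, else raise."""
    items = tokenize(expand(bracket[1:-1], w_expansion))
    i = 0
    while i < len(items):
        # A '-' with an item on each side forms a range.
        if i + 2 < len(items) and items[i + 1] == ("char", "-"):
            lo, hi = items[i], items[i + 2]
            if lo[0] == "class" or hi[0] == "class":
                raise RangeError("class used as range endpoint")
            if lo[1] > hi[1]:
                raise RangeError("range start is greater than range end")
            i += 3
        else:
            i += 1
    return expand(bracket[1:-1], w_expansion)


if __name__ == "__main__":
    cases = [
        (r"[\d-a]", "[:alnum:]_"),   # rejected: class as endpoint
        (r"[\w-a]", "[:alnum:]_"),   # accepted: seen as the range '_'-'a'
        (r"[\w-;]", "[:alnum:]_"),   # rejected, but only because '_' > ';'
        (r"[+-\w]", "_[:alnum:]"),   # reordered expansion: now this slips through
    ]
    for bracket, w in cases:
        try:
            print(bracket, "->", check(bracket, w))
        except RangeError as err:
            print(bracket, "-> ERROR:", err)
```

In this model both failure modes raise the same exception type, mirroring the shared REG_ERANGE code; only the message text distinguishes "syntax botch" from "start greater than end".
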
I have a patch in progress that gets rid of the hokey macro
expansion implementation of \w and friends, and I noticed
this issue because it started to reject "[\w-_]", which our
existing code accepts. There's a bunch of examples like that
in Joel's JavaScript regex corpus. I suspect that JavaScript
is reading such cases as "\w plus the literal characters '-'
and '_'", but I'm not 100% sure of that.

Anyway, I don't see any non-invasive way to fix this in the
back branches, and I'm not sure that anyone would appreciate
our changing it in stable branches anyway. But I wanted to
document the issue for the record.

			regards, tom lane