From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Bizarre behavior of \w in a regular expression bracket construct |
Date: | 2021-02-24 16:03:39 |
Message-ID: | 04845093-8b64-4293-87a8-42294b5493ac@www.fastmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Feb 24, 2021, at 16:23, I wrote:
>I will hopefully be able to provide a similar analysis of \D soon,
>but wanted to send this in the meantime.
CREATE TABLE "\D" AS SELECT * FROM regex_tests WHERE processed AND error_pg IS NULL AND pattern LIKE '%\\D%';
SELECT 67558
CREATE TABLE "\D diff" AS SELECT *, regexp_match(subject, '('||pattern||')', 'n') AS captured_pg_0001 FROM "\D" WHERE captured_pg IS DISTINCT FROM regexp_match(subject, '('||pattern||')', 'n');
SELECT 12
SELECT COUNT(*), pattern FROM "\D diff" GROUP BY 2 ORDER BY 1 DESC;
count | pattern
-------+----------
11 | \D
1 | [\D|\d]*
(2 rows)
Pattern 1: \D
============
This pattern is used to find the first decimal separator, normally dot (.):
SELECT subject FROM regex_tests WHERE pattern = '\D' ORDER BY RANDOM() LIMIT 10;
subject
---------------------------
1.11.00.24975645674952163
1.11.30.6944442955860683
1.12.40.38502468714280424
3.5.10.9407443094500285
1.12.40.34334381021879845
2.0.20.5175496920692813
1.8.30.09144561055484002
3.4.10.6083619758942858
3.5.10.15406771889459425
2.0.00.6309370335082272
(10 rows)
We can see how this works in almost all cases:
SELECT captured_pg, captured_v8, count(*) from regex_tests where pattern = '\D' GROUP BY 1,2 ORDER BY 3 DESC LIMIT 3;
captured_pg | captured_v8 | count
-------------+-------------+-------
{.} | {.} | 66797
| | 103
{-} | {-} | 64
(10 rows)
If we take a look at the diffs found,
all such cases have a subjects that starts with newlines:
SELECT COUNT(*), subject ~ '^\n' AS starts_with_newline FROM "\D diff" WHERE pattern = '\D' GROUP BY 2;
count | starts_with_newline
-------+---------------------
11 | t
(1 row)
Naturally, if newlines are not included, then something else will match instead.
Now, if in these cases, ignoring the newline(s) and instead proceeding
to match the first non-digit non-newline, maybe we wound find a dot (.)
like in the normal case? No, that is not the case. Instead, we will hit
some arbitrary blank space or tab:
SELECT convert_to(captured_pg[1],'utf8') AS "0001+0002", convert_to(captured_pg_0001[1],'utf8') AS "0001", COUNT(*) FROM "\D diff" WHERE pattern = '\D' GROUP BY 1,2;
0001+0002 | 0001 | count
-----------+------+-------
\x0a | \x09 | 3
\x0a | \x20 | 7
\x0a | | 1
(3 rows)
The last example where nothing at all matched, was due to the string only contained a single newline,
which couldn't be matched.
None of these outliners contain any decimal-looking-digit-sequences at all,
it's all just white space, one "€ EUR" text and some text that looks like
it's coming from some web shop's title:
SELECT ROW_NUMBER() OVER (), subject FROM "\D diff" WHERE pattern = '\D';
row_number | subject
------------+----------------------------------------------------------------
1 | +
| +
| +
|
2 | +
|
3 | +
|
4 | +
|
5 | +
| € EUR +
|
6 | +
|
7 | +
|
8 | +
|
9 | +
|
10 | +
| Dunjackor, duntäcken och dunkuddar | Joutsen Dunspecialist+
| +
| +
| +
| – Joutsen Sweden +
| +
|
11 | +
|
(11 rows)
My conclusion is all of these are nonsensical subjects when applied to the \D regex.
Out of the subjects with actual digit-sequences,
none of them starts with newlines,
so including newlines in \D wouldn't cause any effect.
I see no benefit, but also no harm, in including newlines.
Pattern 2: [\D|\d]*
===============
This looks similar to [\w\W], the author has probably not understood pipe ("|") is not needed in between bracket expression parts. The author's intention is probably to match everything in the string, like .*, but including newlines.
Patch 0002 therefore gets +1 due to this example.
===END OF PATTERNS===
My final conclusion is we should always include newlines in \D.
/Joel
From | Date | Subject | |
---|---|---|---|
Next Message | Alexey Lesovsky | 2021-02-24 16:15:14 | Re: Asynchronous and "direct" IO support for PostgreSQL. |
Previous Message | Jan Wieck | 2021-02-24 15:46:54 | Re: Extensibility of the PostgreSQL wire protocol |