From: | "Rui Martins" <Rui(dot)Martins(at)pdmfc(dot)com> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | BUG #4044: Incorrect RegExp substring Output |
Date: | 2008-03-18 18:46:20 |
Message-ID: | 200803181846.m2IIkKwE078848@wwwmaster.postgresql.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
The following bug has been logged online:
Bug reference: 4044
Logged by: Rui Martins
Email address: Rui(dot)Martins(at)pdmfc(dot)com
PostgreSQL version: 8.3
Operating system: Win2000
Description: Incorrect RegExp substring Output
Details:
To setup for test, run these SQL commands
CREATE TABLE TEST_REGEXP
(
BedNo VARCHAR(20)
);
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '123' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '4325:1' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '2464M' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '5678M:2' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '453L:1' );
From the following link
http://www.postgresql.org/docs/8.3/static/functions-matching.html
We can read the following text:
The substring function with two parameters, substring(string from pattern),
provides extraction of a substring that matches a POSIX regular expression
pattern. It returns null if there is no match, otherwise the portion of the
text that matched the pattern. But if the pattern contains any parentheses,
the portion of the text that matched the first parenthesized subexpression
(the one whose left parenthesis comes first) is returned. You can put
parentheses around the whole expression if you want to use parentheses
within it without triggering this exception. If you need parentheses in the
pattern before the subexpression you want to extract, see the non-capturing
parentheses described below.
-- -- -- -- --
No suppose we want to split up the "BedNo" column, into its parts
(DoorNumber, RoomSize and BedNumber)
SELECT BedNo,
SUBSTRING( BedNo FROM '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' )
AS DoorNumber,
SUBSTRING( BedNo FROM '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' )
AS RoomSize,
SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
BedNumber
FROM TEST_REGEXP;
Or in it's other form
SELECT BedNo,
SUBSTRING( BedNo, '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
DoorNumber,
SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS
RoomSize,
SUBSTRING( BedNo, '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
BedNumber
FROM TEST_REGEXP;
Both will return the same result:
BedNo DoorNumber RoomSize BedNumber
123 123 123
4325:1 4325 :1
2464M 2464 M 2464M
5678M:2 5678 M :2
453L:1 453 L :1
NOTE: Tabs may note be seen correctly on a web interface.
This is clearly, not the expected result, in particular for BedNo IN (
'123', '2464M' )
The BedNumber returned in these cases is NOT logical!
However, the manual is NOT EXPLICIT in what happens, if the returned MATCHED
parentheses part is the equivalent of an empty string!
Although it states:
"But if the pattern contains any parentheses, the portion of the text that
matched the first parenthesized subexpression (the one whose left
parenthesis comes first) is returned."
Apparently, the function is returning the entire MATCHED string, instead of
just the parenthesized subexpression.
I would expect the result for BedNumber to be either NULL or the EMPTY
String, and the later seems more logical. But the documentation doesn't
state which should be returned!
Not withstanding, the expected result should be:
BedNo DoorNumber RoomSize BedNumber
123 123
4325:1 4325 :1
2464M 2464 M
5678M:2 5678 M :2
453L:1 453 L :1
NOTE: Tabs may note be seen correctly on a web interface.
-- Hack Note for this specific case !
For this specific case, we can do a hack, and change the RegExp for
BedNumber, like in the next SELECT.
SELECT BedNo,
SUBSTRING( BedNo FROM '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' )
AS DoorNumber,
SUBSTRING( BedNo FROM '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' )
AS RoomSize,
SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*:([[:digit:]]+)?$' ) AS
BedNumber
FROM TEST_REGEXP;
or alternatively
SELECT BedNo,
SUBSTRING( BedNo, '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
DoorNumber,
SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS
RoomSize,
SUBSTRING( BedNo, '^[[:digit:]]+[a-zA-Z]*:([[:digit:]]+)?$' ) AS
BedNumber
FROM TEST_REGEXP
This will return NULL when there is no BedNumber, by forcing the regExp to
fail the match.
But this only works, because the format uses ":" in that specific location,
if there was no ":" anywhere, we would NOT get away so easily.
Hope to have bee of help, in finding this documentation issue and
implementation bug.
From | Date | Subject | |
---|---|---|---|
Next Message | Dave Page | 2008-03-18 19:12:13 | Re: BUG #4041: error in Application Stack Builder |
Previous Message | Alvaro Herrera | 2008-03-18 18:24:57 | Re: BUG #4043: Unrecognized node type: using plpgsql |