From: | will trillich <will(at)serensoft(dot)com> |
---|---|
To: | Justin Clift <justin(at)postgresql(dot)org>, pgsql <pgsql-general(at)postgresql(dot)org> |
Subject: | Regular Expression INTRO [long] [OT] |
Date: | 2002-04-10 00:05:43 |
Message-ID: | 3CB38FDE.2DEA1F3F@serensoft.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
wow -- this was nearly a year ago and i still haven't
done it!
Justin Clift wrote:
> Would you mind taking a few moments and writing a guide on using Regex
> functions in PostgreSQL?
>
> This is to put on the techdocs.postgresql.org website so everyone in the
> PostgreSQL community will have access to it.
i'm posting to pgsql-general so the community
can help find holes and plug them...
here goes --
Finding data in PostgreSQL using LIKE or ~ (REGULAR EXPRESSIONS):
LIKE
====
You're probably familiar with the WHERE clause
of the SELECT statement --
SELECT somefield FROM table WHERE somefield = somevalue;
For example, to show all customers within the
post code "60201" you could try
SELECT * FROM cust WHERE zip = '60201';
But if you also have nine-digit zips stored in the zip
field (in the U.S.A. we have standard five-digit postal
codes and also a four-digit extension for further
geographic resolution) then this search will OMIT
customers having more-specific zips such as '60201-9876'
or '60201-1234'.
The Percent "%"
---------------
Try using LIKE instead of = (EQUALS):
SELECT * FROM cust WHERE zip LIKE '60201%';
With LIKE, your search values can use the percent (%)
to tell PostgreSQL that anything can occupy that spot
-- one character, twelve symbols or zero digits -- and
still satisfy the search.
On Unix or Linux, this is basically the same as the
ASTERISK (*) at the command line, when dealing with
file names:
# list all files whose names begin with '.bash'
ls .bash*
# list all files containing 'out' anywhere in the name
ls *out*
# list all file names ending with '.pl'
ls *.pl
# list file starting with 'proj', ending with '.c'
ls proj*.c
With PostgreSQL using the LIKE operator, use the
percent, instead:
-- list all customers within the 47610 postal code
SELECT * FROM cust WHERE zip LIKE '47610%';
-- display customers who have 'Corp' in their names
SELECT * FROM cust WHERE name LIKE '%Corp%';
-- show customers whose names end in 'LLC'
SELECT * FROM cust WHERE name LIKE '%LLC';
-- documents beginning with 'We', ending with 'rica'
SELECT * FROM doc WHERE contents LIKE 'We%rica';
Wherever the '%' appears (using the LIKE operator)
Postgres allows anything at all to match -- from a
lengthy string of text, to one single character, to
a zero-length string -- i.e. nothing at all.
...ILIKE 'A%Z'
-- matches 'AZ' and 'A to Z' and 'ABC2468XYZ' and 'A(at)$*Z'
The Underscore "_"
------------------
For the LIKE operator, the UNDERSCORE (_) takes on the same
meaning as the question mark does to shell operations for
Unix and Linux file names:
# list files starting with dot, followed by at least two chars
ls .??*
The underscore matches just one character exactly:
SELECT * FROM atable WHERE afield LIKE '_';
-- shows records where afield is exactly one character
-- omitting ones where it's blank or has two (or more)
-- characters in it)
You can restrict your searches to finding fields
of certain lengths this way:
SELECT * FROM cust WHERE zip LIKE '_____' OR zip LIKE '_________';
-- display all five- or nine-character zip codes
Combine it with percent to find fields over a certain length:
SELECT * FROM atable WHERE afield LIKE '___%';
-- display records where afield has three or more characters
Or you can accomodate some language idiosyncracies, as
well:
SELECT * FROM activity WHERE venue LIKE 'Theat__';
-- find u.k. 'theatre' and u.s. 'theater'
ILIKE
=====
But what if a customer's name is all capitals, as in
'THE ACME CORPORATION'? The "...LIKE '%Corp%'" won't
find it, because uppercase 'O' is not identical to
lowercase 'o'!
Finding text independent of uppercase/lowercase
is something we often need to do. And so we have
the ILIKE operator.
Use ILIKE instead of LIKE when your search should
ignore case -- the "I" stands for case-[I]nsensitive:
-- show all customers with 'corp' in the name
SELECT * FROM cust WHERE name ILIKE '%Corp%';
-- ignoring whether the field is upper- or lowercase
That'll find 'Bubba Gump Shrimp Corp' and 'ACME CORP'
and even 'Amalgamated Switch and Relay corporation'
because anything can ('%') precede or ('%') follow the
string 'Corp' which can be uppercase or lowercase.
But there's even more power in "regular expressions"!
See below...
Notes
-----
Always include some real data to search for!
-- show everything, doing a lot of unnecessary work:
SELECT * FROM atable WHERE afield LIKE '%';
It would be rather silly to make PostgreSQL compare
every record to see if it matched 'anything goes,
in "afield"'. Just say no.
Convert your spoken-language request to a LIKE
clause like this:
"afield ENDS WITH avalue"
reword => "afield starts with anything, ending with avalue"
sql => "... WHERE afield LIKE '%avalue'"
"afield STARTS WITH avalue"
reword => "afield starts with avalue, ending with anything"
sql => "... WHERE afield LIKE 'avalue%'"
See how that works?
"afield CONTAINS avalue"
sql => "... WHERE afield LIKE '%avalue%'"
If you're looking for something at the beginning of
a field, that's where your data goes, and you END with
the percent. If you're looking for something at the end
of a field, put your data there and allow anything at
the front by putting the percent there.
ALSO -- if your field is indexed, you'll defeat the index
unless you anchor your search to the beginning of the field.
-- find fields starting with 'something'
... afield LIKE 'something%'; -- uses afield's index
The example below isn't anchored at the start of the field,
so it does no good to use the index:
-- find fields containing 'pooh'
... afield LIKE '%pooh%'; -- can't use afield's index
Afield must contain 'pooh' BUT it could be anywhere in
the field. The index would be useless.
LIMITATION
==========
Okay. Let's say you have a full-name field that has
first, middle, and last name all in the one field.
Sometimes you'll have a middle name, sometimes
you won't. Sometimes the first name will be spelled
out, other times it'll be shortened or even abbreviated
to just the first initial.
Here's how you'd search that field for "Abraham Lincoln"
using the LIKE operator:
SELECT * FROM peron WHERE fullname LIKE 'A%Lincoln';
It'll find 'Abraham Lincoln' and 'A. Lincoln' and
'Abe Lincoln' because anything can go between the
initial 'A' and the ending 'Lincoln'.
But -- it'll also match 'Andrew M. Lincoln' and 'Abner
Sasquatch Lincoln' as well... and this is why some bright
souls created regular expressions!
REGULAR EXPRESSIONS
===================
A regular expression is a pattern to search for.
The structure of the pattern has to match a
rigid set of rules so that PostgreSQL will know
what you're trying to search for.
This introduction will barely scratch the surface,
but hopefully it'll get you started--
If you've ever tinkered with perl, you've probably
encountered regular expressions:
perl -ne 'print if /[A-Za-z]/;' somefile
Here, the "[A-Za-z]" is a regular expression.
It matches any line containing uppercase (A-Z)
or lowercase (a-z) letters. In this example, the
lines will come from the file "somefile" and
lines that match will be printed out.
Well, PostgreSQL can do that, too!
Example
-------
SELECT * FROM person
WHERE fullname ~ 'A(be|braham|\\.) *Lincoln';
The REGULAR EXPRESSION operator is the TILDE (~).
The value you're searching for then must be a valid
regular expression. In the above example, we're asking
PostgreSQL to display all rows containing
'A'
followed by EITHER
'be'
OR
'braham'
OR
'\\.' -- a dot
which is then followed by
' ' -- a space
'*' -- zero or more times
followed by
'Lincoln'
and that's all. Whew!
As you can guess, these will all successfully match:
'A. Lincoln'
'AbeLincoln' -- without any space at all
'Abraham Lincoln' -- with lots of spaces
and these won't match:
'A Lincoln' -- no dot, no 'braham', no 'be'
'Abe Gump Lincoln' -- nothing will match 'Gump'
Quoting "\\"
------------
So what's with the BACKSLASH in front of the dot? Well,
just as LIKE has the UNDERSCORE (_) to denote "any single
character", REGULAR EXPRESSIONS use the DOT (.) for
that very same purpose. So we have to "escape" the
dot to alter its normal interpretation, using the.
Note that string literals in PostgreSQL already use the
backslash, so you'll have to double any backslash
you use:
psql=# SELECT * FROM person
psql-# WHERE fullname ~ 'A(be|braham|\\.) *Lincoln'
psql-# -- a second backslash added
psql-# ;
Grouping "()"
-------------
How about the PARENTHESES () and the BAR |, hmm? Well,
that's how you can group permitted alternatives:
... ~ 'A(be|braham|\\.)'
That says that 'A' can be followed by any of the three
sub-expressions within the parentheses, namely 'be' or
'braham' or <dot>.
These all produce the same matches:
... ~ 'A(be|braham|\\.)'
... ~ 'Abe|Abraham|A\\.'
... ~ 'A(b(e|raham)|\\.)'
Let's dissect this last one -- it specifies that:
'A'
can be followed by either
'b' followed by some more stuff
or
<dot>
nestling into the details following the 'b' -- if
'A' is followed by 'b', then the 'b' must be followed
by either
'e' -- which matches 'Abe'
or
'raham' -- which matches 'Abraham'
Do you see how powerful this can be?
Character Classes "[]"
----------------------
So you're trying to find "McAndrews" -- but maybe it's
spelled with a lower-case "a" as in "Mcandrews", instead!
SELECT * FROM person
WHERE fullname ~ 'Mc[Aa]ndrews';
Using the BRACKETS tells the regular expression
parser that you're allowing any one of a whole class
of characters in that spot.
'[Aa]'
It specifies that you're looking for either an uppercase
'A' or a lowercase 'a' in that spot.
Here's how you can find fields containing vowels:
... ~ '[AEIOUaeiou]'
Search for fields containing lowercase letters:
... ~ '[abcdefghijklmnopqrstuvwxyz]'
Of course, there's a shortcut for specifying character
classes that cover a whole range:
... ~ '[a-z]' -- also matches any lowercase letter
Display fields that contain digits:
... ~ '[0-9]'
Here's an example using a standard U.S. phone pattern
(neglecting the area code for clarity and space):
SELECT * FROM cust
WHERE descr ~ '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]';
That will show customers whose "descr" field contains a
pattern of digits that looks like a U.S. phone number:
'[0-9][0-9][0-9]' -- three digits
followed by
'-' -- a hyphen
followed by
'[0-9][0-9][0-9][0-9]' -- four digits
As you can see, the hyphen doesn't do anything special
for a regular expression unless it's inside the square
brackets of a character class -- in which case it
means "anything between".
What if you want to allow a hyphen within a character class?
Simply make it the first character inside the brackets:
'[-.,_]'
That class will match any hyphen, dot, comma or underscore.
Bounds "{}"
-----------
Rather than having to specify each of those [0-9]
monstrosities every time, let's abbreviate. Instead
of:
'[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
we can just say
'[0-9]{3}-[0-9]{4}'.
The BOUNDS specification, inside braces, comes immediately
after whatever you want a certain number of. That is,
first you specify what you're looking for, then you specify
how many you need. (The default is {1,1} one, as you can tell:
one 'A' followed by one 'b' followed by...)
Inside the BRACES that specify your "bounds" you can
say you want exactly so many, as we did above (3 and
then 4) or you can give a low-to-high pair:
... ~ 'Z{3}' -- need three Z's
... ~ '@{5,}' -- five or more @'s
... ~ 'Q{2,5}' -- at least two, at most five, Q's
... ~ '(This|That){0,3}'
The last one allows for 'This' or 'That', anywhere from
zero to three times. So 'ThisThatThis' and '' will match!
Convenient, hmm?
And for very common cases, there are handy abbreviations:
'x{1,1}' 'x' one (the default)
'r{0,1}' 'r?' zero or one, i.e. 'x' is optional
'B{0,}' 'B*' zero or more
'z{1,}' 'z+' one or more
These three abbreviations will pop up often:
-- allow 'Ms' and 'Mrs' (the 'r' is optional):
... title ~ '(Mr|Mr?s|Dr)'
-- maybe there's a space, maybe there isn't:
... lname ~ 'Mac ?Affee'
-- don't care how many trailing spaces:
... lname ~ 'Smith *'
-- find records with two or more even digits:
... afield ~ '[02468]{2,}'
or
... afield ~ '[02468][02468]+'
That last one breaks down to
'[02468]' -- an even digit
followed by
'[02468]' -- an even digit
'+' -- one or more times
Remember: the plus means '{1,}' -- i.e. that the preceeding
item must match one or more times.
Getting back to our phone number pattern... this is
the same request as above, but it's easier to
see what's going on, using the {bounds} feature:
SELECT * FROM cust
WHERE descr ~ '[0-9]{3}-[0-9]{4}';
-- three digits, hyphen, four digits
Excluding Characters "[^]"
--------------------------
Note that the above pattern will also match 00000-0000
because the last three digits of the five in the left
chunk, with the four digits of the right chunk, fit
perfectly into the constraints of what we asked for.
And this ain't no phone number.
Here's an approach to fixing that snag:
SELECT * FROM cust
WHERE descr ~ '[^0-9][0-9]{3}-[0-9]{4}';
Note that we've prefixed the previous pattern with
'[^0-9]'
because, within the square brackets of a character class,
the CARAT (^) means "anything EXCEPT..."
So now, 00000-0000 will NOT match; this is what we're
after.
But there's a new problem: if the phone number is the
very first thing in the "descr" field it would never
match because we're demanding that there be SOMETHING
(besides a digit) before the phone number. What to do?
At The Very Beginning "^"
-------------------------
You can match "beginning-of-field" with the CARAT (^).
(When it's the first thing inside square brackets, it
negates the character class; outside brackets, a carat
means "beginning-of-text". Got that? It's important!)
SELECT * FROM cust
WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}';
Here's what this pattern looks for:
Either
'^' -- at the very beginning of text
or
'[^0-9]' -- find something that's a non-digit
followed by
'[0-9]{3}' -- three digits
and then a
'-' -- hyphen
and finally
'[0-9]{4}' -- four digits
and we've solved the 'at-beginning-of-field' problem.
At The Very End "$"
-------------------
Still, we could inadvertently match something like
000-000000000 (probably an id number for a replacement
part) which we don't want to do.
Note how this is the same problem, for the end
of the field, that we had earlier, with the beginning
of the field? The solution is similar:
SELECT * FROM cust
WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}($|[^0-9])';
The DOLLAR SIGN ($) represents 'end-of-text' just as
carat, outside of square brackets, means 'beginning-of-
text'. The above pattern is the same as before, except
we appended
'($|[^0-9])'
to the end, which means we're looking for
either
'$' -- at the very end-of-text
or
'[^0-9]' -- something that's not a digit
after our previous pattern.
And, finally, we have a comprehensive phone pattern --
it'll match ###-#### while excluding other numeric
arrangements -- in a text field.
Anchoring
---------
Did you notice that regular expressions aren't "anchored",
as the LIKE expressions are? To specify that you're looking
only at the beginning of a field, you need to use '^' and
to include the end of a field, you must use '$'.
It's a good thing regular expressions aren't anchored
automatically -- we wouldn't have been able to specify
our phone number pattern above! Things like '(^|[^0-9])'
are possible because YOU get to specify when and how
beginning-of-text is required. Cool, eh?
And, just as in the LIKE operator, unless regular expressions
ARE anchored at the beginning of a field, you'll defeat any
index you have for that field. Indexes help alphabetize by
comparing the beginning of your fields to each other; unless
you're looking at the beginning of the field, your index can't
be used.
Here are some comparisons between operators LIKE and ~ :
-- list all customers within the 47610 postal code:
SELECT * FROM cust WHERE zip LIKE '47610%';
-- zip begins with '47610' then has anything
SELECT * FROM cust WHERE zip ~ '^47610';
-- zip begins with '47610'
-- display customers who have 'Corp' in their names
SELECT * FROM cust WHERE name LIKE '%Corp%';
-- name contains 'Corp' with anything before and after
SELECT * FROM cust WHERE name ~ 'Corp';
-- name contains 'Corp'
-- show customers whose names end in 'LLC'
SELECT * FROM cust WHERE name LIKE '%LLC';
-- name can have anything, with 'LLC' at the end
SELECT * FROM cust WHERE name ~ 'LLC$';
-- name must have 'LLC' at the end
-- documents beginning with 'We', ending with 'rica'
SELECT * FROM doc WHERE contents LIKE 'We%rica';
-- starts with 'We', has anything, ending with 'rica'
SELECT * FROM doc WHERE contents ~ '^We.*rica$';
-- start with 'We', zero or more chars, end with 'rica'
Remember that in regular expressions, DOT means 'any character'.
Thus '.*' means 'any character, zero times or more' which, in
English, means "anything can go here, including nothing at all".
There are ways to anchor your searches to word boundaries,
as well -- not just beginning-of-field and end-of-field. See
your documentation for details.
Case-Insensitive "~*"
---------------------
If you're not worried about differentiating between
uppercase and lowercase in your regular expressions,
you could go full boar and try
SELECT * FROM cust
WHERE lname ~ '[Oo][Cc][Tt][Aa][Vv][Ii][Aa][Nn]';
Fortunately there's a quickie to make it easier for you --
where you've been using the tilde (~) as your operator, use
tilde-star (~*) instead:
SELECT * FROM cust
WHERE lname ~* 'Octavian';
ILIKE is case-insensitive (where LIKE is case-specific) just
as ~* is case-insensitive for regular expressions (where ~ is
case-specific). Very handy!
Example
-------
Email addresses can look like any of these:
me(at)my(dot)net
someone-unimportant9237(at)this(dot)little(dot)org
first(dot)last(dot)title(at)obscure(dot)sub-net(dot)biggie(dot)com
_weird_(at)somewhere(dot)out_there(dot)net
A reasonably-functional email address pattern might be
something like this:
'[a-z0-9_]+([\\-\\.][a-z0-9_]+)*(at)[a-z0-9_]+([\\-\\.][a-z0-9_]+)+'
That breaks down to (using case-indifferent via ~*, of course):
'[a-z0-9_]'
any alphanumeric (including underscore) character
'+'
one or more times, followed by
'([\\-\\.][a-z0-9_]+)'
email stuff [1],
'*'
zero or more times, followed by
'@'
which is then followed by
'[a-z0-9_]'
alphanumerics (including underscore)
'+'
one or more times, followed by
'([\\-\\.][a-z0-9_]+)'
site stuff [2]
'+'
one or more times.
As for [1] email stuff, it can be (zero or more of):
'[\\-\\.]'
hyphen or dot, followed by
'[a-z0-9_]'
alphanumerics (or underscore)
'+'
one or more times.
And as for [2] site stuff, it can be (one or more of):
'[\\-\\.]'
hyphen or dot, followed by
'[a-z0-9_]'
alphanumerics (or underscore)
'+'
one or more times.
So the personal part of the address can just be a
single word, with optional dot-or-hyphen joining other
words (ZERO or more) onto it; this is followed by at-sign,
which is followed by the site portion of the address,
namely: any word, followed by ONE or more dot-or-hyphen-
followed-by-another-word combo's.
It's not a perfect match for all legal email patterns,
(for example, bad(at)my-addr would match) but something like
it might be sufficient depending on your requirements.
CONCLUSION
==========
Regular expressions are complicated, because they're
powerful! (Or is it the other way around?)
Check out "Pattern Matching" in the PostgreSQL manual
for more details -- on my Debian "Potato" system it's in
html/postgres/functions-matching.html under the
/usr/share/doc/postgresql-doc/ directory.
--
mailto:will(at)serensoft(dot)com
http://www.dontUthink.com/
From | Date | Subject | |
---|---|---|---|
Next Message | Doug McNaught | 2002-04-10 00:14:10 | Re: PostgreSQL said: ERROR: Attribute 'id' not found |
Previous Message | eric.jones | 2002-04-09 23:54:57 | PostgreSQL said: ERROR: Attribute 'id' not found |