Re: Some qualms with the current description of RegExp s,n,w modes.

From: David Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-docs(at)postgresql(dot)org
Subject: Re: Some qualms with the current description of RegExp s,n,w modes.
Date: 2014-06-06 00:32:38
Message-ID: CAKFQuwY=D+4wK1LpZpxXiP3p_SdEb1pMy8k5Y+sh6m9mhUFCPA@mail.gmail.com
Lists: pgsql-docs

On Thu, Jun 5, 2014 at 8:00 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com> writes:
> > I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
> > "anchors", though I did make use of ^ and $ individually quite a bit. I
> > did not formally define these terms in the body either.
>
> Did you mean to attach a proposed doc patch here, or are you just
> armwaving about what a patch might look like?
>

Armwaving for lack of any current setup to generate doc-patches.

> FWIW, I don't agree with using "wildcard" to mean those particular things
> (the term is too generic, and there are other regex constructs that
> might be thought to be included); although you could probably get away
> with using "anchor" this way as long as you define the term at first use.
>
>
I had the same nagging suspicion but figured that for a first pass, and
defined only within this context, it would suffice. ". and ^ brackets" just
rubbed me the wrong way but it does have the merit of being precise.

> The text involved here is more or less verbatim from Henry Spencer's
> original man page for the regex library, so you're essentially claiming
> you know more than the author did about what his code is good for. Maybe
> so, but some examples in support of your thesis would be a good thing.
>

I can readily support why I found [w] to be most useful. The conclusion
that [w] > [s] came from this logic: [s] makes "^ and $" useless inside the
string, so using [w] mode and simply avoiding them would have the same effect.
I'll admit that people using ^ and $ where they really meant \A and \Z may
be an issue worth accounting for, but I personally consider providing that
mode to be a compatibility/help-oriented decision and just decided to state
so in my revision.
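
To make the argument above concrete, here is a minimal sketch. It uses Python's re module as a stand-in rather than PostgreSQL itself, and the flag mapping is my assumption: re.S for [s], re.M for [n], and re.M | re.S for [w]. The input string and the helper names are hypothetical. It only illustrates that an anchor-free pattern behaves the same under [s] and [w], while ^ gains extra match positions under [n] and [w].

```python
import re

# Assumed stand-ins for the PostgreSQL flags (Python's re, not PostgreSQL's engine).
MODES = {
    "s": re.S,         # "." crosses newlines; ^ and $ match only at the string ends
    "n": re.M,         # "." stops at newlines; ^ and $ also match at line boundaries
    "w": re.M | re.S,  # "." crosses newlines; ^ and $ also match at line boundaries
}

TEXT = "alpha\nbeta\ngamma"  # hypothetical three-line input, no trailing newline


def caret_positions(mode):
    """Offsets at which ^ matches under the given mode."""
    return [m.start() for m in re.finditer(r"^", TEXT, MODES[mode])]


def dot_reach(mode):
    """Text consumed by a greedy '.*' that starts at the first 'l'."""
    return re.search(r"l.*", TEXT, MODES[mode]).group(0)


def anchor_free(mode):
    """An anchor-free pattern: everything from 'beta' onward."""
    return re.findall(r"beta.*", TEXT, MODES[mode])


for mode in MODES:
    print(mode, caret_positions(mode), repr(dot_reach(mode)), anchor_free(mode))
```

In this sketch the [s] and [w] rows differ only in where ^ can match, which is the point being argued: without anchors in the pattern, the two modes are interchangeable.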

Example that prompted this whole journey:

WITH src (filecontent) AS ( VALUES(
$$CDF CORR: DRAIN COOLANT AND REFILL
ADDITIONAL DLR-OP: BGFLDEX
PAY TYPE: C OTH HRS: 0000 FORECAST SERVICE: CHG
TO: EPA CHG: HAZ CHG:
9999 5
SPG CONVERSION SETTINGS - SPG MFG: -- GEN MOD: -- VIN/MODEL#:
ENGINE:

CDR CORR: CUSTOMER ELECTED NOT TO HAVE REPAIRS DONE AT THIS TIME
NOS
PAY TYPE: C OTH HRS: 0000 FORECAST SERVICE: CHG
TO: EPA CHG: HAZ CHG:
9999 03 0030
SPG CONVERSION SETTINGS - SPG MFG: -- GEN MOD: -- VIN/MODEL#:
ENGINE:
$$::varchar
))
, do_match AS (
  -- 'g' returns every match; swap 'w' for 's' or 'n' to compare the modes
  SELECT regexp_matches(filecontent, '^(\S.*?)(?=^\S|\Z)', 'gw') AS match
  FROM src
)
, explode_match AS (
  SELECT unnest(match) FROM do_match
)
SELECT unnest, length(unnest) FROM explode_match;

[s] 1 result because the "^\S" construct attempts to match
beginning-of-document instead of beginning-of-line. This is when I started
digging deeper since I expected it to behave like [w].
[n] 0 results because the (.*?) never gets beyond the first line and thus
cannot match "^\S|\Z" - no problem here, the behavior of "." is as expected.
[w] 2 results as desired/expected. It is possible to replace ^\S with \n\S
(and thus allow [s] to work) but the semantic meaning of ^ makes using this
form more convenient.
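
The three outcomes above can be reproduced in miniature. This is a sketch under stated assumptions, not a PostgreSQL test: it runs the same expression through Python's re module, assuming re.S, re.M, and re.M | re.S behave like [s], [n], and [w] for this pattern. The two-record sample is hypothetical, with indented continuation lines so that ^\S can identify the first line of each record.

```python
import re

# Hypothetical two-record sample. Continuation lines are assumed to be indented,
# which is what lets ^\S identify the first line of each record.
SAMPLE = (
    "REC1 head\n"
    "  detail a\n"
    "  detail b\n"
    "REC2 head\n"
    "  detail c\n"
)

PATTERN = r"^(\S.*?)(?=^\S|\Z)"  # the expression from the query above

# Assumed Python stand-ins for the PostgreSQL mode flags.
MODES = {"s": re.S, "n": re.M, "w": re.M | re.S}


def records(mode):
    """Return the captured records for one mode."""
    return re.findall(PATTERN, SAMPLE, MODES[mode])


for mode in ("s", "n", "w"):
    print(mode, len(records(mode)))
```

With this sample the counts come out as 1, 0, and 2 for s, n, and w. Under s the single match swallows the whole input because ^ can never match again, under n the lazy ".*?" cannot cross the first newline, and under w each record is captured separately.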

Note that CDF has 5 rows of content while CDR only has 4, which strongly
suggests the use of newline-insensitive "wildcard" matching. The choice
of anchor mode is of a cosmetic/semantic nature but I argue that in this
situation the semantics of [w] are preferred over [n].

In either case I'd rather simply drop the existing commentary that [w] is
not that useful and either in words or example explain when it would have
use; even if you do not want to go as far as to claim that [w] is superior
to [n] as I would.

While it is likely possible to write a working expression in all three
modes, my experience (which is largely based on executing these expressions
in Java, not PostgreSQL, though that is becoming more common nowadays) led
me directly to the regexp provided.

> > Instead of calling these "partial" and "inverse partial" better terms
> > would be "newline-sensitive wildcard matching" and "newline-sensitive
> > anchor matching".
>
> Agreed that "partial" is not a very good name, but I remain resistant to
> "wildcard" here.
>
> > The default mode could be called "newline-sensitive full
> > matching".
>
> Or just "newline-sensitive matching" ... does "full" add anything?
>
>
Not much, though after adding "anchor" and "wildcard" to the others the
question became: if this option is not just one of those, is it both or
neither? "Full" makes it clear that it means both.

Maybe something like: [s] - single-line mode; [w] - multi-line mode; [n|m]
- document-only mode; though I dislike re-associating multi-line with [w]
given its current association with [n|m]. "Record Mode [w]" has some merit
since that is at least the use case that I have identified where it is
particularly useful...

David J.
