Re: Can we make regexp processing more friendly by recognizing "\r\n" as a "newline" for "^$" purposes?

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Francisco Olarte <folarte(at)peoplecall(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Can we make regexp processing more friendly by recognizing "\r\n" as a "newline" for "^$" purposes?
Date: 2015-10-19 13:15:05
Message-ID: CAKFQuwa40BAHY9r4uE82=s2CZPN3FcA_bALDAAcnOyH_OdrauQ@mail.gmail.com
Lists: pgsql-general

On Mon, Oct 19, 2015 at 1:26 AM, Francisco Olarte <folarte(at)peoplecall(dot)com>
wrote:

> Hi David:
>
> On Sun, Oct 18, 2015 at 7:49 PM, David G. Johnston
> <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
> > Other implementations of regular expressions handle "newline" mechanics
> > related to "^" and "$" semantically instead of literally. By that I mean
> > that both "\r\n" and "\n" are considered "newlines" instead of just "\n".
>
> Which ones ? AFAIK this kind of thing is usually done by C ( and
> related ) runtimes when reading text files.
>
>

In particular, Java.

There is a third-party program I use, RegEx Buddy, that also behaves in the
way described.
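
A minimal sketch of the Java behavior I mean, using only java.util.regex. The class and helper names (LineTerminatorDemo, found) are made up for illustration; Pattern.MULTILINE and Pattern.UNIX_LINES are the standard flags. With MULTILINE alone, "$" matches before "\r\n" as well as before "\n"; adding UNIX_LINES restricts line terminators to "\n" only, which is roughly the literal behavior being discussed.

```java
import java.util.regex.Pattern;

public class LineTerminatorDemo {
    // Hypothetical helper: true if `regex` is found anywhere in `input` under `flags`.
    static boolean found(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // MULTILINE: "$" matches just before any line terminator, "\r\n" included.
        System.out.println(found("123$", "123\n", Pattern.MULTILINE));    // true
        System.out.println(found("123$", "123\r\n", Pattern.MULTILINE));  // true

        // UNIX_LINES: only "\n" is a line terminator, so the "\r" now sits
        // between "123" and the end of the line and "$" no longer matches.
        System.out.println(
            found("123$", "123\r\n", Pattern.MULTILINE | Pattern.UNIX_LINES)); // false
    }
}
```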

> At least in my machine perl does not do it:
>
> censored:~$ perl -e 'print( ("A\r\n" =~ /A$/) ? "matched\n" : "NO MATCH\n");'
> NO MATCH
> censored:~$ perl -e 'print( ("A\r\n" =~ /A.$/) ? "matched\n" : "NO MATCH\n");'
> matched
> censored:~$ perl -e 'print( ("A\r\n" =~ /A\s$/) ? "matched\n" : "NO MATCH\n");'
> matched
>

Yes; and I find this to be an annoyance as well...

>
> Normally when reading lines in CP/M and related ( MSDOS, Windows ) the
> CRT does collapse them ( and sometimes just zaps \r, or collapse any
> run, or consider [\r*]\n[\r*] or.... ). But I normally do not see that
> behaviour in regexes.
>
> > If changing behavior is not desirable I would be content with another flag
> > that would toggle such behavior.
> > In code - both of these subqueries should match whereas presently only the
> > first one does.
> > SELECT regexp_matches(E'123\n', E'123$', 'w');
> > SELECT regexp_matches(E'123\r\n', E'123$', 'w');
> > I don't know if this is server O/S dependent...but I would not expect it to
> > be so.
>
> Neither do I ( expect it to be os dep. ) , but I find the current
> behaviour correct. I mean, newline stuff is OS dependent, and you
> should convert when ingesting data, when matching them it should
> already have been converted to whatever the language uses for newlines
> ( in C and perl that means \n, which needs not be \012, BTW . In unix
> \n=\012 on disk, on CP/M it's \015\012 and when I worked with Mac (
> before the unixy osX they use now ) it was \015, and I cannot think on
> what they can use on EBCDIC machines ).
>
>
The current behavior is correct. The behavior I describe, however, would
be more user-friendly without being "incorrect".

Having started with Windows, and still being reliant upon external sources
that use it, I've been (un)fortunate enough to develop habits where 99% of
the time I do not have to care about line endings while processing data.
I'll pick up new habits eventually, but not having to deal with a
pre-process line-ending conversion step would make ad-hoc use of the
PostgreSQL regex engine (Tcl's) less cumbersome.
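
To make the two workarounds concrete, here is a rough sketch in Java rather than SQL. It assumes an engine that treats only "\n" as a newline, which I emulate with Pattern.MULTILINE | Pattern.UNIX_LINES; the class and method names (NormalizeNewlines, normalize, found) are hypothetical. It shows the pre-process conversion step and the alternative of making the pattern itself tolerate an optional "\r".

```java
import java.util.regex.Pattern;

public class NormalizeNewlines {
    // Assumption for illustration: emulate an engine where only "\n" ends a line.
    static final int LITERAL_NEWLINE = Pattern.MULTILINE | Pattern.UNIX_LINES;

    // Workaround 1: convert "\r\n" and lone "\r" to "\n" before matching.
    static String normalize(String s) {
        return s.replaceAll("\\r\\n?", "\n");
    }

    // True if `regex` is found anywhere in `input` under the literal-newline flags.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex, LITERAL_NEWLINE).matcher(input).find();
    }

    public static void main(String[] args) {
        String data = "123\r\n";
        System.out.println(found("123$", data));            // false: "\r" is in the way
        System.out.println(found("123$", normalize(data))); // true after conversion
        // Workaround 2: leave the data alone and allow an optional "\r" in the pattern.
        System.out.println(found("123\\r?$", data));        // true
    }
}
```

Either approach works; my point is only that both are extra steps the caller has to remember.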

I'm hoping that Tom Lane at least chimes in with his opinion; given his
recent work, that area of the codebase is at least fresh in his mind. It's
not a huge deal, but recent pain motivates me to at least put it out there.

David J.
