Re: Can we make regexp processing more friendly by recognizing "\r\n" as a "newline" for "^$" purposes?

From: Francisco Olarte <folarte(at)peoplecall(dot)com>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Can we make regexp processing more friendly by recognizing "\r\n" as a "newline" for "^$" purposes?
Date: 2015-10-19 05:26:20
Message-ID: CA+bJJbzNHEqufUh=SUGJ_zSXU5TEAgdTgHqpzv_UZ9SVgg6KUg@mail.gmail.com
Lists: pgsql-general

Hi David:

On Sun, Oct 18, 2015 at 7:49 PM, David G. Johnston
<david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
> Other implementation of regular expressions handle "newline" mechanics
> related to "^" and "$" semantically instead of literally. By that I mean
> that both "\r\n" and "\n" are considered "newlines" instead of just "\n".

Which ones? AFAIK this kind of translation is usually done by the C (and
related) runtimes when reading text files, not by the regex engine.

At least on my machine, Perl does not do it:

censored:~$ perl -e 'print( ("A\r\n" =~ /A$/) ? "matched\n" : "NO MATCH\n");'
NO MATCH
censored:~$ perl -e 'print( ("A\r\n" =~ /A.$/) ? "matched\n" : "NO MATCH\n");'
matched
censored:~$ perl -e 'print( ("A\r\n" =~ /A\s$/) ? "matched\n" : "NO MATCH\n");'
matched

Normally, when reading lines on CP/M and its relatives (MS-DOS, Windows),
the C runtime collapses \r\n into \n (and sometimes it just zaps \r, or
collapses any run, or treats [\r*]\n[\r*] as a newline, or....). But I do
not normally see that behaviour in regexes.
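
As one more data point (a sketch in Python rather than Perl, and only an illustration of a single additional engine): Python's `re` module also treats only "\n" as the line terminator for "$", so the same three patterns give the same results as the Perl one-liners above. The `check` helper is a name made up for this example.

```python
import re

SAMPLE = "A\r\n"  # CRLF-terminated input, same as in the Perl one-liners


def check(pattern, text=SAMPLE):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None


# "$" matches at the very end or just before a trailing "\n";
# the "\r" is an ordinary character that "." and "\s" can consume.
results = {p: check(p) for p in (r"A$", r"A.$", r"A\s$")}

for pattern, matched in results.items():
    print(pattern, "matched" if matched else "NO MATCH")
```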

> If changing behavior is not desirable I would be content with another flag
> that would toggle such behavior.
> In code - both of these subqueries should match whereas presently only the
> first one does.
> SELECT regexp_matches(E'123\n', E'123$', 'w');
> SELECT regexp_matches(E'123\r\n', E'123$', 'w');
> I don't know if this is server O/S dependent...but I would not expect it to
> be so.

Neither do I (expect it to be OS dependent), but I find the current
behaviour correct. I mean, newline handling is OS dependent, and you
should convert when ingesting data; by the time you match it, it should
already have been converted to whatever the language uses for newlines
(in C and Perl that means \n, which need not be \012, BTW. On Unix
\n is \012 on disk, on CP/M it's \015\012, and when I worked with Macs
(before the Unix-based OS X they use now) it was \015, and I cannot think
what they use on EBCDIC machines).
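
A minimal sketch of the convert-on-ingest idea, in Python for illustration only. `normalize_newlines` is a hypothetical helper, not something PostgreSQL or any library provides, and it assumes the incoming text uses CRLF, bare CR, or LF line endings.

```python
import re


def normalize_newlines(text):
    """Convert CRLF and bare CR line endings to LF (hypothetical helper)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


raw = "123\r\n456\r\n"           # data as it might arrive from a Windows client
clean = normalize_newlines(raw)  # what gets stored and matched against

# With re.MULTILINE, "$" matches just before each "\n", so the anchored
# pattern only finds the lines once the "\r" characters are gone.
before = re.findall(r"^\d+$", raw, flags=re.MULTILINE)
after = re.findall(r"^\d+$", clean, flags=re.MULTILINE)

print(before, after)
```

Doing the conversion once at load time keeps every later pattern simple, instead of each regex having to allow for an optional \r before the end of the line.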

Francisco Olarte.
