Re: substring start position behavior

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: dansonlinepresence(at)gmail(dot)com, pgsql-docs(at)lists(dot)postgresql(dot)org
Subject: Re: substring start position behavior
Date: 2024-03-06 05:42:09
Message-ID: 72911.1709703729@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> This web page explains the feature:
> https://stackoverflow.com/questions/33462061/sql-server-substring-position-negative-value
> but also asks:
> now the only question that remains is, "why would anyone need it
> to behave this way?"

Yeah. I believe our implementation adheres to the SQL spec, which
says this for <character substring function> (in SQL:2021 6.3.2):

a) If the character encoding form of <character value expression>
is UTF8, UTF16, or UTF32, then, in the remainder of this General
Rule, the term “character” shall be taken to mean “unit specified
by <char length units>”.

b) Let C be the value of the <character value expression>, let LC
be the length in characters of C, and let S be the value of the
<start position>.

c) If <string length> is specified, then let L be the value of
<string length> and let E be S+L. Otherwise, let E be the larger
of LC+1 and S.

d) If at least one of C, S, and L is the null value, then the
result of the <character substring function> is the null value.

e) If E is less than S, then an exception condition is raised:
data exception — substring error (22011). [tgl note: given c),
this happens if and only if a negative <string length> is provided.]

f) Case:

i) If S is greater than LC or if E is less than 1 (one), then
the result of the <character substring function> is the
zero-length character string.

ii) Otherwise,

1) Let S1 be the larger of S and 1 (one). Let E1 be the
smaller of E and LC+1. Let L1 be E1–S1.

2) The result of the <character substring function> is a
character string containing the L1 characters of C starting at
character number S1 in the same order that the characters
appear in C.

That's a pretty sterling example of standards-ese that is both
unreadable and devoid of any justification. But if you trace through
the possible effects of a negative S value, it looks like

(1) if L >= 0 is specified and S+L (E) is less than one, the result
is an empty string per rule f)i).

(2) if L >= 0 is specified and S+L (E) is at least one but less than
LC+1, then E is the substring end+1 position.

(3) otherwise, a negative S is disregarded and replaced by 1 so
far as the substring end calculation is concerned.

(4) in any case, a negative S is disregarded and replaced by 1 so
far as the substring start calculation is concerned.

I'm kind of inclined to not document this weirdness. I especially
don't think it's worth giving an example that neither explains the
"disregarded" bit nor highlights the dependency on L being given.

regards, tom lane

In response to

Responses

Browse pgsql-docs by date

  From Date Subject
Next Message PG Doc comments form 2024-03-06 23:42:15 Non-blocking synchronization in libpq using pipeline mode
Previous Message Bruce Momjian 2024-03-06 03:24:47 Re: substring start position behavior