Re: [PATCH] Add roman support for to_number function

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hunaid Sohail <hunaidpgml(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>
Subject: Re: [PATCH] Add roman support for to_number function
Date: 2025-01-20 16:25:05
Message-ID: 3006318.1737390305@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hunaid Sohail <hunaidpgml(at)gmail(dot)com> writes:
> On Sat, Jan 18, 2025 at 5:27 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Initially, when I looked at the test case:
> SELECT to_number('M CC', 'RN');

> I was confused about whether it should be 'MCC'. After further inspection,
> I realized that the output is 1000 for 'M'. The format of the input can be a
> bit unclear. Perhaps we could add a note in the documentation
> or issue a warning in roman_to_int function when input is truncated,
> to clarify this behavior.

I don't think a warning is a great idea at all. There is no other
place in formatting.c that issues warnings, and this doesn't seem
like the place to start. As an example,

regression=# select to_number('123garbage', '999999');
to_number
-----------
123
(1 row)

There's certainly a case that could be made that to_number should
be stricter about trailing garbage, but it's probably a couple
decades too late to change now. In any case it seems completely
inappropriate to me for RN to adopt that policy all by itself.

A different way that we could potentially look at this example is
that RN should be willing to skip embedded spaces, leading to 'M CC'
producing 1200 not 1000. There is a little bit of support for that
idea in the existing behavior

regression=# select to_number('123 456', '999999');
to_number
-----------
123456
(1 row)

However, when you poke at that a bit closer, it's not a license
for unlimited whitespace:

regression=# select to_number('123 456', '999999');
to_number
-----------
12345
(1 row)

I've not tracked down the exact cause of that, but I think it
has to do with the fact that NUM_numpart_from_char() is willing
to eat a single leading space per call, even if it's not the
first call. The original author's reasoning for that is lost
to history thanks to the lack of comments, but I believe that
the idea was not "eat random whitespace" but "consume a space
that was emitted as a substitute for a plus sign". The fact
that it can happen once per '9' code feels more like a bug than
something intentional. I'm not proposing that we do something
about that, at least not today, but it doesn't seem like
something we want to model RN's behavior on either. So I'm
not in favor of allowing embedded spaces.

However ... in poking at this, I noticed that I had been
mis-understanding what to_char(..., 'RN') does: it fills
to 15 characters with *leading* spaces not trailing spaces
as I'd been thinking:

regression=# select to_char(433, '|RN|');
to_char
-------------------
| CDXXXIII|
(1 row)

So on the argument that to_number's RN should be willing to
consume what to_char's RN produces, we're both wrong and
roman_to_int should be willing to eat leading spaces.
What do you think?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2025-01-20 16:28:17 Re: Eager aggregation, take 3
Previous Message Umar Hayat 2025-01-20 16:19:06 Re: Add XMLNamespaces to XMLElement