Re: [PATCH] Add roman support for to_number function

From: Hunaid Sohail <hunaidpgml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] Add roman support for to_number function
Date: 2024-09-09 12:45:17
Message-ID: CAMWA6ybooJrSMr=vUzV5aiLrH_Bj7o7Xc3Lb0eJVpqi+eNsypA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Sun, Sep 8, 2024 at 4:09 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> A few notes from a quick read of the patch:
>
> * roman_to_int() should have a header comment, notably explaining its
> result convention. I find it fairly surprising that "0" means an
> error --- I realize that Roman notation can't represent zero, but
> wouldn't it be better to use -1?
>

Noted.

>
> * Do *not* rely on toupper(). There are multiple reasons why not,
> but in this context the worst one is that in Turkish locales it's
> likely to translate "i" to "İ", on which you will fail. I'd use
> pg_ascii_toupper().
>

Noted.

>
> * I think roman_to_int() is under-commented internally too.
> To start with, where did the magic "15" come from? And why
> should we have such a test anyway --- what if the format allows
> for trailing stuff after the roman numeral? (That is, I think
> you need to fix this so that roman_to_int reports how much data
> it ate, instead of assuming that it must read to the end of the
> input.)

MMMDCCCLXXXVIII is the longest valid standard roman numeral (15
characters). I will add appropriate comment on length check.

I am not sure I am able to understand the latter part. Could you please
elaborate it?
Are you referring to cases like these?
SELECT to_number('CXXIII foo', 'RN foo');
SELECT to_number('CXXIII foo', 'RN');
Please check my comments on Oracle's behaviour below.

> The logic for detecting invalid numeral combinations
> feels a little opaque too. Do you have a source for the rules
> you're implementing, and if so could you cite it?
>

There are many sources on the internet.
A few sources:
1. https://www.toppr.com/guides/maths/knowing-our-numbers/roman-numerals/
2. https://www.cuemath.com/numbers/roman-numerals/

Note that we are following the standard form:
https://en.wikipedia.org/wiki/Roman_numerals#Standard_form

>
> * This code will get run through pgindent before commit, so you
> might want to revisit whether your comments will still look nice
> afterwards. There's not a lot of consistency in them about initial
> cap versus lower case or trailing period versus not, too.
>

Noted.

>
> * roman_result could be declared where used, no?
>

Noted.

>
> * I'm not really convinced that we need a new errcode
> ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like
> ERRCODE_INVALID_TEXT_REPRESENTATION. However, if we do do it
> like that, why did you pick the out-of-sequence value 22P07?
>

22P07 is not out-of-sequence. 22P01 to 22P06 are already used.
However, I do agree with you that we can use
ERRCODE_INVALID_TEXT_REPRESENTATION. I will change it.

>
> * Might be good to have a few test cases demonstrating what happens
> when there's other stuff combined with the RN format spec. Notably,
> even if RN itself won't eat whitespace, there had better be a way
> to write the format string to allow that.
>

The current patch (v3) simply ignores other formats with RN except when RN
is used EEEE which returns error.
```
postgres=# SELECT to_number('CXXIII', 'foo RN');
to_number
-----------
123
(1 row)

postgres=# SELECT to_number('CXXIII', 'MIrn');
to_number
-----------
123
(1 row)

postgres=# SELECT to_number('CXXIII', 'EEEErn');
ERROR: "EEEE" must be the last pattern used
```

However, I think that other formats except FM should be rejected when used
with RN in NUMDesc_prepare function. Any opinions?

If we look into Oracle, they are strict in this regard and don't allow any
other format with RN.
1. SELECT to_char(12, 'MIrn') from dual;
2. SELECT to_char(12, 'foo RN') from dual;
results in error:
ORA-01481: invalid number format model

I also found that there is no check when RN is used twice (both in to_char
and to_number) and that results in unexpected output.

```
postgres=# SELECT to_number('CXXIII', 'RNrn');
to_number
-----------
123
(1 row)

postgres=# SELECT to_char(12, 'RNrn');
to_char
--------------------------------
XII xii
(1 row)

postgres=# SELECT to_char(12, 'rnrn');
to_char
--------------------------------
xii xii
(1 row)

postgres=# SELECT to_char(12, 'FMrnrn');
to_char
---------
xiixii
(1 row)
```

> * Further to Aleksander's point about lack of test coverage for
> the to_char direction, I see from
>
> https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
> that almost none of the existing roman-number code paths are covered
> today. While it's not strictly within the charter of this patch
> to improve that, maybe we should try to do so --- at the very
> least it'd raise formatting.c's coverage score a few notches.
>
>
I see that you have provided a patch to increase test coverage of to_char
roman format including some other formats. I will try to add more tests for
the to_number roman format.

I will provide the next patch soon.

Regards,
Hunaid Sohail

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2024-09-09 12:56:08 Re: Remove hardcoded hash opclass function signature exceptions
Previous Message Daniel Gustafsson 2024-09-09 12:22:19 Retire support for OpenSSL 1.1.1 due to raised API requirements