Re: [PATCH] Add roman support for to_number function

From: Hunaid Sohail <hunaidpgml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] Add roman support for to_number function
Date: 2024-09-12 07:44:23
Message-ID: CAMWA6yYNetOJRCEGPUrA9BENPM_Dk0BvxBJsvECKqaQtR7XkxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I have started working on the next version of the patch and have addressed
the smaller issues identified by Tom. Before proceeding further, I would
like to have opinions on some comments/questions in my previous email.

Regards,
Hunaid Sohail

On Mon, Sep 9, 2024 at 5:45 PM Hunaid Sohail <hunaidpgml(at)gmail(dot)com> wrote:

> Hi,
>
> On Sun, Sep 8, 2024 at 4:09 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> A few notes from a quick read of the patch:
>>
>> * roman_to_int() should have a header comment, notably explaining its
>> result convention. I find it fairly surprising that "0" means an
>> error --- I realize that Roman notation can't represent zero, but
>> wouldn't it be better to use -1?
>>
>
> Noted.
>
>
>>
>> * Do *not* rely on toupper(). There are multiple reasons why not,
>> but in this context the worst one is that in Turkish locales it's
>> likely to translate "i" to "İ", on which you will fail. I'd use
>> pg_ascii_toupper().
>>
>
> Noted.
>
>
>>
>> * I think roman_to_int() is under-commented internally too.
>> To start with, where did the magic "15" come from? And why
>> should we have such a test anyway --- what if the format allows
>> for trailing stuff after the roman numeral? (That is, I think
>> you need to fix this so that roman_to_int reports how much data
>> it ate, instead of assuming that it must read to the end of the
>> input.)
>
>
> MMMDCCCLXXXVIII is the longest valid standard roman numeral (15
> characters). I will add appropriate comment on length check.
>
> I am not sure I am able to understand the latter part. Could you please
> elaborate it?
> Are you referring to cases like these?
> SELECT to_number('CXXIII foo', 'RN foo');
> SELECT to_number('CXXIII foo', 'RN');
> Please check my comments on Oracle's behaviour below.
>
>
>
>> The logic for detecting invalid numeral combinations
>> feels a little opaque too. Do you have a source for the rules
>> you're implementing, and if so could you cite it?
>>
>
> There are many sources on the internet.
> A few sources:
> 1. https://www.toppr.com/guides/maths/knowing-our-numbers/roman-numerals/
> 2. https://www.cuemath.com/numbers/roman-numerals/
>
> Note that we are following the standard form:
> https://en.wikipedia.org/wiki/Roman_numerals#Standard_form
>
>
>>
>> * This code will get run through pgindent before commit, so you
>> might want to revisit whether your comments will still look nice
>> afterwards. There's not a lot of consistency in them about initial
>> cap versus lower case or trailing period versus not, too.
>>
>
> Noted.
>
>
>>
>> * roman_result could be declared where used, no?
>>
>
> Noted.
>
>
>>
>> * I'm not really convinced that we need a new errcode
>> ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like
>> ERRCODE_INVALID_TEXT_REPRESENTATION. However, if we do do it
>> like that, why did you pick the out-of-sequence value 22P07?
>>
>
> 22P07 is not out-of-sequence. 22P01 to 22P06 are already used.
> However, I do agree with you that we can use
> ERRCODE_INVALID_TEXT_REPRESENTATION. I will change it.
>
>
>>
>> * Might be good to have a few test cases demonstrating what happens
>> when there's other stuff combined with the RN format spec. Notably,
>> even if RN itself won't eat whitespace, there had better be a way
>> to write the format string to allow that.
>>
>
> The current patch (v3) simply ignores other formats with RN except when RN
> is used EEEE which returns error.
> ```
> postgres=# SELECT to_number('CXXIII', 'foo RN');
> to_number
> -----------
> 123
> (1 row)
>
> postgres=# SELECT to_number('CXXIII', 'MIrn');
> to_number
> -----------
> 123
> (1 row)
>
> postgres=# SELECT to_number('CXXIII', 'EEEErn');
> ERROR: "EEEE" must be the last pattern used
> ```
>
> However, I think that other formats except FM should be rejected when used
> with RN in NUMDesc_prepare function. Any opinions?
>
> If we look into Oracle, they are strict in this regard and don't allow any
> other format with RN.
> 1. SELECT to_char(12, 'MIrn') from dual;
> 2. SELECT to_char(12, 'foo RN') from dual;
> results in error:
> ORA-01481: invalid number format model
>
> I also found that there is no check when RN is used twice (both in to_char
> and to_number) and that results in unexpected output.
>
> ```
> postgres=# SELECT to_number('CXXIII', 'RNrn');
> to_number
> -----------
> 123
> (1 row)
>
> postgres=# SELECT to_char(12, 'RNrn');
> to_char
> --------------------------------
> XII xii
> (1 row)
>
> postgres=# SELECT to_char(12, 'rnrn');
> to_char
> --------------------------------
> xii xii
> (1 row)
>
> postgres=# SELECT to_char(12, 'FMrnrn');
> to_char
> ---------
> xiixii
> (1 row)
> ```
>
>
>> * Further to Aleksander's point about lack of test coverage for
>> the to_char direction, I see from
>>
>> https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
>> that almost none of the existing roman-number code paths are covered
>> today. While it's not strictly within the charter of this patch
>> to improve that, maybe we should try to do so --- at the very
>> least it'd raise formatting.c's coverage score a few notches.
>>
>>
> I see that you have provided a patch to increase test coverage of to_char
> roman format including some other formats. I will try to add more tests for
> the to_number roman format.
>
>
> I will provide the next patch soon.
>
> Regards,
> Hunaid Sohail
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2024-09-12 08:18:57 Re: not null constraints, again
Previous Message David G. Johnston 2024-09-12 06:31:01 Re: Accept invalidation messages before the query starts inside a transaction