Re: [PATCH] Add roman support for to_number function

From: Hunaid Sohail <hunaidpgml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>
Subject: Re: [PATCH] Add roman support for to_number function
Date: 2025-01-20 07:32:51
Message-ID: CAMWA6yaXq9=7rJdS9LVkmYB0d-AYt3pLC0hpBe9+stnM7W+SQw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tom,

On Sat, Jan 18, 2025 at 5:27 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Hunaid Sohail <hunaidpgml(at)gmail(dot)com> writes:
> > I’ve attached a new patch that addresses comments 2, 3, and 4 from Tomas.
>
> I'm still quite unhappy that this doesn't tolerate trailing
> whitespace. That's not how other format patterns work, and it makes
> it impossible to round-trip the output of to_char(..., 'RN').
> I think the core of the problem is that you tried to cram the logic
> in where the existing "not implemented" error is thrown. But on
> inspection there is nothing sane about that entire "Roman correction"
> stanza -- what it does is either useless (zeroing already-zeroed
> fields) or in the wrong place (if we don't want to allow other flags
> with IS_ROMAN, that should be done via an error in NUMDesc_prepare,
> like other similar cases). In the attached I moved the roman_to_int
> call into NUM_processor's main loop so it's more like other format
> patterns, and fixed it to not eat any more characters than it should.
> This might allow putting back some format-combination capabilities,
> but other than the whitespace angle I figure we can leave that for
> people to request it.
>

Thank you for the tweaks to the patch. Overall, it looks great.

Initially, when I looked at the test case:
SELECT to_number('M CC', 'RN');

I was confused about whether it should be 'MCC'. After further inspection,
I realized that the output is 1000 for 'M'. The format of the input can be a
bit unclear. Perhaps we could add a note in the documentation
or issue a warning in roman_to_int function when input is truncated,
to clarify this behavior.

+ bool truncated = false;
+
+ /*
+ * Collect and decode valid roman numerals, consuming at most
+ * MAX_ROMAN_LEN characters. We do this in a separate loop to avoid
+ * repeated decoding and because the main loop needs to know when
it's at
+ * the last numeral.
+ */
+ for (len = 0; len < MAX_ROMAN_LEN && !OVERLOAD_TEST; len++)
+ {
+ char currChar = pg_ascii_toupper(*Np->inout_p);
+ int currValue = ROMAN_VAL(currChar);
+
+ if (currValue == 0)
+ {
+ truncated = true;
+ break; /* Not a valid
roman numeral. */
+ }
+ romanChars[len] = currChar;
+ romanValues[len] = currValue;
+ Np->inout_p++;
+ }
+
+ if (len == 0)
+ return -1; /* No valid roman
numerals. */
+
+ /* Check for truncation. */
+ if (truncated)
+ {
+ ereport(WARNING,
+
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("Input truncated to \"%.*s\"",
+ len, romanChars)));
+ }

Output after change:

postgres=# SELECT to_number('MMXX ', 'RN');
WARNING: Input truncated to "MMXX"
to_number
-----------
2020
(1 row)

postgres=# SELECT to_number(' XIV ', ' RN');
WARNING: Input truncated to "XIV"
to_number
-----------
14
(1 row)

postgres=# SELECT to_number('M CC', 'RN');
WARNING: Input truncated to "M"
to_number
-----------
1000
(1 row)

Also, some modifications to the test cases will be required
if we go with these changes.

Regards,
Hunaid Sohail

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mats Kindahl 2025-01-20 07:33:12 Re: Coccinelle for PostgreSQL development [1/N]: coccicheck.py
Previous Message Guillaume Lelarge 2025-01-20 07:24:54 Re: improve DEBUG1 logging of parallel workers for CREATE INDEX?