From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Oliver Ford <ojford(at)gmail(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Fix number skipping in to_number |
Date: | 2017-11-15 21:28:48 |
Message-ID: | 5597.1510781328@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
I wrote:
> Oliver Ford <ojford(at)gmail(dot)com> writes:
>> On Monday, 13 November 2017, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I don't follow your concern? If "$" is not the correct currency
>>> symbol for the locale, we shouldn't accept it as a match to an L format.
>>> Your patch is tightening what we will accept as a match to a G format,
>>> so I don't see why you're concerned about backward compatibility in
>>> one case but not the other.
>> It's a guess as to the likely use case. I would imagine that people are
>> likely to use a currency symbol different from the locale, but unlikely to
>> use a different group separator. Others might have a different opinion
>> though.
> Well, if they use a currency symbol different from the locale's, they're
> in trouble anyway because the number of bytes might be different. In most
> encodings, symbols other than "$" are probably not 1-byte characters.
> At the very least I think we need to constrain it enough that it not
> swallow a fractional character.
After more testing I understood your concern about L_currency_symbol:
in C locale that's " ", not "$" as I naively imagined. Americans,
at least, would be pretty unhappy if "$1234.56" suddenly stopped matching
"L9999.99". So it seems like we can't institute a strict matching rule.
However, it is certainly not good that this happens:
regression=# select to_number('1234.56', 'L9999.99');
to_number
-----------
234.56
(1 row)
To me that seems just as bad as having ',' or 'G' eat a digit.
After some reflection I propose that the rule that we want is:
* ',' and 'G' consume input only if it exactly matches the expected
separator.
* Other non-data template patterns consume a number of input characters
equal to the number of characters they'd produce in output, *except* that
these patterns will not consume data characters (digits, signs, decimal
point, comma).
I think that while we are at it we should take some measures to ensure
that "character" in this definition means "character", not "byte".
It is not good that a euro currency symbol might consume an
encoding-dependent number of input characters.
That leads me to the attached patch. There is more that could be done
here --- in particular, I'd like to see the character-not-byte-count
rule extended to literal text. But that seems like fit material for
a different patch.
Also, I noticed that in your form of the patch, the strncmp() could read
past the end of the string, possibly resulting in a crash. So I made it
use the AMOUNT_TEST infrastructure from NUM_numpart_from_char to avoid that.
One other note: I realized that it was only pure luck that your regression
test cases worked in locales where 'G' is decimal point --- they still
gave the same answer, but through a totally different interpretation of
the input. That did not seem like a good idea, so I adjusted the
regression test to force C locale for the to_number() tests. I wish we
could use some other locale here, but then it likely wouldn't be portable
to Windows.
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
apply-number-v5.patch | text/x-diff | 17.9 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2017-11-15 21:30:28 | Re: [HACKERS] Inconsistencies between pg_settings and postgresql.conf |
Previous Message | Peter Eisentraut | 2017-11-15 21:21:56 | Re: Transaction control in procedures |