Re: Allow to_date() and to_timestamp() to accept localized names

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Arthur Zakirov <zaartur(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Allow to_date() and to_timestamp() to accept localized names
Date: 2020-01-28 02:11:57
Message-ID: 7212.1580177517@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> writes:
> For the record, the correct form of that would appear to be
> select to_date('Ιανουάριος', 'TMMonth');
> with the accent. I had tried different variations of that and they all
> failed.

OK, so for anyone who is as confused as I was, the main point here
seems to be this: the upper case form of Greek sigma is 'Σ',
and the lower case form is 'σ' ... except as the final letter of
a word, where it is supposed to be written like 'ς'.

If I set lc_collate, lc_ctype, and lc_time to 'el_GR.utf8',
then (on a somewhat hoary glibc platform) I get

u8=# select to_char('2020-01-01'::timestamptz, 'TMMONTH');
to_char
----------------------
ΙΑΝΟΥΆΡΙΟΣ
(1 row)

u8=# select to_char('2020-01-01'::timestamptz, 'TMMonth');
to_char
----------------------
Ιανουάριος
(1 row)

u8=# select to_char('2020-01-01'::timestamptz, 'TMmonth');
to_char
----------------------
ιανουάριος
(1 row)

which is correct AFAICS ... but

u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH'));
lower
----------------------
ιανουάριοσ
(1 row)

So what we actually have here, ISTM, is a bug in lower() not to_char().
The bug is unsurprising because str_tolower() simply applies towlower_l()
to each character independently, so there's no way for it to account for
the word-final rule. I'm not aware that glibc provides any API whereby
that could be done correctly. On the other hand, we get it right when
using an ICU collation for lower():

u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH') collate "el-gr-x-icu");
lower
----------------------
ιανουάριος
(1 row)

because that code path passes the whole string to ICU at once, and
of course getting this right is ICU's entire job.

I haven't double-checked, but I imagine that the reason that to_char
gets the month name case-folding right is that what comes out of
strftime(..."%B"...) is "Ιανουάριος" which we are able to upcase
correctly, while the downcasing code paths don't affect 'ς'.

I thought for a little bit about trying to dodge this issue in the
patch by folding to upper case, not lower, before comparing month/day
names. I fear that that would just shift the problem cases to some
other language(s). However, it makes Greek better, and I think it
makes German better (does 'ß' appear in any month/day names there?),
so maybe we should just roll with that. In the end, it doesn't seem
right to reject this patch just because lower() is broken on some
platforms.

The other question your example raises is whether we should be trying
to de-accent before comparison, ie was it right for 'Ιανουάριος' to
be treated differently from 'Ιανουαριος'. I don't know enough Greek
to say, but it kind of feels like that should be outside to_date's
purview.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2020-01-28 02:31:55 Re: Condition variables vs interrupts
Previous Message Michael Paquier 2020-01-28 02:03:03 Re: BufFileRead() error signalling