Quick Links

Re: Allow to_date() and to_timestamp() to accept localized names

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>
Cc:	Arthur Zakirov <zaartur(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Allow to_date() and to_timestamp() to accept localized names
Date:	2020-01-21 23:20:24
Message-ID:	31691.1579648824@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= <juanjo(dot)santamaria(at)gmail(dot)com> writes:
> [ 0001-Allow-localized-month-names-to_date-v6.patch ]

I took a quick look through this.

One thing I completely don't understand is why it's sufficient for
seq_search_localized to do "initcap" semantics. Shouldn't it cover
the all-upper and all-lower cases as well, as the existing seq_search
code does? (That is, why is it okay to ignore the "type" argument?)

On the other hand, you probably *should* be ignoring the "max"
argument, because AFAICS the values that are passed for that are
specific to the English ASCII spellings of the day and month names.
It seems very likely that they could be too small for some sets of
non-English names.

Related to that, the business with

+ mb_max = max * pg_encoding_max_length(encoding);
+ name_len = strlen(name);
+ name_len = name_len < mb_max ? name_len : mb_max;

seems wrong even if we assume that "max" is a sane limit on the number of
characters to consider. This coding is likely to truncate a long "name"
string in the middle of a multibyte character, leading to bad-encoding
errors from later functions.

I also am confused by the "if (mb_max > max ...)" bit. It looks to me
like that's an obscure way of writing "if (pg_encoding_max_length() > 1)",
which is a rather pointless check considering that the if() then goes on
to insist on encoding == PG_UTF8.

A bit further down, you have an "if (name_wlen > norm_wlen)"
condition that seems pretty fishy. Is it really guaranteed that
unicode_normalize_kc cannot transform the string without shortening it?
I don't see that specified in its header comment, for sure.

Also, it's purely accidental if this:

norm_name = (char *) palloc((norm_wlen + 1) * sizeof(pg_wchar));

allocates a long enough string for the conversion back to multibyte form,
because that output is not pg_wchars. I think you want something more
like "norm_wlen * MAX_MULTIBYTE_CHAR_LEN + 1". (I wonder whether we
need to be worrying about integer overflow in any of this.)

It seems like you could eliminate a lot of messiness by extending
localized_abbrev_days[] and related arrays by one element that's
always NULL. That would waste, hmm, four 8-byte pointers on typical
machines --- but you're eating a lot more than 32 bytes of code to
pass around the array lengths, plus it's really ugly that the plain
and localized arrays are handled differently.

I don't think the way you're managing the "localized_names" variable
in DCH_from_char is very sensible. The reader has to do a lot of
reverse engineering just to discover that the value is constant NULL
in most references, something that you've not helped by declaring
it outside the loop rather than inside. I think you could drop
the variable and write constant NULL in most places, with something
like
S_TM(n->suffix) ? localized_full_days : NULL
in those from_char_seq_search calls where it's relevant.

In general I'm displeased with the lack of attention to comments.
Notably, you added arguments to from_char_seq_search without updating
its header comment ... not that that comment is a great API spec, but
at the least you shouldn't be making it worse. I think the bug I
complained of above is directly traceable to the lack of a clear spec
here for what "max" means, so failure to keep these comments up to
speed does have demonstrable bad consequences.

regards, tom lane

In response to

Re: Allow to_date() and to_timestamp() to accept localized names at 2020-01-15 18:21:11 from Juan José Santamaría Flecha

Responses

Re: Allow to_date() and to_timestamp() to accept localized names at 2020-01-22 22:11:27 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tomas Vondra	2020-01-21 23:41:41	Re: proposal: schema variables
Previous Message	Alvaro Herrera	2020-01-21 22:37:26	Re: Psql patch to show access methods info