Re: Question regarding UTF-8 data and "C" collation on definition of field of table

From: Dionisis Kontominas <dkontominas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Question regarding UTF-8 data and "C" collation on definition of field of table
Date: 2023-02-06 00:48:15
Message-ID: CAB4Evu1mV53U1K-5jGxRq-u2GchP3kq3R97O11CRvQhGXfJuew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Tom,

1. Regarding the different languages in the same column, that is normal
if the column is a UTF-8 one, i.e. should be able to hold for example
English, Greek and Chinese characters. In this case what is the best
approach to define the collation and lctype of the column? Either C
or ucs_basic maybe or something else or it does not matter, the characters
would be stored correctly as long as the database is UTF8 encoding?
2. Is there a real difference between C and ucs_basic collation/lctype?
I have looked at the documentation and on the internet and most information
converges that they are actually the same as behavior is concerned;
actually no difference. What I suspect though as a non-similar is that C
after the ASCII part sorts the characters according to byte code point,
whereas the ucs_basic sorts those characters according to the Unicode code
point which may be different I suppose. Can you confirm?
3. In case the previous is correct I think that the most successful
collation for unicode columns would e to set the collation to
ucs_basic collation type and lctype accordingly if exists.

Regards,
Dionisis

On Mon, 6 Feb 2023 at 01:19, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Dionisis Kontominas <dkontominas(at)gmail(dot)com> writes:
> > I suppose that affects the outcome of ORDER BY clauses on the field,
> > along with the content of the indexes. Is this right?
>
> Yeah.
>
> > Assuming that the requirement exists, to store UTF-8 characters on a
> > field that can be from multiple languages, and the database default
> > encoding is UTF8 which is the right thing I suppose (please verify), what
> > do you think should be the values of the Collation and Ctype for the
> > database to behave correctly?
>
> Um ... so define "correct". If you have a mishmash of languages in the
> same column, it's likely that they have conflicting rules about sorting,
> and there may be no ordering that's not surprising to somebody.
>
> If there's a predominant language in the data, selecting a collation
> matching that seems like your best bet. Otherwise, maybe you should
> just shrug your shoulders and stick with C collation. It's likely
> to be faster than any alternative.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ron 2023-02-06 00:57:13 Re: Question regarding UTF-8 data and "C" collation on definition of field of table
Previous Message Tom Lane 2023-02-06 00:19:01 Re: Question regarding UTF-8 data and "C" collation on definition of field of table