Re: UTF8 national character data type support WIP patch and list of open issues.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, "Boguk, Maksym" <maksymb(at)fast(dot)au(dot)fujitsu(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Date: 2013-09-18 17:18:00
Message-ID: 592.1379524680@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
>> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
>> contain Unicode data.
>> ...
>> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
>> Fixed-width encoding may allow faster string manipulation as described in
>> Oracle's manual. But I'm not sure about this, because UTF-16 is not a real
>> fixed-width encoding due to supplementary characters.

> It seems to me that these two points here are the real core of your
> proposal. The rest is just syntactic sugar.

> Let me start with the second one: I don't think there's likely to be
> any benefit in using UTF-16 as the internal encoding. In fact, I
> think it's likely to make things quite a bit more complicated, because
> we have a lot of code that assumes that server encodings have certain
> properties that UTF-16 doesn't - specifically, that any byte with the
> high-bit clear represents the corresponding ASCII character.

Another point to keep in mind is that UTF16 is not really any easier
to deal with than UTF8, unless you write code that fails to support
characters outside the basic multilingual plane. Which is a restriction
I don't believe we'd accept. But without that restriction, you're still
forced to deal with variable-width characters; and there's nothing very
nice about the way that's done in UTF16. So on the whole I think it
makes more sense to use UTF8 for this.

I share Robert's misgivings about difficulties in dealing with characters
that are not representable in the database's principal encoding. Still,
you probably won't find out about many of those until you try it.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2013-09-18 17:39:04 Re: record identical operator
Previous Message Stephen Frost 2013-09-18 17:04:21 Re: record identical operator