Re: UTF8 national character data type support WIP patch and list of open issues.

From:	"MauMau" <maumau307(at)gmail(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Boguk, Maksym" <maksymb(at)fast(dot)au(dot)fujitsu(dot)com>
Cc:	"Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 national character data type support WIP patch and list of open issues.
Date:	2013-09-16 12:49:52
Message-ID:	B1A7485194DE4FDAB8FA781AFB570079@maumau
Lists:	pgsql-hackers

Hello,

I think it would be nice for PostgreSQL to support national character types
largely because it should ease migration from other DBMSs.

[Reasons why we need NCHAR]
--------------------------------------------------
1. Invite users of other DBMSs to PostgreSQL. Oracle, SQL Server, MySQL,
etc. all have NCHAR support. PostgreSQL is probably the only major database
that does not support NCHAR.
Sadly, I've read a report from some Japanese government agency that the
number of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or
2011. I wouldn't say that is due to NCHAR support, but it might be one
reason. I want PostgreSQL to be more popular and regain those users.

2. Enhance the "open" image of PostgreSQL by implementing more features of
the SQL standard. NCHAR may be a wrong and unnecessary feature of the SQL
standard now that we have Unicode support, but it is defined in the standard
and widely implemented.

3. I have heard that some potential customers didn't adopt PostgreSQL due to
lack of NCHAR support. However, I don't know the exact reason why they need
NCHAR.

4. I guess some users really want to continue to use ShiftJIS or EUC_JP for
database encoding, and use NCHAR for a limited set of columns to store
international text in Unicode:
- to avoid code conversion between the server and the client for performance
- because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji)
than UTF-8 (3 bytes)
This use case is described in chapter 6 of "Oracle Database Globalization
Support Guide".
--------------------------------------------------
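
The storage argument in reason 4 can be checked with a small sketch. It
assumes Python's standard codec names (shift_jis, euc_jp, utf-8) and an
illustrative three-Kanji sample string; neither comes from the original
message, and it says nothing about how PostgreSQL itself stores the data.

```python
# Compare how many bytes the same Japanese text needs in each encoding.
# The codec names are Python's standard ones; the sample string is only
# an illustration chosen for this sketch.

def encoded_sizes(text):
    """Return the encoded byte length of `text` for each encoding."""
    encodings = ("shift_jis", "euc_jp", "utf-8")
    return {enc: len(text.encode(enc)) for enc in encodings}

sample = "\u65e5\u672c\u8a9e"  # three common Kanji (U+65E5, U+672C, U+8A9E)
sizes = encoded_sizes(sample)
print(sizes)
```

For these three Kanji, ShiftJIS and EUC_JP each need 2 bytes per character
and UTF-8 needs 3, which is the difference the list item refers to.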

I think we need to do the following:

[Minimum requirements]
--------------------------------------------------
1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically.
This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms
for CHAR/VARCHAR, and ignores N prefix. But this is not documented.

2. Declare national character support in the manual.
Item 1 alone is not sufficient because users don't want to depend on
undocumented behavior. This is exactly what the TODO item "national
character support" in the PostgreSQL TODO wiki is about.

3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so that:
- psql \d can display the user-specified data types.
- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as
CHAR/VARCHAR.
- additional features for NCHAR/NVARCHAR can be implemented in the future,
as described below.
--------------------------------------------------

[Optional requirements]
--------------------------------------------------
1. Implement client driver support, such as:
- NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as
specified in the SQL standard.
- national character methods (e.g. setNString, getNString,
setNCharacterStream) as specified in JDBC 4.0.
I think at first we can treat these national-character-specific features the
same as their CHAR/VARCHAR counterparts.

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
contain Unicode data.
I think it is sufficient at first that NCHAR/NVARCHAR columns can only be
used in UTF-8 databases and they store UTF-8 strings. This allows us to
reuse the input/output/send/recv functions and other infrastructure of
CHAR/VARCHAR. This is a reasonable compromise to avoid duplication and
minimize the first implementation of NCHAR support.

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
Fixed-width encoding may allow faster string manipulation as described in
Oracle's manual. But I'm not sure about this, because UTF-16 is not a real
fixed-width encoding due to supplementary characters.
--------------------------------------------------
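
The doubt raised in optional item 3 can be illustrated with a minimal
sketch. It assumes Python's utf-16-le codec and two sample characters picked
for this example (one inside the Basic Multilingual Plane, one supplementary
character); the helper name utf16_code_units is hypothetical.

```python
# Count the 16-bit code units a single character occupies in UTF-16.
# utf16_code_units is a hypothetical helper written for this sketch.

def utf16_code_units(ch):
    """Return the number of 16-bit code units used to encode `ch`."""
    # utf-16-le is used so that no byte order mark is added to the output.
    return len(ch.encode("utf-16-le")) // 2

bmp_char = "\u8a9e"            # U+8A9E, inside the Basic Multilingual Plane
supplementary = "\U00020BB7"   # U+20BB7, a supplementary (non-BMP) Kanji

print(utf16_code_units(bmp_char))
print(utf16_code_units(supplementary))
```

The supplementary character needs a surrogate pair, so it takes two code
units where the BMP character takes one. Code that indexes UTF-16 strings by
code unit therefore cannot assume one unit per character.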

I don't think it is good to implement NCHAR/NVARCHAR types as extensions
like contrib/citext, because NCHAR/NVARCHAR are basic types and need
client-side support. That is, client drivers need to be aware of the fixed
NCHAR/NVARCHAR OID values.

How do you think we should implement NCHAR support?

Regards
MauMau

In response to

Re: UTF8 national character data type support WIP patch and list of open issues. at 2013-09-04 14:28:42 from Tom Lane

Responses

Re: UTF8 national character data type support WIP patch and list of open issues. at 2013-09-17 12:43:14 from Arulappan, Arul Shaji
Re: UTF8 national character data type support WIP patch and list of open issues. at 2013-09-18 13:16:11 from Robert Haas
