Re: UNICODE characters above 0x10000

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "John Hansen" <john(at)geeknet(dot)com(dot)au>
Cc: "Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: UNICODE characters above 0x10000
Date: 2004-08-08 02:28:50
Message-ID: 4964.1091932130@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

"John Hansen" <john(at)geeknet(dot)com(dot)au> writes:
> Ahh, but that's not the case. You cannot just delete the check, since
> not all combinations of bytes are valid UTF8. UTF bytes FE & FF never
> appear in a byte sequence for instance.

Well, this is still working at the wrong level. The code that's in
pg_verifymbstr is mainly intended to enforce the *system wide*
assumption that multibyte characters must have the high bit set in
every byte. (We do not support encodings without this property in
the backend, because it breaks code that looks for ASCII characters
... such as the main parser/lexer ...) It's not really intended to
check that the multibyte character is actually legal in its encoding.
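
As a rough illustration of that system-wide assumption, here is a minimal sketch of a high-bit check over one multibyte character. The function name mbchar_high_bits_set is hypothetical and this is not the actual pg_verifymbstr code; it assumes the caller has already obtained the character length (for example from pg_mblen).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the real backend code: returns 1 if every byte
 * of the mblen-byte multibyte character at str has its high bit set, and
 * 0 otherwise.  It says nothing about whether the sequence is legal in
 * its encoding.
 */
int
mbchar_high_bits_set(const unsigned char *str, int mblen)
{
    int         i;

    assert(str != NULL && mblen > 0);

    for (i = 0; i < mblen; i++)
    {
        /* a byte below 0x80 could be mistaken for an ASCII character */
        if ((str[i] & 0x80) == 0)
            return 0;
    }
    return 1;
}
```

A sequence such as C3 29 fails this check because the second byte looks like an ASCII ')', which is exactly the kind of byte a lexer scanning for ASCII characters would trip over.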

The "special UTF-8 check" was never more than a very quick-n-dirty hack
that was in the wrong place to start with. We ought to be getting rid
of it, not institutionalizing it. If you want an exact encoding-specific
check on the legitimacy of a multibyte sequence, I think the right way
to do it is to add another function pointer to pg_wchar_table entries to
let each encoding have its own check routine. Perhaps this could be
defined so as to avoid a separate call to pg_mblen inside the loop, and
thereby not add any new overhead. I'm thinking about an API something
like

int validate_mbchar(const unsigned char *str, int len)

with result +N if a valid character N bytes long is present at
*str, and -N if an invalid character is present at *str and
it would be appropriate to display N bytes in the complaint.
(N must be <= len in either case.) This would reduce the main
loop of pg_verifymbstr to a call of this function and an
error-case-handling block.
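
To make the proposed contract concrete, here is a minimal sketch of what a UTF-8 routine with that signature and the reduced main loop might look like. The names utf8_validate_mbchar and verify_mbstr are hypothetical, the validity rules follow the modern UTF-8 definition (which is an assumption, not something settled in this thread), and a real loop would raise an error showing the offending bytes rather than return 0.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical UTF-8 instance of the proposed API: returns +N if a valid
 * character N bytes long starts at str, or -N if the character is invalid
 * and N bytes should be shown in the complaint.  N <= len in either case.
 */
int
utf8_validate_mbchar(const unsigned char *str, int len)
{
    unsigned char b = str[0];
    int         n;
    int         i;

    assert(len > 0);

    if (b < 0x80)
        return 1;               /* plain ASCII */
    else if (b >= 0xC2 && b <= 0xDF)
        n = 2;
    else if (b >= 0xE0 && b <= 0xEF)
        n = 3;
    else if (b >= 0xF0 && b <= 0xF4)
        n = 4;                  /* characters at or above 0x10000 */
    else
        return -1;              /* stray continuation byte, C0/C1, F5..FF */

    if (n > len)
        return -len;            /* truncated: show what is there */

    for (i = 1; i < n; i++)
    {
        if ((str[i] & 0xC0) != 0x80)
            return -n;          /* not a continuation byte */
    }

    /* reject overlong forms, surrogates, and values above 0x10FFFF */
    if ((b == 0xE0 && str[1] < 0xA0) ||
        (b == 0xED && str[1] >= 0xA0) ||
        (b == 0xF0 && str[1] < 0x90) ||
        (b == 0xF4 && str[1] >= 0x90))
        return -n;

    return n;
}

/*
 * Sketch of the reduced main loop: one validator call per character plus
 * an error case.  Returns 1 if the whole string is valid, 0 otherwise.
 */
int
verify_mbstr(const unsigned char *str, int len)
{
    while (len > 0)
    {
        int         n = utf8_validate_mbchar(str, len);

        if (n < 0)
            return 0;           /* real code would report -n bytes at str */
        str += n;
        len -= n;
    }
    return 1;
}
```

Because the validator returns the character length itself, the loop never needs a separate length lookup, which is the overhead-avoidance point made above.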

regards, tom lane
