| From: | "John Hansen" <john(at)geeknet(dot)com(dot)au> |
|---|---|
| To: | "Hackers" <pgsql-hackers(at)postgresql(dot)org> |
| Cc: | "Patches" <pgsql-patches(at)postgresql(dot)org> |
| Subject: | Re: UNICODE characters above 0x10000 |
| Date: | 2004-08-08 03:11:24 |
| Message-ID: | 5066E5A966339E42AA04BA10BA706AE56176@rodrick.geeknet.com.au |
| Lists: | pgsql-hackers pgsql-patches |
> Well, this is still working at the wrong level. The code
> that's in pg_verifymbstr is mainly intended to enforce the
> *system wide* assumption that multibyte characters must have
> the high bit set in every byte. (We do not support encodings
> without this property in the backend, because it breaks code
> that looks for ASCII characters ... such as the main
> parser/lexer ...) It's not really intended to check that the
> multibyte character is actually legal in its encoding.
>
Ok, point taken.
> The "special UTF-8 check" was never more than a very
> quick-n-dirty hack that was in the wrong place to start with.
> We ought to be getting rid of it not institutionalizing it.
> If you want an exact encoding-specific check on the
> legitimacy of a multibyte sequence, I think the right way to
> do it is to add another function pointer to pg_wchar_table
> entries to let each encoding have its own check routine.
> Perhaps this could be defined so as to avoid a separate call
> to pg_mblen inside the loop, and thereby not add any new
> overhead. I'm thinking about an API something like
>
> int validate_mbchar(const unsigned char *str, int len)
>
> with result +N if a valid character N bytes long is present
> at *str, and -N if an invalid character is present at *str
> and it would be appropriate to display N bytes in the complaint.
> (N must be <= len in either case.) This would reduce the
> main loop of pg_verifymbstr to a call of this function and an
> error-case-handling block.
>
Sounds like a plan...
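
For illustration only, here is a rough sketch of what a UTF-8 instance of the proposed check routine might look like, together with the reduced main loop. The names validate_utf8_char and verify_mbstr, and the bad_off/bad_len out-parameters, are hypothetical and not taken from the PostgreSQL source. The check is structural only (lead byte plus continuation bytes) and assumes the caller passes len >= 1; it does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/*
 * Hypothetical UTF-8 instance of the proposed validate_mbchar API.
 * Returns +N if a structurally valid character N bytes long starts at str,
 * and -N if the character is invalid and N bytes should be shown in the
 * complaint. N never exceeds len. Assumes len >= 1.
 */
int
validate_utf8_char(const unsigned char *str, int len)
{
    unsigned char c = str[0];
    int need;
    int i;

    if (c < 0x80)
        return 1;               /* plain ASCII */
    else if ((c & 0xE0) == 0xC0)
        need = 2;
    else if ((c & 0xF0) == 0xE0)
        need = 3;
    else if ((c & 0xF8) == 0xF0)
        need = 4;               /* covers characters at and above 0x10000 */
    else
        return -1;              /* stray continuation or invalid lead byte */

    if (need > len)
        return -len;            /* truncated sequence: show what is there */

    for (i = 1; i < need; i++)
    {
        /* every trailing byte must look like 10xxxxxx */
        if ((str[i] & 0xC0) != 0x80)
            return -need;
    }
    return need;
}

/*
 * Sketch of the reduced main loop: one call per character plus an
 * error-handling block. Returns 1 if the whole buffer passes; otherwise
 * returns 0 and reports the offset and display length of the bad character.
 */
int
verify_mbstr(const unsigned char *str, int len, int *bad_off, int *bad_len)
{
    int off = 0;

    while (off < len)
    {
        int n = validate_utf8_char(str + off, len - off);

        if (n < 0)
        {
            *bad_off = off;
            *bad_len = -n;
            return 0;
        }
        off += n;
    }
    return 1;
}
```

In a real backend the per-encoding routine would presumably be reached through a function pointer in the pg_wchar_table entry, as suggested above, rather than being called directly.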
> regards, tom lane
>
>
Regards,
John Hansen