Re: Almost bug in COPY FROM processing of GB18030 encoded input

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Almost bug in COPY FROM processing of GB18030 encoded input
Date:	2019-01-25 12:56:27
Message-ID:	2bbaeb05-5aab-49ed-b5d0-0860e6f3eb7c@iki.fi
Lists:	pgsql-hackers

On 24/01/2019 23:27, Robert Haas wrote:
> On Wed, Jan 23, 2019 at 6:23 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> I happened to notice that when CopyReadLineText() calls mblen(), it
>> passes only the first byte of the multi-byte characters. However,
>> pg_gb18030_mblen() looks at the first and the second byte.
>> CopyReadLineText() always passes \0 as the second byte, so
>> pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded
>> characters as 2.
>>
>> It works out fine, though, because the second half of the 4-byte encoded
>> character always looks like another 2-byte encoded character, in
>> GB18030. CopyReadLineText() is looking for delimiter and escape
>> characters and newlines, and only single-byte characters are supported
>> for those, so treating a 4-byte character as two 2-byte characters is
>> harmless.
>
> Yikes.

Committed the comment changes, so it's less of a gotcha now.

- Heikki
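
To make the quoted explanation concrete, here is a minimal sketch in C of the length rule it describes. This is a simplified model, not the PostgreSQL source: the `sketch_` names are hypothetical, and the helper that passes the first byte followed by \0 only imitates the calling pattern described for CopyReadLineText().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model of the GB18030 length rule described above: the
 * second byte decides between the 2-byte and the 4-byte form.
 */
static int
sketch_gb18030_mblen(const unsigned char *s)
{
	if ((s[0] & 0x80) == 0)
		return 1;				/* ASCII */
	if (s[1] >= 0x30 && s[1] <= 0x39)
		return 4;				/* second byte is an ASCII digit */
	return 2;
}

/*
 * Hypothetical helper: imitate a caller that hands over only the first
 * byte, so the "second byte" seen by the length function is always \0.
 */
static int
sketch_mblen_first_byte_only(unsigned char first)
{
	unsigned char tmp[2];

	tmp[0] = first;
	tmp[1] = '\0';
	return sketch_gb18030_mblen(tmp);
}

/*
 * Walk buf[0..len) using first-byte-only lengths and count how many
 * "characters" the scan believes it has stepped over.
 */
static int
sketch_count_steps(const unsigned char *buf, size_t len)
{
	size_t		pos = 0;
	int			steps = 0;

	assert(buf != NULL);
	while (pos < len)
	{
		pos += (size_t) sketch_mblen_first_byte_only(buf[pos]);
		steps++;
	}
	return steps;
}
```

Given a four-byte sequence such as 0x81 0x30 0x81 0x30, the full rule reports a length of 4, while the first-byte-only variant reports 2 and the scan takes two steps of 2 bytes each. Either way the scan resumes on the same byte boundary, and it never inspects the digit-range second and fourth bytes as standalone characters, which is why the miscount is harmless when looking for single-byte delimiters.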

In response to

Re: Almost bug in COPY FROM processing of GB18030 encoded input at 2019-01-24 21:27:11 from Robert Haas
