Re: \COPY to accept non UTF-8 chars in CHAR columns

From: Matthias Apitz <guru(at)unixarea(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Matthias Apitz <guru(at)unixarea(dot)de>, "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: \COPY to accept non UTF-8 chars in CHAR columns
Date: 2020-03-28 09:40:11
Message-ID: 20200328094011.GA2961@c720-r342378
Lists: pgsql-general

On Saturday, March 28, 2020 at 09:40:30 a.m. +1300, Thomas Munro wrote:

> On Sat, Mar 28, 2020 at 4:46 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Matthias Apitz <guru(at)unixarea(dot)de> writes:
> > > In short, it there a way to let \COPY accept such broken ISO bytes, just
> > > complaining about, but not stopping the insert of the row?
> >
> > No. We don't particularly believe in the utility of invalid data.
> >
> > If you don't actually care about what encoding your data is in,
> > you could use SQL_ASCII as the database "encoding" and thereby
> > disable all UTF8-specific behavior. Otherwise, maybe this conversion
> > is a good time to clean up the mess?
>
> Something like this approach might be useful for fixing the CSV file:
>
> https://codereview.stackexchange.com/questions/185821/convert-a-mix-of-latin-1-and-utf-8-to-proper-utf-8
>
> I haven't tested that program but it looks like the right sort of
> approach; I remember writing similar logic to untangle the strange
> mixtures of Latin 1, Windows 1252, and UTF-8 that late 90s browsers
> used to send. That sort of approach can't fix every theoretical
> problem (some valid Latin1 sequences are also valid UTF-8 sequences)
> but it's doable with text in European languages.

Thomas,

This is a very good find, thanks.

I played around a bit with the original code and tested some
modifications to better fit our problem. It works fine:

- it translates any ISO 8859-1 character, for example 0xfc (German umlaut ü in ISO), into the
correct UTF-8 encoding 0xc3 0xbc:

perl -e 'print pack("H*", "4040fc4040")' | ./convert2properUTF8 | hexdump -C
00000000 40 40 c3 bc 40 40 |@@..@@|
00000006
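For reference, those two output bytes follow from the standard Latin-1-to-UTF-8 bit layout: a byte b >= 0x80 becomes the pair 0xC0 | (b >> 6) and 0x80 | (b & 0x3F). A minimal sketch of that arithmetic (Python here, purely for illustration; this is not the code of the actual program):

```python
def latin1_byte_to_utf8(b: int) -> bytes:
    """Encode one Latin-1 code point (0x00-0xFF) as UTF-8."""
    if b < 0x80:
        # plain ASCII passes through unchanged
        return bytes([b])
    # two-byte UTF-8 sequence: 110xxxxx 10xxxxxx
    return bytes([0xC0 | (b >> 6), 0x80 | (b & 0x3F)])

# 0xfc (Latin-1 ü) becomes 0xc3 0xbc, matching the hexdump above
assert latin1_byte_to_utf8(0xFC) == b"\xc3\xbc"
```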

- it translates the situation where 0xc3 0xbc (German umlaut ü in UTF-8
encoding) was broken across two columns, one terminating in 0xc3 and the 2nd
row starting with 0xbc; this gives:

perl -e 'print pack("H*", "c3")' | ./convert2properUTF8 | hexdump -C
00000000 c3 83 |..|
00000002
perl -e 'print pack("H*", "bc40")' | ./convert2properUTF8 | hexdump -C
00000000 c2 bc 40 |..@|
00000003

i.e. the 0xc3 is translated to 0xc3 0x83 and the 2nd half, the 0xbc, to
0xc2 0xbc. Both translations have nothing to do with the original split 0xc3 0xbc;
perhaps in this case it would be better to emit a placeholder byte, e.g. 0x40,
for each of the bytes which formed the 0xc3 0xbc.
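The overall filter logic being discussed can be sketched as a short heuristic: copy well-formed UTF-8 sequences through unchanged and re-encode any remaining byte as Latin-1. A rough sketch under that assumption (Python, not the actual convert2properUTF8 code), which reproduces all three hexdump results above:

```python
def to_utf8(data: bytes) -> bytes:
    """Pass well-formed UTF-8 through; re-encode stray bytes as Latin-1."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                        # ASCII: copy as-is
            out.append(b)
            i += 1
            continue
        # expected UTF-8 sequence length for this lead byte
        n = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        seq = data[i:i + n]
        try:
            seq.decode("utf-8")             # already valid UTF-8: keep it
            out += seq
            i += n
        except UnicodeDecodeError:          # stray byte: treat as Latin-1
            out += bytes([b]).decode("latin-1").encode("utf-8")
            i += 1
    return bytes(out)
```

As noted above, the split 0xc3 / 0xbc case still comes out as 0xc3 0x83 and 0xc2 0xbc, because each half looks like a lone Latin-1 byte to this heuristic; and, as Thomas points out, byte sequences that are valid in both encodings cannot be disambiguated at all.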

But we will discuss this internally and adapt the code to our use cases.

Thanks again

matthias

--
Matthias Apitz, ✉ guru(at)unixarea(dot)de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
