Issue with loading unicode characters with copy command

From: Kiran K V <kirankv(dot)1982(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Issue with loading unicode characters with copy command
Date: 2024-01-12 15:23:06
Message-ID: CAGv0ivrN1eEwSiaJ+Ey0fVPuGg47rO1u3gUWtm3Taw1+3qzRRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

I have a UTF8 database and simple table with two columns (integer and
varchar). Created a csv file with some multibyte characters and trying to
perform load operation using the copy command.

Database info:

Postgresql database details:

Name | Owner | Encoding | Collate | Ctype
| Access privileges

-----------+----------+----------+--------------------+--------------------+-----------------------

postgres | postgres | UTF8 | English_India.1252 | English_India.1252 |

(Note: I also tried with collate utf8 and no luck)

postgres=# set client_encoding='UTF8';

SET

Table:

create table public.test ( PKCOL integer not null, STR1 character
varying(64) null, primary key( PKCOL ))

csv contents:

1|"àáâãäåæçèéêëìíîï"

After data loading, actual data is becoming

à áâãäåæçèéêëìÃîï

hex of this is - c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af

The hex values are indeed the UTF-8 encodings of the characters in your
expected string, and the presence of `C2` before each character is
indicative of how UTF-8 represents certain characters.

In UTF-8, characters from the extended Latin set (like `à`, `á`, `â`, etc.)
are represented as two bytes. The first byte `C2` or `C3` indicates that
this is a two-byte character, and the second byte specifies the character.
For example:

- `à` is represented as `C3 A0`

- `á` is `C3 A1`

- `â` is `C3 A2`, and so on.

In this case, the `C2` byte is getting interpreted as a separate character
and that is the likely reason that an `Â` (which corresponds to `C2`) is
seen before each intended character. Looks like UTF-8 encoded data is
mistakenly interpreted as Latin-1 (ISO-8859-1) or Windows-1252, where each
byte is treated as a separate character.

Please advise. Thank you very much.

Regards,

Kiran

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ron Johnson 2024-01-12 15:48:29 Re: How much size saved by updating column to NULL ?
Previous Message David G. Johnston 2024-01-12 13:58:48 Re: How much size saved by updating column to NULL ?