Re: confusing / inefficient "need_transcoding" handling in copy

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Sutou Kouhei <kou(at)clear-code(dot)com>
Cc: andres(at)anarazel(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, ishii(at)sraoss(dot)co(dot)jp
Subject: Re: confusing / inefficient "need_transcoding" handling in copy
Date: 2024-12-10 04:59:25
Message-ID: Z1fKrTkT-eIVAK7F@paquier.xyz
Lists: pgsql-hackers

On Fri, Dec 06, 2024 at 04:20:42PM +0900, Sutou Kouhei wrote:
> (Do you think that this patch is still needed?)

This thread had fallen off my radar; my apologies for that.

Yes, I think that expanding these tests is a good thing. Let's
take one step at a time. I have a couple of comments.

+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT: COPY test, line 1
+DROP TABLE test;

COPY uses client_encoding when the ENCODING option is not specified.
Perhaps more tests should be added where this value is set with
SET client_encoding?

Another addition would be valid conversions back and forth. For
example, I recall that LATIN1 accepts any byte sequence and can be
converted to UTF-8, so we could use it to expand the proposed tests a
bit more? Or something like that?
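
To make the byte-level reasoning concrete, here is a small sketch
outside of PostgreSQL. It uses Python's standard codecs as a stand-in
for the server's conversion routines, which is an assumption: the two
implementations are not the same code, so this only illustrates why
0xe3 0x81 is rejected as EUC_JP while LATIN1 takes the same bytes.

```python
# Byte-level illustration only. Python's codecs stand in for PostgreSQL's
# encoding routines here; that is an assumption of this sketch.
data = "\u3042".encode("utf-8")  # U+3042 HIRAGANA LETTER A
print(data.hex(" "))             # e3 81 82

# EUC_JP: 0xe3 is a valid lead byte, but 0x81 is not a valid trail byte,
# so strict decoding of the UTF-8 bytes fails.
try:
    data.decode("euc_jp")
    euc_jp_ok = True
except UnicodeDecodeError:
    euc_jp_ok = False

# LATIN1: every byte value maps to a code point, so decoding cannot fail,
# and the result can always be re-encoded as UTF-8.
as_latin1 = data.decode("latin-1")
back_to_utf8 = as_latin1.encode("utf-8")
print(euc_jp_ok, len(as_latin1), back_to_utf8.hex(" "))
```

The LATIN1 path succeeds but yields three characters, not the original
one, so a test built on it would check that the conversion is accepted,
not that the text round-trips.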

The test as proposed is not going to be portable across the buildfarm.
The CI spotted two failures (there may be others):
1) For Windows, as in the following regression.diffs:
COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+ERROR: character with byte sequence 0xe3 0x81 0x82 in encoding "UTF8" has no equivalent in encoding "WIN1252"
2) For Linux, with 32-bit builds:
COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+ERROR: conversion between UTF8 and SQL_ASCII is not supported
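
The Windows failure can be illustrated the same way. This is a sketch
that assumes Python's cp1252 codec is a fair stand-in for WIN1252; the
helper can_encode() is a hypothetical name used only for this example.
The SQL_ASCII case has no equivalent here, since the server applies no
conversion at all for that encoding.

```python
# Illustration only: Python's "cp1252" codec is assumed to approximate
# PostgreSQL's WIN1252 for this one character.
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

ch = "\u3042"  # U+3042 HIRAGANA LETTER A
in_utf8 = can_encode(ch, "utf-8")
in_win1252 = can_encode(ch, "cp1252")
print(in_utf8, in_win1252)
```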

This should likely be made conditional on the database encoding
being UTF8? There are a couple of examples like that in the tree,
based on the following SQL trick:
SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
\if :skip_test
\quit
\endif

This requires an alternate expected output file for the non-UTF8 case.
--
Michael
