From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Sutou Kouhei <kou(at)clear-code(dot)com> |
Cc: | andres(at)anarazel(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, ishii(at)sraoss(dot)co(dot)jp |
Subject: | Re: confusing / inefficient "need_transcoding" handling in copy |
Date: | 2024-12-10 04:59:25 |
Message-ID: | Z1fKrTkT-eIVAK7F@paquier.xyz |
Lists: | pgsql-hackers |
On Fri, Dec 06, 2024 at 04:20:42PM +0900, Sutou Kouhei wrote:
> (Do you think that this patch is still needed?)
This thread has fallen off my radar; my apologies for that.
Yes, I think that expanding these tests is a good thing. Let's
take one step at a time. I have a couple of comments.
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT: COPY test, line 1
+DROP TABLE test;
COPY uses client_encoding when the ENCODING option is not
specified. Perhaps more tests should be added with this value set
by a SET client_encoding?
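A minimal sketch of such a test, assuming the "test" table and the utf8_csv psql variable from the quoted hunk are still in place (their definitions are not shown here) and that the file holds the UTF8 bytes written by the first COPY. With no ENCODING option, COPY FROM should interpret the file using client_encoding, so this would be expected to fail in the same way as the ENCODING 'EUC_JP' case:

```sql
-- Sketch only: assumes "test" and :'utf8_csv' exist as in the quoted hunk.
-- No ENCODING option is given, so COPY falls back to client_encoding.
SET client_encoding TO 'EUC_JP';
COPY test FROM :'utf8_csv' WITH (FORMAT csv);
RESET client_encoding;
```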
Another one would be valid conversions back and forth. For example,
I recall that LATIN1 accepts any bytes and can apply a conversion to
UTF-8, so we could use it and expand the proposed tests a bit more?
Or something like that?
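One possible shape for such a round trip, as a sketch rather than a finished test: latin1_csv is a hypothetical psql variable pointing at a scratch file, "test" is assumed to be the table from the quoted hunk, and a UTF8 database is assumed. U+00E9 is used because it has an equivalent in LATIN1:

```sql
-- Sketch only: :'latin1_csv' is a hypothetical path variable; "test" and
-- :'utf8_csv' are assumed to exist as in the quoted hunk.
-- U+00E9 LATIN SMALL LETTER E WITH ACUTE
COPY (SELECT E'\u00e9') TO :'latin1_csv' WITH (FORMAT csv, ENCODING 'LATIN1');
COPY test FROM :'latin1_csv' WITH (FORMAT csv, ENCODING 'LATIN1');
-- Reading the UTF8 file as LATIN1 should not raise an encoding error,
-- as each of its bytes maps to some LATIN1 character.
COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'LATIN1');
SELECT * FROM test;
```

The last COPY loads three separate characters rather than U+3042, so its rows may not display nicely in an expected output file.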
The test as proposed is not going to be portable across the buildfarm.
The CI has spotted two failures (there may be others):
1) For Windows, as in the following regression.diffs:
COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+ERROR: character with byte sequence 0xe3 0x81 0x82 in encoding "UTF8" has no equivalent in encoding "WIN1252"
2) Second failure on Linux, with 32-bit builds:
COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+ERROR: conversion between UTF8 and SQL_ASCII is not supported
Likely, this should be made conditional, as the database needs to
support UTF8? There are a couple of examples like that in the tree,
based on the following SQL trick:
SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
\if :skip_test
\quit
\endif
This requires an alternate expected output file for the non-UTF8 case.
--
Michael