From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Sutou Kouhei <kou(at)clear-code(dot)com>, Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> |
Subject: | Re: confusing / inefficient "need_transcoding" handling in copy |
Date: | 2024-02-08 08:25:07 |
Message-ID: | 7cdf4273-d7bc-4e99-aef6-6e7fdc4ecb9a@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 08/02/2024 09:05, Michael Paquier wrote:
> On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote:
>> I think the code is just very confusing - there actually *is* verification of
>> the encoding, it just happens at a different, earlier, layer, namely in
>> copyfromparse.c: CopyConvertBuf() which says:
>> /*
>> * If the file and server encoding are the same, no encoding conversion is
>> * required. However, we still need to verify that the input is valid for
>> * the encoding.
>> */
>>
>> And does indeed verify that.
>
> This has been switched by Heikki in f82de5c46bdf, in 2021, that has
> removed pg_database_encoding_max_length() in the COPY FROM case.
> (Adding him now in CC, in case he has any comments).
Yeah, I agree the "pg_database_encoding_max_length() > 1" check in COPY
TO is unnecessary. It's always been like that, but now that the COPY TO
and COPY FROM cases don't share the code, it's more obvious. Let's
remove it.
On your patch:
> + * Set up encoding conversion info, validating data if server and
> + * client encodings are not the same (see also pg_server_to_any).
There's no validation, just conversion. I'd suggest:
"Set up encoding conversion info if the file and server encodings differ
(see also pg_server_to_any)."
Other than that, +1
> That's a performance-only change, but there may be a good argument for
> backpatching something, perhaps?
-1 for backpatching, out of principle. This is not a regression, it's
always been like that. Seems innocent, but why is this different from
any other performance improvement we make.
BTW, I can see an optimization opportunity even if the encodings differ:
Currently, CopyAttributeOutText() calls pg_server_to_any(), and then
grovels through the string to find any characters that need to be
quoted. You could do it the other way round and handle quoting before
the conversion. That has two benefits:
1. You don't need the strlen() call, because you just scanned through
the string so you already know its length.
2. You don't need to worry about 'encoding_embeds_ascii' when you
operate on the server encoding.
--
Heikki Linnakangas
Neon (https://neon.tech)
From | Date | Subject | |
---|---|---|---|
Next Message | Sutou Kouhei | 2024-02-08 08:29:46 | Re: confusing / inefficient "need_transcoding" handling in copy |
Previous Message | Sutou Kouhei | 2024-02-08 08:25:01 | Re: confusing / inefficient "need_transcoding" handling in copy |