Re: New "raw" COPY format

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-18 13:52:43
Message-ID: CACJufxGWet+n+E7-ymwMxA8cFPGc65CmBpxOfT_hi9OPnou3Gg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 16, 2024 at 2:37 PM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> On Wed, Oct 16, 2024, at 05:31, jian he wrote:
> > Hi.
> > I only checked 0001, 0002, 0003.
> > the raw format patch is v9-0016.
> > 003-0016 is a lot of small patches, maybe you can consolidate it to
> > make the review more easier.
>
> Thanks for reviewing.
>
> OK, I've consolidated the v9 0003-0016 into a single patch.
>

+ <refsect2>
+ <title>Raw Format</title>
+
+ <para>
+ This format option is used for importing and exporting files containing
+ unstructured text, where each line is treated as a single field. It is
+ ideal for data that does not conform to a structured, tabular format and
+ lacks delimiters.
+ </para>
+
+ <para>
+ In the <literal>raw</literal> format, each line of the input or output is
+ considered a complete value without any field separation. There are no
+ field delimiters, and all characters are taken literally. There is no
+ special handling for quotes, backslashes, or escape sequences. All
+ characters, including whitespace and special characters, are preserved
+ exactly as they appear in the file. However, it's important to note that
+ the text is still interpreted according to the specified
<literal>ENCODING</literal>
+ option or the current client encoding for input, and encoded using the
+ specified <literal>ENCODING</literal> or the current client
encoding for output.
+ </para>
+
+ <para>
+ When using this format, the <command>COPY</command> command must specify
+ exactly one column. Specifying multiple columns will result in an error.
+ If the table has multiple columns and no column list is provided, an error
+ will occur.
+ </para>
+
+ <para>
+ The <literal>raw</literal> format does not distinguish a
<literal>NULL</literal>
+ value from an empty string. Empty lines are imported as empty strings, not
+ as <literal>NULL</literal> values.
+ </para>
+
+ <para>
+ Encoding works the same as in the <literal>text</literal> and
<literal>CSV</literal> formats.
+ </para>
+
+ </refsect2>
+
+ <refsect2>
+ <title>Raw Format</title>
+
+ <para>
+ This format option is used for importing and exporting files containing
+ unstructured text, where each line is treated as a single field. It is
+ ideal for data that does not conform to a structured, tabular format and
+ lacks delimiters.
+ </para>
+
+ <para>
+ In the <literal>raw</literal> format, each line of the input or output is
+ considered a complete value without any field separation. There are no
+ field delimiters, and all characters are taken literally. There is no
+ special handling for quotes, backslashes, or escape sequences. All
+ characters, including whitespace and special characters, are preserved
+ exactly as they appear in the file. However, it's important to note that
+ the text is still interpreted according to the specified
<literal>ENCODING</literal>
+ option or the current client encoding for input, and encoded using the
+ specified <literal>ENCODING</literal> or the current client
encoding for output.
+ </para>
+
+ <para>
+ When using this format, the <command>COPY</command> command must specify
+ exactly one column. Specifying multiple columns will result in an error.
+ If the table has multiple columns and no column list is provided, an error
+ will occur.
+ </para>
+
+ <para>
+ The <literal>raw</literal> format does not distinguish a
<literal>NULL</literal>
+ value from an empty string. Empty lines are imported as empty strings, not
+ as <literal>NULL</literal> values.
+ </para>
+
+ <para>
+ Encoding works the same as in the <literal>text</literal> and
<literal>CSV</literal> formats.
+ </para>
+
+ </refsect2>
+
<refsect2 id="sql-copy-binary-format" xreflabel="Binary Format">
<title>Binary Format</title>

The <refsect2> <title>Raw Format</title> section is duplicated.
Also, the Raw Format section doesn't mention the special handling of
the end-of-data marker.
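
To make the documented semantics concrete, here is a minimal standalone
sketch of how I read the Raw Format text: one value per line, no
quote/backslash/delimiter processing, and an empty line gives an empty
string rather than NULL. This is not code from the patch. RawValue and
raw_split_lines are hypothetical names, only '\n' is treated as a line
terminator, and the end-of-data marker is deliberately not modeled since
the docs don't say how it is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type, not part of the patch: one raw-format value. */
typedef struct RawValue
{
    const char *start;          /* points into the input buffer */
    size_t      len;            /* 0 means empty string, never NULL */
} RawValue;

/*
 * Hypothetical helper: split buf into lines, one value per line.
 * No quote, backslash, or delimiter processing is done; every byte
 * between line terminators is kept as-is.
 *
 * Stores at most max_values entries and returns how many were stored.
 * Assumption: only '\n' terminates a line.
 */
static size_t
raw_split_lines(const char *buf, size_t buflen,
                RawValue *values, size_t max_values)
{
    size_t      n = 0;
    size_t      pos = 0;

    assert(buf != NULL && values != NULL);

    while (pos < buflen && n < max_values)
    {
        const char *line = buf + pos;
        const char *nl = memchr(line, '\n', buflen - pos);
        size_t      len = nl ? (size_t) (nl - line) : buflen - pos;

        values[n].start = line;
        values[n].len = len;
        n++;

        /* Skip the value and, if present, its terminating newline. */
        pos += len + (nl ? 1 : 0);
    }

    return n;
}
```

With the input "a\\tb\n\n\"q\"\n" (as a C literal) this yields three
values: the four bytes a, backslash, t, b taken literally, then an empty
string, then the three bytes of "q" with its quotes kept.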

+COPY copy_raw_test (col) FROM :'filename' RAW;
We may not need to support this syntax. We do not allow
COPY x from stdin text;
COPY x to stdout text;
so I think adding the RAW keyword in gram.y may not be necessary.

/* Complete COPY <sth> FROM|TO filename WITH (FORMAT */
else if (Matches("COPY|\\copy", MatchAny, "FROM|TO", MatchAny,
"WITH", "(", "FORMAT"))
COMPLETE_WITH("binary", "csv", "text");
In src/bin/psql/tab-complete.in.c, we can also add "raw" to this list.

/* --- ESCAPE option --- */
if (opts_out->escape)
{
if (opts_out->format != COPY_FORMAT_CSV)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
/*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
errmsg("COPY %s requires CSV mode", "ESCAPE")));
}
The ESCAPE option check has no regression test.

/* --- QUOTE option --- */
if (opts_out->quote)
{
if (opts_out->format != COPY_FORMAT_CSV)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
/*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
errmsg("COPY %s requires CSV mode", "QUOTE")));
}
The QUOTE option check has no regression test either.
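
As a rough illustration of which cases a test would need to cover, here
is a standalone model of the two checks quoted above. It is only a
sketch: OptSketchFormat and opt_sketch_check_csv_only are made-up names,
and the function returns false where the real code calls ereport(ERROR).
The message text is the one from the quoted errmsg() calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the COPY format enum in the patch. */
typedef enum OptSketchFormat
{
    OPT_SKETCH_TEXT,
    OPT_SKETCH_CSV,
    OPT_SKETCH_RAW
} OptSketchFormat;

/*
 * Hypothetical model of a CSV-only option check such as ESCAPE or QUOTE.
 *
 * Returns true if the option is acceptable for the given format.
 * Otherwise writes the error message into errbuf and returns false.
 */
static bool
opt_sketch_check_csv_only(OptSketchFormat format, bool option_given,
                          const char *optname,
                          char *errbuf, size_t errlen)
{
    assert(optname != NULL && errbuf != NULL && errlen > 0);

    errbuf[0] = '\0';

    if (option_given && format != OPT_SKETCH_CSV)
    {
        snprintf(errbuf, errlen, "COPY %s requires CSV mode", optname);
        return false;
    }

    return true;
}
```

The cases worth exercising are ESCAPE and QUOTE, each given with a
non-CSV format (text and raw), plus the pass-through case where the
option is not given at all.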

CopyOneRowTo
else if (cstate->opts.format == COPY_FORMAT_RAW)
{
int attnum;
Datum value;
bool isnull;
/* Ensure only one column is being copied */
if (list_length(cstate->attnumlist) != 1)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("COPY with format 'raw' must specify exactly one column")));
attnum = linitial_int(cstate->attnumlist);
value = slot->tts_values[attnum - 1];
isnull = slot->tts_isnull[attnum - 1];
if (!isnull)
{
char *string = OutputFunctionCall(&out_functions[attnum - 1],
value);
CopyAttributeOutRaw(cstate, string);
}
/* For RAW format, we don't send anything for NULL values */
}
The column count is already checked in BeginCopyTo, so is the
"if (list_length(cstate->attnumlist) != 1)" error check in
CopyOneRowTo still needed?
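
A minimal sketch of the shape I have in mind, assuming the check in
BeginCopyTo is the only place that can reject the column list. All names
here (SketchCopyState, sketch_begin_copy_to, sketch_copy_one_row_to) are
hypothetical and this is not the patch's code. The setup function
returns false where the real code would ereport(ERROR), and the per-row
function only asserts the invariant; in the backend that would be an
Assert() rather than a user-facing error. The sketch copies only the
value bytes and does not model the end-of-row terminator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the real COPY state. */
typedef enum SketchFormat
{
    SKETCH_FORMAT_TEXT,
    SKETCH_FORMAT_RAW
} SketchFormat;

typedef struct SketchCopyState
{
    SketchFormat format;
    size_t       num_columns;   /* length of the column list */
} SketchCopyState;

/*
 * Runs once per COPY, playing the role of BeginCopyTo.
 * Returns false where the real code would raise an error.
 */
static bool
sketch_begin_copy_to(const SketchCopyState *cstate)
{
    if (cstate->format == SKETCH_FORMAT_RAW && cstate->num_columns != 1)
        return false;
    return true;
}

/*
 * Runs once per row, playing the role of CopyOneRowTo for raw format.
 * Copies the value into out (truncating to outsize) and returns the
 * number of bytes written. Nothing is written for a NULL value.
 */
static size_t
sketch_copy_one_row_to(const SketchCopyState *cstate,
                       const char *value, bool isnull,
                       char *out, size_t outsize)
{
    size_t      len;

    /* Invariant established at setup time; no runtime error path here. */
    assert(cstate->format != SKETCH_FORMAT_RAW || cstate->num_columns == 1);

    if (isnull)
        return 0;

    len = strlen(value);
    if (len > outsize)
        len = outsize;
    memcpy(out, value, len);
    return len;
}
```

The point is only that the per-row path stays free of the error check
once setup has enforced the single-column rule.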
