Re: New "raw" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "jian he" <jian(dot)universality(at)gmail(dot)com>
Cc: "Tatsuo Ishii" <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-18 17:24:34
Message-ID: 7ddb1520-fa26-43d1-b122-adcea63e0602@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 18, 2024, at 15:52, jian he wrote:
> <refsect2> <title>Raw Format</title> is duplicated
> <title>Raw Format</title> didn't mention the special handling of
> end-of-data marker.

Thanks for reviewing, above fixed.

Here is a summary of the changes since v10, thanks to the feedback:

Handling of e.g. JSON and other structured text files that could contain
newlines, in a seamless way seems important, so therefore the default is
no delimiter for the raw format, so that the entire input is read as one data
value for COPY FROM, and all column data is concatenated without delimiter
for COPY TO.

When specifying a delimiter for the raw format, it separates *rows*, and can be
a multi-byte string, such as E'\r\n' to handle Windows text files.

This has been documented under the DELIMITER option, as well as under the
Raw Format section.

This also means that HEADER cannot be supported for RAW, since where there is
no delimiter, there would be no way to tell when the header line ends.

For flexibility when exporting data, there is no restriction on what characters
the column data can contain, which has been documented in this way:

When using COPY TO with raw format and a specified DELIMITER, there is no check
to prevent data values from containing the delimiter string, which could be
problematic if it would be needed to import the data preserved using COPY FROM,
since a data value containing the delimiter would then be split into two values.
If this is a concern, a different format should be used instead.

The refactoring is now in a separate first single commit, which seems
necessary, to separate the new functionality, from the refactoring.

Here are two examples on usage:

1. Example of importing/exporting entire JSON file

% cat test.json
[
{
"id" : 1,
"t_test" : "here's a \"string\""
}
]
% psql
CREATE TABLE t (c jsonb);
\COPY t FROM test.json (FORMAT raw);
SELECT * FROM t;
c
----------------------------------------------
[{"id": 1, "t_test": "here's a \"string\""}]
(1 row)
\COPY t TO test.json (FORMAT raw);
% cat test.json
[{"id": 1, "t_test": "here's a \"string\""}]%

Note: the ending "%" is just the terminal indicating there is no newline at the end,
which is intended, since there is no delimiter specified.

2. Example of importing/exporting JSONL (newline-delimited JSON)

% cat log.jsonl
{"timestamp": "2024-10-17T09:15:30Z", "level": "INFO"}
{"timestamp": "2024-10-17T09:16:10Z", "level": "ERROR"}
{"timestamp": "2024-10-17T09:17:45Z", "level": "WARNING"}
% psql
\COPY t FROM log.jsonl (FORMAT raw, DELIMITER E'\n');
SELECT * FROM t;
c
-----------------------------------------------------------------------------------------------------
{"level": "INFO", "timestamp": "2024-10-17T09:15:30Z"}
{"level": "ERROR", "timestamp": "2024-10-17T09:16:10Z"}
{"level": "WARNING", "timestamp": "2024-10-17T09:17:45Z"}
(3 rows)
\COPY t TO log.jsonl (FORMAT raw, DELIMITER E'\n');
% cat log.jsonl
{"level": "INFO", "timestamp": "2024-10-17T09:15:30Z"}
{"level": "ERROR", "timestamp": "2024-10-17T09:16:10Z"}
{"level": "WARNING", "timestamp": "2024-10-17T09:17:45Z"}

> +COPY copy_raw_test (col) FROM :'filename' RAW;
> we may need to support this.
> since we not allow
> COPY x from stdin text;
> COPY x to stdout text;
> so I think adding the RAW keyword in gram.y may not be necessary.

Nice, I didn't know text was not supported either,
so then it seems fine to only support the new syntax.

RAW keyword and the grammar removed.

> /* Complete COPY <sth> FROM|TO filename WITH (FORMAT */
> else if (Matches("COPY|\\copy", MatchAny, "FROM|TO", MatchAny,
> "WITH", "(", "FORMAT"))
> COMPLETE_WITH("binary", "csv", "text");
> src/bin/psql/tab-complete.in.c, we can also add "raw".

Nice, added.

> escape option no regress test.

Regress tests added for both ESCAPE and QUOTE, checking that they cannot be used
with TEXT, RAW nor BINARY.

> We already did column length checking at BeginCopyTo.
> no need to "if (list_length(cstate->attnumlist) != 1)" error check in
> CopyOneRowTo?

Hmm, not sure really, since DoCopy() calls both BeginCopyTo()
and DoCopyTo() which in turn calls CopyOneRowTo(),
but CopyOneRowTo() is also being called from copy_dest_receive().

/Joel

Attachment Content-Type Size
v11-0001-Refactor-ProcessCopyOptions-introduce-CopyFormat-enu.patch application/octet-stream 31.3 KB
v11-0002-Add-raw-format-to-COPY-command.patch application/octet-stream 43.7 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-10-18 17:43:45 Re: ECPG Refactor: move sqlca variable in ecpg_log()
Previous Message Nathan Bossart 2024-10-18 17:22:26 Re: Use more CppAsString2() in pg_amcheck.c