New "single" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: New "single" COPY format
Date: 2024-11-07 16:15:10
Message-ID: 1db18e33-f1cf-4f2c-9d52-b6d7ff242295@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

Thread [1] renamed, since the format name has now been changed from 'raw' to
'single', as suggested by Andrew Dunstan and Jacob Champion.

[1] https://postgr.es/m/c12516b1-77dc-4ad3-94a7-88527360aee0@app.fastmail.com

Recap: This is about adding support to import/export text-based formats such as
JSONL, or any unstructured text file, where wanting to import each line "as is"
into a single column, or wanting to export a single column to a text file.

Example importing the meson-logs/testlog.json file Meson generates
when building PostgreSQL, which is in JSONL format:

# create table meson_log (log_line jsonb);
# \copy meson_log from meson-logs/testlog.json (format single);
COPY 306
# select log_line->'name' name, log_line->'result' result from meson_log limit 3;
name | result
-----------------------------------------+--------
"postgresql:setup / tmp_install" | "OK"
"postgresql:setup / install_test_files" | "OK"
"postgresql:setup / initdb_cache" | "OK"
(3 rows)

Changes since v16:

* EOL handling now works the same as for 'text' and 'csv'.
In v16, we supported multi-byte delimiters to allow specifying
e.g. Windows EOL (\r\n), but this seemed unnecessary, if we just do what we do
for text/csv, that is, to auto-detect the EOL for COPY FROM, and use
the OS default EOL for COPY TO.
The DELIMITER option is therefore invalid for the 'single' format.
This is the biggest change in the code, between v16 and v18.
CopyReadLineRawText() has been renamed to CopyReadLineSingleText(),
and changed accordingly.

* A final EOL is now emitted to the last record in COPY TO.
So now it works just like 'text' and 'csv'.

* HEADER [ boolean | MATCH ] now supported
This is now again supported, as previously suggested by Daniel Verite,
possible thanks to the EOL handling.

* Docs updated.

Below is quoted directly from the copy.sgml, but in plaintext:

---
Single Format

This format option is used for importing and exporting files containing
unstructured text, where each line is treated as a single field. It is
useful for data that does not conform to a structured, tabular format and
lacks delimiters.

In the single format, each line of the input or output is
considered a complete value without any field separation. There are no
field delimiters, and all characters are taken literally. There is no
special handling for quotes, backslashes, or escape sequences. All
characters, including whitespace and special characters, are preserved
exactly as they appear in the file. However, it's important to note that
the text is still interpreted according to the specified ENCODING
option or the current client encoding for input, and encoded using the
specified ENCODING or the current client encoding for output.

When using this format, the COPY command must specify exactly one column.
Specifying multiple columns will result in an error.
If the table has multiple columns and no column list is provided, an error
will occur.

The single format does not distinguish a NULL value from an empty string.
Empty lines are imported as empty strings, not as NULL values.

Encoding works the same as in the text and CSV formats.
---

On Fri, Nov 1, 2024, at 22:28, Masahiko Sawada wrote:
> I think it would be better to explain how to parse data in raw mode,
> especially which steps in the pipeline we skip, in the comment at the
> top of copyfromparse.c.

Good idea. I've explained it in the comment.

/Joel

Attachment Content-Type Size
v18-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch application/octet-stream 18.8 KB
v18-0002-Add-COPY-format-single.patch application/octet-stream 32.3 KB
v18-0003-Reorganize-option-validations.patch application/octet-stream 19.8 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2024-11-07 16:19:50 Re: New "raw" COPY format
Previous Message Andres Freund 2024-11-07 16:12:37 Re: Popcount optimization using AVX512