New "raw" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: New "raw" COPY format
Date: 2024-10-11 20:29:15
Message-ID: c12516b1-77dc-4ad3-94a7-88527360aee0@app.fastmail.com
Lists: pgsql-hackers

Hi hackers,

This thread is about implementing a new "raw" COPY format.

The idea came up in a different thread [1]; I'm moving the discussion here.

[1] https://postgr.es/m/47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f%40app.fastmail.com

The main use case for the raw format is importing arbitrary
unstructured text files, such as log files, into a single text column
of a table.

The name "raw" is just a working title. Andrew had some other good name ideas:
> WFM, so something like FORMAT {SIMPLE, RAW, FAST, SINGLE}?

Below is the draft of its description, sent previously [1] and
adjusted thanks to feedback from Daniel Verite, who made me realize that the
HEADER option should also be made available for this format.

--- START OF DESCRIPTION ---

Raw Format

The "raw" format is used for importing and exporting files containing
unstructured text, where each line is treated as a single field. This format
is ideal when dealing with data that doesn't conform to a structured,
tabular format and lacks delimiters.

Key Characteristics:

- No Field Delimiters:
Each line is considered a complete value without any field separation.

- Single Column Requirement:
The COPY command must specify exactly one column when using the raw format.
Specifying multiple columns will result in an error.

- Literal Data Interpretation:
All characters are taken literally.
There is no special handling for quotes, backslashes, or escape sequences.

- No NULL Distinction:
Empty lines are imported as empty strings, not as NULL values.

Notes:

- Error Handling:
An error will occur if you use the raw format without specifying exactly
one column or if the table has multiple columns and no column list is
provided.

- Data Preservation:
All characters, including whitespace and special characters, are preserved
exactly as they appear in the file.

--- END OF DESCRIPTION ---
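
To make the described semantics concrete, here is a minimal sketch in C of
how a buffer could be split into raw fields. It is not taken from the patch:
RawField and raw_split_lines are hypothetical names, and it assumes lines
end in a single '\n' (handling of "\r\n" and of the HEADER option is not
covered by the description above and is left out).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type for illustration: one field per input line. */
typedef struct RawField
{
    const char *start;  /* points into the input buffer, not copied */
    size_t      len;    /* length in bytes, excluding the newline */
} RawField;

/*
 * Split buf into lines. Every line becomes exactly one field.
 * No delimiter, quote, or backslash processing is performed, and an
 * empty line yields a zero-length field (an empty string, not NULL).
 * Returns the number of fields stored (at most max_fields).
 */
size_t
raw_split_lines(const char *buf, size_t buflen,
                RawField *fields, size_t max_fields)
{
    size_t n = 0;
    size_t pos = 0;

    while (pos < buflen && n < max_fields)
    {
        const char *nl = memchr(buf + pos, '\n', buflen - pos);
        size_t      end = nl ? (size_t) (nl - buf) : buflen;

        fields[n].start = buf + pos;
        fields[n].len = end - pos;
        n++;

        /* Skip past the newline, or stop if this was the last line. */
        pos = nl ? end + 1 : buflen;
    }
    return n;
}
```

With this sketch, a line containing a backslash followed by "t" stays two
literal bytes, and a quoted string keeps its quote characters.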

Having studied the code that will be affected, I would like to first
try to improve the readability and maintainability of ProcessCopyOptions
before making any other changes.

This seems possible by just reorganizing it a bit.

It is actually already organized quite nicely: the code is mostly
grouped per option, but not always, as the handling of an option is
sometimes spread across different sections.

It seems possible to organize even more of it per-option,
which would make it easier to reason about each option separately.

This seems possible by organizing the checks per option,
under a single if-branch per option, and moving the setting
of defaults per option (when applicable) to the corresponding
else-branch.

This would also avoid setting defaults for options that are not applicable
to a given format; their initial NULL values would simply remain untouched.
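
As a rough illustration of that layout (not the code in the attached patch),
the sketch below handles one option under a single if-branch, with its
defaults in the else-branch. The types, the function name, and the error
strings are hypothetical stand-ins; real code would report errors with
ereport rather than returning a message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the real option structures. */
typedef enum SketchFormat
{
    SKETCH_TEXT,
    SKETCH_CSV,
    SKETCH_BINARY
} SketchFormat;

typedef struct SketchOptions
{
    SketchFormat format;
    const char  *delim;     /* NULL until specified or defaulted */
} SketchOptions;

/* Returns NULL on success, or an illustrative error message. */
const char *
sketch_process_delimiter(SketchOptions *opts, bool delim_specified)
{
    if (delim_specified)
    {
        /* All checks for this option live in one place. */
        if (opts->format == SKETCH_BINARY)
            return "cannot specify DELIMITER in BINARY mode";
        if (opts->delim == NULL || strlen(opts->delim) != 1)
            return "delimiter must be a single one-byte character";
    }
    else
    {
        /* Defaults only for the formats where the option applies. */
        if (opts->format == SKETCH_TEXT)
            opts->delim = "\t";
        else if (opts->format == SKETCH_CSV)
            opts->delim = ",";
        /* Binary: not applicable, so delim stays NULL. */
    }
    return NULL;
}
```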

Some of the checks depend on multiple options in an interdependent way,
not belonging to a specific option more than another. I think such checks
would be nice to place at the end under a separate section.
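
One example of such an interdependent check, sketched here in C under the
assumption that it runs after all per-option branches have resolved their
values, is verifying that the delimiter does not occur inside the NULL
representation. The function name and error string are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Cross-option check: belongs to neither the delimiter nor the NULL
 * string alone, so it sits in a final section of its own.
 * Returns NULL on success, or an illustrative error message.
 */
const char *
sketch_check_delim_vs_null(const char *delim, const char *null_print)
{
    /* Skip the check when either option is not applicable (still NULL). */
    if (delim != NULL && null_print != NULL &&
        delim[0] != '\0' &&
        strchr(null_print, delim[0]) != NULL)
        return "delimiter character must not appear in the NULL specification";

    return NULL;
}
```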

I also think it would be more readable to use the existing bool variables
named [option]_specified, to determine if an option has been set,
rather than relying on the option's default enum value to evaluate to false.
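
The difference between the two styles can be sketched as follows. The enum
and function names are hypothetical; the only assumption is an option whose
"off" value is zero, so that an explicit "off" is indistinguishable from
"not given" when testing the value alone.

```c
#include <stdbool.h>

/* Hypothetical option enum whose default ("off") value is zero. */
typedef enum SketchHeader
{
    SKETCH_HEADER_FALSE = 0,
    SKETCH_HEADER_TRUE,
    SKETCH_HEADER_MATCH
} SketchHeader;

typedef struct SketchHeaderOpt
{
    SketchHeader value;
    bool         specified;  /* set when the option appears at all */
} SketchHeaderOpt;

/* Value-based test: an explicit "off" looks the same as "not given". */
bool
sketch_header_set_by_value(const SketchHeaderOpt *opt)
{
    return opt->value != SKETCH_HEADER_FALSE;
}

/* Flag-based test: states the intent directly. */
bool
sketch_header_set_by_flag(const SketchHeaderOpt *opt)
{
    return opt->specified;
}
```

The two functions agree except when the option was written out with its
"off" value, which is exactly the case the flag makes visible.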

The attached patch implements the above ideas.

I think with these changes, it would be easier to hack on new and existing
copy options and formats.

/Joel

Attachments:

- v1-0001-Replace-binary-flags-binary-and-csv_mode-with-format.patch
(application/octet-stream, 18.0 KB)

- v1-0002-Reorganize-ProcessCopyOptions-for-clarity-and-consis.patch
(application/octet-stream, 19.0 KB)
