Re: New "raw" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "jian he" <jian(dot)universality(at)gmail(dot)com>
Cc: "Tatsuo Ishii" <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-19 07:55:15
Message-ID: 2dde695a-4a30-4aea-8978-aa78c5c1648e@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 18, 2024, at 19:24, Joel Jacobson wrote:
> Attachments:
> * v11-0001-Refactor-ProcessCopyOptions-introduce-CopyFormat-enu.patch
> * v11-0002-Add-raw-format-to-COPY-command.patch

Here is a demo of a importing a decently sized real text file,
that can't currently be imported without the CSV hack:

$ head /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
.package-cache-mutate devel/cargo
bin admin/base-files
bin/archdetect admin/ubiquity
bin/autopartition admin/ubiquity
bin/autopartition-crypto admin/ubiquity
bin/autopartition-loop admin/ubiquity
bin/autopartition-lvm admin/ubiquity
bin/block-attr admin/ubiquity
bin/blockdev-keygen admin/ubiquity
bin/blockdev-wipe admin/ubiquity

This file uses a combination of tabs and spaces, in between the two columns,
so none of the existing formats are suitable to deal with this file.

$ ls -lah /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
-rw-r--r-- 1 root root 791M Apr 24 02:07 /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64

To import using the CSV hack, we first have find two bytes that don't exist anyway,
which can be done using e.g. ripgrep. The below command verifies \x01 and \x02
don't exist anywhere:

$ rg -uuu --multiline '(?-u)[\x01|\x02]' /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
$

Knowing these bytes don't exist anywhere,
we can then safely use these as delimiter and quote characters,
as a hack to disable these features:

CREATE TABLE package_contents (raw_line text);

COPY package_contents FROM '/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' (FORMAT CSV, DELIMITER E'\x01', QUOTE E'\x02');
COPY 8443588
Time: 3882.100 ms (00:03.882)
Time: 3552.991 ms (00:03.553)
Time: 3748.038 ms (00:03.748)
Time: 3775.947 ms (00:03.776)
Time: 3729.020 ms (00:03.729)

I tested writing a Rust program that would read the file line-by-line and INSERT each line instead.
This is of course a lot slower, since it has to execute each insert separately:

$ cargo run --release
Compiling insert_package_contents v0.1.0 (/home/joel/insert_package_contents)
Finished `release` profile [optimized] target(s) in 0.70s
Running `target/release/insert_package_contents`
Connecting to the PostgreSQL database...
Successfully connected to the database.
Starting to insert lines from the file...
Successfully inserted 8443588 lines into package_contents in 134.65s.

New approach using the RAW format:

COPY package_contents FROM '/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' (FORMAT RAW, DELIMITER E'\n');
COPY 8443588
Time: 2918.489 ms (00:02.918)
Time: 3020.372 ms (00:03.020)
Time: 3336.589 ms (00:03.337)
Time: 3067.268 ms (00:03.067)
Time: 3343.694 ms (00:03.344)

Apart from the convenience improvement,
it seems to be somewhat faster already.

/Joel

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey M. Borodin 2024-10-19 08:41:50 Using read_stream in index vacuum
Previous Message Joel Jacobson 2024-10-19 06:52:52 Re: [BUG FIX] Fix validation of COPY options FORCE_NOT_NULL/FORCE_NULL