Re: Should CSV parsing be stricter about mid-field quotes?

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2023-05-17 17:42:08
Message-ID: 62663752-ddc4-74e0-e982-ed781692eb90@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 2023-05-16 Tu 13:15, Joel Jacobson wrote:
> On Tue, May 16, 2023, at 13:43, Joel Jacobson wrote:
> >If we made midfield quoting a CSV error, those users who are
> currently mistaken
> >about their TSV/TEXT files being CSV while also having balanced
> quotes in their
> >data, would encounter an error rather than a silent failure, which I
> believe
> >would be an enhancement.
>
> Furthermore, I think it could be beneficial to add a HINT message for
> all type
> of CSV/TEXT parsing errors, since the precise ERROR messages might
> just cause
> the user to tinker with the options until it works, instead of
> carefully reading
> through the documentation on the various formats.
>
> Perhaps something like this:
>
> HINT: Are you sure the FORMAT matches your input?
>
> Also, the COPY documentation says nothing about TSV, and I know TEXT isn't
> exactly TSV, but it's at least much more TSV than CSV, so maybe we should
> describe the differences, such as \N. I think the best advise to users
> would be
> to avoid exporting to .TSV and use .CSV instead, since I've noticed e.g.
> Google Sheets to replace newlines in fields with blank space when
> exporting .TSV, which effectively destroys data.
>
> The first search results for "postgresql tsv" on Google link to
> postgresql.org
> pages, but the COPY docs are not one of them unfortunately.
>
> The first relevant hit is this one:
>
> "Importing a TSV File into Postgres | by Riley Wong" [1]
>
> Sadly, this author has also misunderstood how to properly import a
> .TSV file,
> he got it all wrong, and doesn't understand or at least doesn't
> mention there
> are more differences than just the delimiter:
>
> COPY listings
> FROM '/home/ec2-user/list.tsv'
> DELIMITER E'\t'
> CSV HEADER;
>
> I must confess I have used PostgreSQL for over two decades without
> having really
> understood the detailed differences between TEXT and CSV, until recently.
>
> [1]
> https://medium.com/@rlwong2/importing-a-tsv-file-into-postgres-364572a004bf

You can use CSV mode pretty reliably for TSV files. The trick is to use
a quoting char that shouldn't appear, such as E'\x01' as well as setting
the delimiter to E'\t'. Yes, it's far from obvious.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2023-05-17 17:45:06 Re: psql: Could we get "-- " prefixing on the **** QUERY **** outputs? (ECHO_HIDDEN)
Previous Message Kirk Wolak 2023-05-17 17:39:12 Re: psql: Could we get "-- " prefixing on the **** QUERY **** outputs? (ECHO_HIDDEN)