Re: New "single" COPY format

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Joel Jacobson <joel(at)compiler(dot)org>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: New "single" COPY format
Date: 2024-12-19 13:40:05
Message-ID: 0b70a518-f6cc-483b-8e1c-51a8585f0f72@dunslane.net
Lists: pgsql-hackers


On 2024-12-16 Mo 10:09 AM, Joel Jacobson wrote:
> Hi hackers,
>
> After further consideration, I'm withdrawing the patch.
> Some fundamental questions remain unresolved:
>
> - Should round-trip fidelity be a strict goal? By "round-trip fidelity",
> I mean that data exported and then re-imported should yield exactly
> the original values, including the distinction between NULL and empty strings.
> - If round-trip fidelity is a requirement, how do we distinguish NULL from empty
> strings without delimiters or escapes?
> - Is automatic newline detection (as in "csv" and "text") more valuable than
> the ability to embed \r (CR) characters?
> - Would it be better to extend the existing COPY options rather than introducing
> a new format?
> - Or should we consider a JSONL format instead, one that avoids the NULL/empty
> string problem entirely?
>
> No clear solution or consensus has emerged. For now, I'll step back from the
> proposal. If someone wants to revisit this later, I'd be happy to contribute.
>
> Thanks again for all the feedback and consideration.
>

We seem to have got seriously into the weeds here. I'd be sorry to see
this dropped. After all, it's not something new, and while we have a
sort of workaround for "one json doc per line", it's far from obvious
and, except in a few blog posts, undocumented.

I think we're trying to be far too general here, in the absence of
more general use cases. The ones I recall having encountered in the wild
are:

  . one json datum per line

  . one json document per file

  . a sequence of json documents per file

The last one is hard to deal with, and I think I've only seen it once or
twice, so I suggest leaving it aside for now.
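
As a rough illustration of why the last case is harder than the first
(Python, purely a sketch and not anything from the patch): with one datum
per line the record boundary is just the newline, whereas a sequence of
documents can only be split by actually parsing. The helper name
iter_json_documents is hypothetical.

```python
import json


def iter_json_documents(text: str):
    """Yield each JSON document from a string holding several in a row.

    The documents may span lines, so boundaries are found by the parser,
    not by splitting on newlines.
    """
    decoder = json.JSONDecoder()
    pos = 0
    n = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < n and text[pos] in " \t\r\n":
            pos += 1
        if pos >= n:
            return
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value


docs = list(iter_json_documents('{"a": 1}\n{"b":\n 2} [3]'))
```

Here the second document spans two lines and the third shares a line with
it, which is exactly what a line-oriented reader cannot handle.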

Notice these are all JSON. I could imagine XML might have similar
requirements, but I encounter it extremely rarely.

Regarding NULL, an empty string is not a valid JSON literal, so there
should be no confusion there. It is valid for XML, though.
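
A minimal sketch of that point, in Python rather than anything
PostgreSQL-specific. The function name parse_jsonl_field and the convention
that an empty line stands for SQL NULL are assumptions made for
illustration only; the thread has not settled on any such rule.

```python
import json


def parse_jsonl_field(line: str):
    """Return (is_null, value) for one input line.

    Assumption: an empty line means SQL NULL. That is unambiguous because
    the empty string is not a valid JSON text, so no real datum can ever
    look like it.
    """
    if line == "":
        return (True, None)
    return (False, json.loads(line))


# The empty string really is rejected by a JSON parser.
try:
    json.loads("")
    empty_is_valid_json = True
except json.JSONDecodeError:
    empty_is_valid_json = False
```

Note that a JSON empty string is written as two quote characters, and a
JSON null is the literal null, so both remain distinct from an empty line.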

Given all that, I think restricting ourselves to just the JSON cases, and
possibly just to JSONL, would be perfectly reasonable.

Regarding CR, it's not a valid character in a JSON string item, although
it is valid in JSON whitespace. I would not treat it as magical unless
it immediately precedes an NL. That gives rise to a very slight
ambiguity, but I think it's one we could live with.
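
The rule above could be sketched like this (Python, illustrative only; the
name split_records is hypothetical, and reading the "slight ambiguity" as
the trailing-CR case shown at the end is my interpretation):

```python
import json


def split_records(data: bytes) -> list[bytes]:
    """Split on NL, dropping a CR only when it immediately precedes the NL."""
    pieces = data.split(b"\n")
    # The final piece is not followed by an NL, so it is left untouched.
    last = pieces.pop()
    records = [p[:-1] if p.endswith(b"\r") else p for p in pieces]
    if last:
        records.append(last)
    return records


# CRLF and LF line endings give the same records.
mixed = split_records(b'{"a":1}\r\n{"b":2}\n')

# A CR elsewhere is ordinary JSON whitespace and is kept as data.
inner = split_records(b'{"a":\r1}\n')
inner_value = json.loads(inner[0])

# The ambiguity: a datum whose trailing whitespace is a CR, terminated by
# a bare NL, is byte-for-byte the same input as a CRLF-terminated datum.
same = split_records(b'{"a":1}\r\n') == split_records(b'{"a":1}\n')
```

Since the lost CR could only ever have been insignificant whitespace, the
parsed value is the same either way.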

As for what the format is called, I don't like the "LIST" proposal much,
even for the general case. Seems too close to an array.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
