Re: New "raw" COPY format

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: jian he <jian(dot)universality(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-30 22:29:46
Message-ID: CAD21AoAOxrmVqKd7cKTdxni_B10Y8d6E+1TeZu=ZqXh+o53Kxg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 29, 2024 at 9:48 AM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> > ---
> > It's a bit odd to me to use the delimiter as a EOL marker in raw
> > format, but probably it's okay.
> >
> > ---
> > - if (cstate->opts.format != COPY_FORMAT_BINARY)
> > + if (cstate->opts.format == COPY_FORMAT_RAW &&
> > + cstate->opts.delim != NULL)
> > + {
> > + /* Output the user-specified delimiter between rows */
> > + CopySendString(cstate, cstate->opts.delim);
> > + }
> > + else if (cstate->opts.format == COPY_FORMAT_TEXT ||
> > + cstate->opts.format == COPY_FORMAT_CSV)
> >
> > Since it sends the delimiter as a string, even if we specify the
> > delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10).
> > For example,
> >
> > postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with
> > (format 'raw', delimiter '\n');
> > {"j": 1}\npostgres(1:904427)=#
> >
> > I think there is a similar problem in COPY FROM; if we set a delimiter
> > to '\n' when doing COPY FROM in raw format, it expects the string '\n'
> > as a line termination but not ASCII LF(10). I think that input data
> > normally doesn't use the string '\n' as a line termination.
>
> You need to use E'\n' to get ASCII LF(10), since '\n' is just a delimiter
> consisting of backslash followed by "n".
>
> Is this a problem? Since any string can be used as delimiter,
> I think it would be strange if we parsed it and replaced the string
> with a different string.
>
> Another thought:
>
> Maybe we shouldn't default to no delimiter after all,
> maybe it would be better to default to the OS default EOL,

That seems useful for loading unstructured, non-delimited text files
such as log files. Users can set the delimiter to an empty string when
loading the entire file into a single column. On the other hand, I
think that if we make the OS EOL the default, it might not be necessary
to allow the delimiter to be multibyte. That would be flexible, but not
strictly necessary. Also, it would seem somewhat odd to me if we could
use a multibyte delimiter only in 'raw' mode, but not in 'text' mode.
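
To make the delimiter semantics discussed above concrete, here is a
minimal sketch. It is not code from the patch: count_raw_rows() is a
hypothetical helper, and it assumes the input has no embedded NUL
bytes, since it relies on strstr(). It treats the delimiter as a
literal byte string, treats an empty delimiter as "the whole input is
one row", and treats a trailing delimiter as terminating the last row
rather than starting an empty one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only (not PostgreSQL code).
 * Counts the rows in buf when it is split on the literal byte string
 * delim. No escape processing is done on delim.
 */
size_t
count_raw_rows(const char *buf, const char *delim)
{
    size_t      dlen;
    size_t      rows = 0;
    const char *p = buf;
    const char *hit;

    assert(buf != NULL && delim != NULL);

    dlen = strlen(delim);

    if (*buf == '\0')
        return 0;               /* empty input: no rows */
    if (dlen == 0)
        return 1;               /* empty delimiter: whole input is one row */

    /* Each occurrence of the delimiter terminates one row. */
    while ((hit = strstr(p, delim)) != NULL)
    {
        rows++;
        p = hit + dlen;
    }

    /* Remaining bytes form a final row with no trailing delimiter. */
    if (*p != '\0')
        rows++;

    return rows;
}
```

With the C string "\n" (one byte, ASCII LF) as the delimiter, the input
"a\nb\n" splits into two rows. With "\\n" (two bytes, a backslash
followed by "n") the same input is a single row, which mirrors the
'\n' versus E'\n' difference on the SQL side.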

> maybe a final delimiter should always be written at the end,
> so that when exporting a single json field, it would get exported
> to the text file with \n at the end, which is what most text editor
> does when saving a .json file.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
