Re: New "raw" COPY format

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: jian he <jian(dot)universality(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-28 17:50:37
Message-ID: CAD21AoA1P1JNrHxHRmzJE=GDaUEm79pYYDvG9nPzQaeJLXiwsw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 28, 2024 at 3:21 AM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> On Mon, Oct 28, 2024, at 10:30, Joel Jacobson wrote:
> > On Mon, Oct 28, 2024, at 08:56, jian he wrote:
> >> /* Check force_quote */
> >> - if (!opts_out->csv_mode && (opts_out->force_quote ||
> >> opts_out->force_quote_all))
> >> + if (opts_out->format != COPY_FORMAT_CSV && (opts_out->force_quote ||
> >> + opts_out->force_quote_all))
> >> ereport(ERROR,
> >> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
> >>
> >> maybe this has a code indentation issue.
> >> since "if" and "opts_out" in the same column position.
> >
> > Thanks for review.
> >
> > I've fixed the indentation issues.
>
> I've now installed pgindent, and will use it from hereon, to avoid this class of problems.
>
> New version where all three patches are now indented using pgindent.

Thank you for updating the patch. Here are review comments on the v15
0002 patch:

When testing the patch with an empty delimiter, I got the following failure:

postgres(1:903898)=# copy hoge from '/tmp/tmp.raw' with (format 'raw',
delimiter '');
TRAP: failed Assert("delim_len > 0"), File: "copyfromparse.c", Line:
1173, PID: 903898

---
- else
+ else if (cstate->opts.format == COPY_FORMAT_TEXT)
fldct = CopyReadAttributesText(cstate);
+ else
+ {
+ elog(ERROR, "unexpected COPY format: %d", cstate->opts.format);
+ pg_unreachable();
+ }

Since we already check the incompatible options with COPY_FORMAT_RAW
and default_print, I think it's better to add an assertion to make
sure the format is either COPY_FORMAT_CSV or COPY_FORMAT_TEXT, instead
of using elog(ERROR).

---
+/*
+ * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode
+ */
+static bool
+CopyReadLineRawText(CopyFromState cstate)

This function has a lot of duplication with CopyReadLineText(). I
think it's better to modify CopyReadLineText() to support 'raw'
format, rather than adding a separate function.

---
+ bool read_entire_file = (cstate->opts.delim == NULL);
+ int delim_len = cstate->opts.delim ? strlen(cstate->opts.delim) : 0;

I think we can use 'delim_len == 0' instead of read_entire_file.

---
+ if (read_entire_file)
+ {
+ /* Continue until EOF if reading entire file */
+ input_buf_ptr++;
+ continue;
+ }

In the case where we're reading the entire file as a single tuple, we
don't need to advance the input_buf_ptr one by one. Instead,
input_buf_ptr can jump to copy_buf_len, which is faster.

---
+ /* Check for delimiter, possibly multi-byte */
+ IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(delim_len - 1);
+ if (strncmp(&copy_input_buf[input_buf_ptr], cstate->opts.delim,
+ delim_len) == 0)
+ {
+ cstate->eol_type = EOL_CUSTOM;
+ input_buf_ptr += delim_len;
+ break;
+ }
+ input_buf_ptr++;

Similar to the above comment, I think we don't need to check the char
one by one. I guess that it would be faster if we locate the delimiter
string in the intput_buf (e.g. using strstr()), and then move
input_buf_ptr to the detected position.

---
+ /* Copy the entire line into attribute_buf */
+ memcpy(cstate->attribute_buf.data, cstate->line_buf.data,
+ cstate->line_buf.len);
+ cstate->attribute_buf.data[cstate->line_buf.len] = '\0';
+ cstate->attribute_buf.len = cstate->line_buf.len;

The CopyReadAttributesRaw() just copies line_buf data to
attirbute_buf, which seems to be a waste. I think we can have
attribute_buf point to the line_buf. That way, we can skip the whole
step 4 that is described in the comment on top o f copyfromparse.c:

* [data source] --> raw_buf --> input_buf --> line_buf --> attribute_buf
* 1. 2. 3. 4.

---
+static int
+CopyReadAttributesRaw(CopyFromState cstate)
+{
+ /* Enforce single column requirement */
+ if (cstate->max_fields != 1)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("COPY with format 'raw' must specify exactly
one column")));
+ }

This check should have already been done in BeginCopyFrom(). Is there
any case where max_fields gets to != 1 during reading the input?

---
It's a bit odd to me to use the delimiter as a EOL marker in raw
format, but probably it's okay.

---
- if (cstate->opts.format != COPY_FORMAT_BINARY)
+ if (cstate->opts.format == COPY_FORMAT_RAW &&
+ cstate->opts.delim != NULL)
+ {
+ /* Output the user-specified delimiter between rows */
+ CopySendString(cstate, cstate->opts.delim);
+ }
+ else if (cstate->opts.format == COPY_FORMAT_TEXT ||
+ cstate->opts.format == COPY_FORMAT_CSV)

Since it sends the delimiter as a string, even if we specify the
delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10).
For example,

postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with
(format 'raw', delimiter '\n');
{"j": 1}\npostgres(1:904427)=#

I think there is a similar problem in COPY FROM; if we set a delimiter
to '\n' when doing COPY FROM in raw format, it expects the string '\n'
as a line termination but not ASCII LF(10). I think that input data
normally doesn't use the string '\n' as a line termination.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2024-10-28 18:35:02 Re: pgsql: Implement pg_wal_replay_wait() stored procedure
Previous Message Bernd Helmle 2024-10-28 17:44:10 Re: [PATCH] Add sortsupport for range types and btree_gist