Re: New "raw" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Masahiko Sawada" <sawada(dot)mshk(at)gmail(dot)com>
Cc: "jian he" <jian(dot)universality(at)gmail(dot)com>, "Tatsuo Ishii" <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-29 16:48:27
Message-ID: 727a4968-a30d-4f64-9fbe-c1a43d5fed71@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 28, 2024, at 18:50, Masahiko Sawada wrote:
> Thank you for updating the patch. Here are review comments on the v15
> 0002 patch:

Thanks for review.

> When testing the patch with an empty delimiter, I got the following failure:
>
> postgres(1:903898)=# copy hoge from '/tmp/tmp.raw' with (format 'raw',
> delimiter '');
> TRAP: failed Assert("delim_len > 0"), File: "copyfromparse.c", Line:
> 1173, PID: 903898

Fixed.

> ---
> - else
> + else if (cstate->opts.format == COPY_FORMAT_TEXT)
> fldct = CopyReadAttributesText(cstate);
> + else
> + {
> + elog(ERROR, "unexpected COPY format: %d", cstate->opts.format);
> + pg_unreachable();
> + }
>
> Since we already check the incompatible options with COPY_FORMAT_RAW
> and default_print, I think it's better to add an assertion to make
> sure the format is either COPY_FORMAT_CSV or COPY_FORMAT_TEXT, instead
> of using elog(ERROR).

I agree, fixed.

> ---
> +/*
> + * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode
> + */
> +static bool
> +CopyReadLineRawText(CopyFromState cstate)
>
> This function has a lot of duplication with CopyReadLineText(). I
> think it's better to modify CopyReadLineText() to support 'raw'
> format, rather than adding a separate function.

Hmm, there is a bit of duplication, yes, but is also a hot-path,
so I think we want to minimize branches and code size in the
hot loop.

Combining them into one function, would mean the total function
size and branching increases for both cases.

I haven't made any benchmarks on this though.

> ---
> + bool read_entire_file = (cstate->opts.delim == NULL);
> + int delim_len = cstate->opts.delim ? strlen(cstate->opts.delim) : 0;
>
> I think we can use 'delim_len == 0' instead of read_entire_file.

Fixed.

> ---
> + if (read_entire_file)
> + {
> + /* Continue until EOF if reading entire file */
> + input_buf_ptr++;
> + continue;
> + }
>
> In the case where we're reading the entire file as a single tuple, we
> don't need to advance the input_buf_ptr one by one. Instead,
> input_buf_ptr can jump to copy_buf_len, which is faster.

Fixed.

> ---
> + /* Check for delimiter, possibly multi-byte */
> + IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(delim_len - 1);
> + if (strncmp(&copy_input_buf[input_buf_ptr], cstate->opts.delim,
> + delim_len) == 0)
> + {
> + cstate->eol_type = EOL_CUSTOM;
> + input_buf_ptr += delim_len;
> + break;
> + }
> + input_buf_ptr++;
>
> Similar to the above comment, I think we don't need to check the char
> one by one. I guess that it would be faster if we locate the delimiter
> string in the intput_buf (e.g. using strstr()), and then move
> input_buf_ptr to the detected position.

Fixed.

>
> ---
> + /* Copy the entire line into attribute_buf */
> + memcpy(cstate->attribute_buf.data, cstate->line_buf.data,
> + cstate->line_buf.len);
> + cstate->attribute_buf.data[cstate->line_buf.len] = '\0';
> + cstate->attribute_buf.len = cstate->line_buf.len;
>
> The CopyReadAttributesRaw() just copies line_buf data to
> attirbute_buf, which seems to be a waste. I think we can have
> attribute_buf point to the line_buf. That way, we can skip the whole
> step 4 that is described in the comment on top o f copyfromparse.c:
>
> * [data source] --> raw_buf --> input_buf --> line_buf --> attribute_buf
> * 1. 2. 3. 4.
>

Fixed. I've removed CopyReadAttributesRaw() entirely.

> ---
> +static int
> +CopyReadAttributesRaw(CopyFromState cstate)
> +{
> + /* Enforce single column requirement */
> + if (cstate->max_fields != 1)
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("COPY with format 'raw' must specify exactly
> one column")));
> + }
>
> This check should have already been done in BeginCopyFrom(). Is there
> any case where max_fields gets to != 1 during reading the input?

Good point. Removed.

>
> ---
> It's a bit odd to me to use the delimiter as a EOL marker in raw
> format, but probably it's okay.
>
> ---
> - if (cstate->opts.format != COPY_FORMAT_BINARY)
> + if (cstate->opts.format == COPY_FORMAT_RAW &&
> + cstate->opts.delim != NULL)
> + {
> + /* Output the user-specified delimiter between rows */
> + CopySendString(cstate, cstate->opts.delim);
> + }
> + else if (cstate->opts.format == COPY_FORMAT_TEXT ||
> + cstate->opts.format == COPY_FORMAT_CSV)
>
> Since it sends the delimiter as a string, even if we specify the
> delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10).
> For example,
>
> postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with
> (format 'raw', delimiter '\n');
> {"j": 1}\npostgres(1:904427)=#
>
> I think there is a similar problem in COPY FROM; if we set a delimiter
> to '\n' when doing COPY FROM in raw format, it expects the string '\n'
> as a line termination but not ASCII LF(10). I think that input data
> normally doesn't use the string '\n' as a line termination.

You need to use E'\n' to get ASCII LF(10), since '\n' is just a delimiter
consisting of backslash followed by "n".

Is this a problem? Since any string can be used as delimiter,
I think it would be strange if we parsed it and replaced the string
with a different string.

Another thought:

Maybe we shouldn't default to no delimiter after all,
maybe it would be better to default to the OS default EOL,
and maybe a final delimiter should always be written at the end,
so that when exporting a single json field, it would get exported
to the text file with \n at the end, which is what most text editor
does when saving a .json file.

/Joel

Attachment Content-Type Size
v16-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch application/octet-stream 18.8 KB
v16-0002-Add-raw-format-to-COPY-command.patch application/octet-stream 41.4 KB
v16-0003-Reorganize-option-validations.patch application/octet-stream 19.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christoph Berg 2024-10-29 16:51:28 Re: RFC: Extension Packaging & Lookup
Previous Message Jacob Champion 2024-10-29 16:40:21 Re: [PoC] Federated Authn/z with OAUTHBEARER