From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | "Masahiko Sawada" <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | "jian he" <jian(dot)universality(at)gmail(dot)com>, "Tatsuo Ishii" <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: New "raw" COPY format |
Date: | 2024-10-29 16:48:27 |
Message-ID: | 727a4968-a30d-4f64-9fbe-c1a43d5fed71@app.fastmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Oct 28, 2024, at 18:50, Masahiko Sawada wrote:
> Thank you for updating the patch. Here are review comments on the v15
> 0002 patch:
Thanks for review.
> When testing the patch with an empty delimiter, I got the following failure:
>
> postgres(1:903898)=# copy hoge from '/tmp/tmp.raw' with (format 'raw',
> delimiter '');
> TRAP: failed Assert("delim_len > 0"), File: "copyfromparse.c", Line:
> 1173, PID: 903898
Fixed.
> ---
> - else
> + else if (cstate->opts.format == COPY_FORMAT_TEXT)
> fldct = CopyReadAttributesText(cstate);
> + else
> + {
> + elog(ERROR, "unexpected COPY format: %d", cstate->opts.format);
> + pg_unreachable();
> + }
>
> Since we already check the incompatible options with COPY_FORMAT_RAW
> and default_print, I think it's better to add an assertion to make
> sure the format is either COPY_FORMAT_CSV or COPY_FORMAT_TEXT, instead
> of using elog(ERROR).
I agree, fixed.
> ---
> +/*
> + * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode
> + */
> +static bool
> +CopyReadLineRawText(CopyFromState cstate)
>
> This function has a lot of duplication with CopyReadLineText(). I
> think it's better to modify CopyReadLineText() to support 'raw'
> format, rather than adding a separate function.
Hmm, there is a bit of duplication, yes, but is also a hot-path,
so I think we want to minimize branches and code size in the
hot loop.
Combining them into one function, would mean the total function
size and branching increases for both cases.
I haven't made any benchmarks on this though.
> ---
> + bool read_entire_file = (cstate->opts.delim == NULL);
> + int delim_len = cstate->opts.delim ? strlen(cstate->opts.delim) : 0;
>
> I think we can use 'delim_len == 0' instead of read_entire_file.
Fixed.
> ---
> + if (read_entire_file)
> + {
> + /* Continue until EOF if reading entire file */
> + input_buf_ptr++;
> + continue;
> + }
>
> In the case where we're reading the entire file as a single tuple, we
> don't need to advance the input_buf_ptr one by one. Instead,
> input_buf_ptr can jump to copy_buf_len, which is faster.
Fixed.
> ---
> + /* Check for delimiter, possibly multi-byte */
> + IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(delim_len - 1);
> + if (strncmp(©_input_buf[input_buf_ptr], cstate->opts.delim,
> + delim_len) == 0)
> + {
> + cstate->eol_type = EOL_CUSTOM;
> + input_buf_ptr += delim_len;
> + break;
> + }
> + input_buf_ptr++;
>
> Similar to the above comment, I think we don't need to check the char
> one by one. I guess that it would be faster if we locate the delimiter
> string in the intput_buf (e.g. using strstr()), and then move
> input_buf_ptr to the detected position.
Fixed.
>
> ---
> + /* Copy the entire line into attribute_buf */
> + memcpy(cstate->attribute_buf.data, cstate->line_buf.data,
> + cstate->line_buf.len);
> + cstate->attribute_buf.data[cstate->line_buf.len] = '\0';
> + cstate->attribute_buf.len = cstate->line_buf.len;
>
> The CopyReadAttributesRaw() just copies line_buf data to
> attirbute_buf, which seems to be a waste. I think we can have
> attribute_buf point to the line_buf. That way, we can skip the whole
> step 4 that is described in the comment on top o f copyfromparse.c:
>
> * [data source] --> raw_buf --> input_buf --> line_buf --> attribute_buf
> * 1. 2. 3. 4.
>
Fixed. I've removed CopyReadAttributesRaw() entirely.
> ---
> +static int
> +CopyReadAttributesRaw(CopyFromState cstate)
> +{
> + /* Enforce single column requirement */
> + if (cstate->max_fields != 1)
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("COPY with format 'raw' must specify exactly
> one column")));
> + }
>
> This check should have already been done in BeginCopyFrom(). Is there
> any case where max_fields gets to != 1 during reading the input?
Good point. Removed.
>
> ---
> It's a bit odd to me to use the delimiter as a EOL marker in raw
> format, but probably it's okay.
>
> ---
> - if (cstate->opts.format != COPY_FORMAT_BINARY)
> + if (cstate->opts.format == COPY_FORMAT_RAW &&
> + cstate->opts.delim != NULL)
> + {
> + /* Output the user-specified delimiter between rows */
> + CopySendString(cstate, cstate->opts.delim);
> + }
> + else if (cstate->opts.format == COPY_FORMAT_TEXT ||
> + cstate->opts.format == COPY_FORMAT_CSV)
>
> Since it sends the delimiter as a string, even if we specify the
> delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10).
> For example,
>
> postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with
> (format 'raw', delimiter '\n');
> {"j": 1}\npostgres(1:904427)=#
>
> I think there is a similar problem in COPY FROM; if we set a delimiter
> to '\n' when doing COPY FROM in raw format, it expects the string '\n'
> as a line termination but not ASCII LF(10). I think that input data
> normally doesn't use the string '\n' as a line termination.
You need to use E'\n' to get ASCII LF(10), since '\n' is just a delimiter
consisting of backslash followed by "n".
Is this a problem? Since any string can be used as delimiter,
I think it would be strange if we parsed it and replaced the string
with a different string.
Another thought:
Maybe we shouldn't default to no delimiter after all,
maybe it would be better to default to the OS default EOL,
and maybe a final delimiter should always be written at the end,
so that when exporting a single json field, it would get exported
to the text file with \n at the end, which is what most text editor
does when saving a .json file.
/Joel
Attachment | Content-Type | Size |
---|---|---|
v16-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch | application/octet-stream | 18.8 KB |
v16-0002-Add-raw-format-to-COPY-command.patch | application/octet-stream | 41.4 KB |
v16-0003-Reorganize-option-validations.patch | application/octet-stream | 19.9 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Christoph Berg | 2024-10-29 16:51:28 | Re: RFC: Extension Packaging & Lookup |
Previous Message | Jacob Champion | 2024-10-29 16:40:21 | Re: [PoC] Federated Authn/z with OAUTHBEARER |