From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Joel Jacobson <joel(at)compiler(dot)org> |
Cc: | jian he <jian(dot)universality(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: New "raw" COPY format |
Date: | 2024-10-28 17:50:37 |
Message-ID: | CAD21AoA1P1JNrHxHRmzJE=GDaUEm79pYYDvG9nPzQaeJLXiwsw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Oct 28, 2024 at 3:21 AM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> On Mon, Oct 28, 2024, at 10:30, Joel Jacobson wrote:
> > On Mon, Oct 28, 2024, at 08:56, jian he wrote:
> >> /* Check force_quote */
> >> - if (!opts_out->csv_mode && (opts_out->force_quote ||
> >> opts_out->force_quote_all))
> >> + if (opts_out->format != COPY_FORMAT_CSV && (opts_out->force_quote ||
> >> + opts_out->force_quote_all))
> >> ereport(ERROR,
> >> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
> >>
> >> maybe this has a code indentation issue.
> >> since "if" and "opts_out" in the same column position.
> >
> > Thanks for review.
> >
> > I've fixed the indentation issues.
>
> I've now installed pgindent, and will use it from hereon, to avoid this class of problems.
>
> New version where all three patches are now indented using pgindent.
Thank you for updating the patch. Here are review comments on the v15
0002 patch:
When testing the patch with an empty delimiter, I got the following failure:
postgres(1:903898)=# copy hoge from '/tmp/tmp.raw' with (format 'raw',
delimiter '');
TRAP: failed Assert("delim_len > 0"), File: "copyfromparse.c", Line:
1173, PID: 903898
---
- else
+ else if (cstate->opts.format == COPY_FORMAT_TEXT)
fldct = CopyReadAttributesText(cstate);
+ else
+ {
+ elog(ERROR, "unexpected COPY format: %d", cstate->opts.format);
+ pg_unreachable();
+ }
Since we already check the incompatible options with COPY_FORMAT_RAW
and default_print, I think it's better to add an assertion to make
sure the format is either COPY_FORMAT_CSV or COPY_FORMAT_TEXT, instead
of using elog(ERROR).
---
+/*
+ * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode
+ */
+static bool
+CopyReadLineRawText(CopyFromState cstate)
This function has a lot of duplication with CopyReadLineText(). I
think it's better to modify CopyReadLineText() to support 'raw'
format, rather than adding a separate function.
---
+ bool read_entire_file = (cstate->opts.delim == NULL);
+ int delim_len = cstate->opts.delim ? strlen(cstate->opts.delim) : 0;
I think we can use 'delim_len == 0' instead of read_entire_file.
---
+ if (read_entire_file)
+ {
+ /* Continue until EOF if reading entire file */
+ input_buf_ptr++;
+ continue;
+ }
In the case where we're reading the entire file as a single tuple, we
don't need to advance the input_buf_ptr one by one. Instead,
input_buf_ptr can jump to copy_buf_len, which is faster.
---
+ /* Check for delimiter, possibly multi-byte */
+ IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(delim_len - 1);
+ if (strncmp(©_input_buf[input_buf_ptr], cstate->opts.delim,
+ delim_len) == 0)
+ {
+ cstate->eol_type = EOL_CUSTOM;
+ input_buf_ptr += delim_len;
+ break;
+ }
+ input_buf_ptr++;
Similar to the above comment, I think we don't need to check the char
one by one. I guess that it would be faster if we locate the delimiter
string in the intput_buf (e.g. using strstr()), and then move
input_buf_ptr to the detected position.
---
+ /* Copy the entire line into attribute_buf */
+ memcpy(cstate->attribute_buf.data, cstate->line_buf.data,
+ cstate->line_buf.len);
+ cstate->attribute_buf.data[cstate->line_buf.len] = '\0';
+ cstate->attribute_buf.len = cstate->line_buf.len;
The CopyReadAttributesRaw() just copies line_buf data to
attirbute_buf, which seems to be a waste. I think we can have
attribute_buf point to the line_buf. That way, we can skip the whole
step 4 that is described in the comment on top o f copyfromparse.c:
* [data source] --> raw_buf --> input_buf --> line_buf --> attribute_buf
* 1. 2. 3. 4.
---
+static int
+CopyReadAttributesRaw(CopyFromState cstate)
+{
+ /* Enforce single column requirement */
+ if (cstate->max_fields != 1)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("COPY with format 'raw' must specify exactly
one column")));
+ }
This check should have already been done in BeginCopyFrom(). Is there
any case where max_fields gets to != 1 during reading the input?
---
It's a bit odd to me to use the delimiter as a EOL marker in raw
format, but probably it's okay.
---
- if (cstate->opts.format != COPY_FORMAT_BINARY)
+ if (cstate->opts.format == COPY_FORMAT_RAW &&
+ cstate->opts.delim != NULL)
+ {
+ /* Output the user-specified delimiter between rows */
+ CopySendString(cstate, cstate->opts.delim);
+ }
+ else if (cstate->opts.format == COPY_FORMAT_TEXT ||
+ cstate->opts.format == COPY_FORMAT_CSV)
Since it sends the delimiter as a string, even if we specify the
delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10).
For example,
postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with
(format 'raw', delimiter '\n');
{"j": 1}\npostgres(1:904427)=#
I think there is a similar problem in COPY FROM; if we set a delimiter
to '\n' when doing COPY FROM in raw format, it expects the string '\n'
as a line termination but not ASCII LF(10). I think that input data
normally doesn't use the string '\n' as a line termination.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Korotkov | 2024-10-28 18:35:02 | Re: pgsql: Implement pg_wal_replay_wait() stored procedure |
Previous Message | Bernd Helmle | 2024-10-28 17:44:10 | Re: [PATCH] Add sortsupport for range types and btree_gist |