Re: New "raw" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Masahiko Sawada" <sawada(dot)mshk(at)gmail(dot)com>
Cc: "jian he" <jian(dot)universality(at)gmail(dot)com>, "Tatsuo Ishii" <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-11-02 11:08:31
Message-ID: 39c01141-1f65-4e9d-82dd-976b4945a0f2@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 1, 2024, at 22:28, Masahiko Sawada wrote:
> As I mentioned in a separate email, if we use the OS default EOL as
> the default EOL in raw format, it would not be necessary to allow it
> to be multi characters. I think it's worth considering it.

I like the idea, but not sure I understand how it would work.

What if a user's OS default is \n (LF) and this user wants
to import a Windows text file \r\n (CR LR), which is a
multi characters EOL delimiter.

Was your idea to make an exception for that particular EOL,
or to simply not support that edge case?

> ---
> * copyfromparse.c
> * Parse CSV/text/binary format for COPY FROM.
>
> We need to update here as well.
>
> --
> - * 4. CopyReadAttributesText/CSV() function takes the input line from
> + * 4. CopyReadAttributesText/CSV/Raw() function takes the input line from
>
> I think we don't have CopyReadAttributesRaw() function now.
>
> ---
> I think it would be better to explain how to parse data in raw mode,
> especially which steps in the pipeline we skip, in the comment at the
> top of copyfromparse.c.
>
> ---
> - if (cstate->opts.format == COPY_FORMAT_CSV)
> + if (cstate->opts.format == COPY_FORMAT_RAW)
> {
> - /*
> - * If character is '\r', we may need to look ahead below. Force
> - * fetch of the next character if we don't already have it. We
> - * need to do this before changing CSV state, in case '\r' is also
> - * the quote or escape character.
> - */
> - if (c == '\r')
> + if (delim_len == 0)
> {
> - IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
> + /* When reading entire file, consume all remaining
> bytes at once */
> + input_buf_ptr = copy_buf_len;
> + continue;
> }
> + else
>
> I think that the code become much simpler if we do 'continue' at the
> end of 'if (cstate->opts.format == COPY_FORMAT_RAW)' branch, instead
> of doing 'if (cstate->opts.format == COPY_FORMAT_RAW) {} else ...'.
>
> ---
> @@ -1461,6 +1536,7 @@ CopyReadLineText(CopyFromState cstate)
> return result;
> }
>
> +
> /*
> * Return decimal value for a hexadecimal digit
> */
> @@ -1937,7 +2013,6 @@ endfield:
> return fieldno;
> }
>
> -
> /*
> * Read a binary attribute
> */
>
> Unnecessary line addition and removal.

Thanks for review, I agree on all comments,
awaiting your comment on EOL before implementing changes.

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2024-11-02 11:43:10 Re: Fix for Extra Parenthesis in pgbench progress message
Previous Message Nitin Jadhav 2024-11-02 11:04:27 Re: Count and log pages set all-frozen by vacuum