From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Joel Jacobson <joel(at)compiler(dot)org> |
Cc: | jian he <jian(dot)universality(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: New "raw" COPY format |
Date: | 2024-11-01 21:28:35 |
Message-ID: | CAD21AoAv5dyh4fTHRhUWPetpV8MxXP-8UY-5ECCGuPxWNYObpg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Oct 30, 2024 at 4:54 AM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> On Wed, Oct 30, 2024, at 09:14, Joel Jacobson wrote:
> > $ psql -f bench_result.sql
>
> Ops, I realized I benchmarked a debug build,
> reran the benchmark with `meson setup build --buildtype=release`,
> and also added benchmarking of HEAD:
>
> [plot image showing time distribution by format and version]
>
>
> For those unable to see the image, here is the same data as a text table:
>
> select format, version, min(time_ms), round(avg(time_ms),2) avg, max(time_ms), round(stddev(time_ms)) as stddev from bench group by 1,2 order by 1,2;
>
> format | version | min | avg | max | stddev
> --------+---------+----------+---------+----------+--------
> csv | HEAD | 2862.537 | 2895.91 | 2968.420 | 27
> csv | v16 | 2834.663 | 2869.02 | 2926.489 | 28
> csv | v17 | 2958.570 | 3002.71 | 3056.949 | 31
> raw | v16 | 1697.899 | 1723.48 | 1798.020 | 22
> raw | v17 | 1720.251 | 1753.56 | 1811.399 | 28
> text | HEAD | 2442.845 | 2476.60 | 2543.767 | 33
> text | v16 | 2358.062 | 2389.06 | 2472.382 | 32
> text | v17 | 2517.778 | 2552.65 | 2621.161 | 30
> (8 rows)
>
> I think the improvement for the 'text' format between HEAD and v16 could be just noise,
> possibly due to different binary layout when compiling.
>
> v17 (single function) seems slower than v16 (separate functions), also in release build.
>
Here are review comments on v17 patches:
+ data values. Unlike in other formats, the delimiter in
+ <literal>raw</literal> format can be any string, including multi-byte
+ characters. If no <literal>DELIMITER</literal> is specified, the entire
+ input or output is treated as a single data value.
As I mentioned in a separate email, if we use the OS default EOL as
the default EOL in raw format, it would not be necessary to allow it
to be multi characters. I think it's worth considering it.
---
* copyfromparse.c
* Parse CSV/text/binary format for COPY FROM.
We need to update here as well.
--
- * 4. CopyReadAttributesText/CSV() function takes the input line from
+ * 4. CopyReadAttributesText/CSV/Raw() function takes the input line from
I think we don't have CopyReadAttributesRaw() function now.
---
I think it would be better to explain how to parse data in raw mode,
especially which steps in the pipeline we skip, in the comment at the
top of copyfromparse.c.
---
- if (cstate->opts.format == COPY_FORMAT_CSV)
+ if (cstate->opts.format == COPY_FORMAT_RAW)
{
- /*
- * If character is '\r', we may need to look ahead below. Force
- * fetch of the next character if we don't already have it. We
- * need to do this before changing CSV state, in case '\r' is also
- * the quote or escape character.
- */
- if (c == '\r')
+ if (delim_len == 0)
{
- IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
+ /* When reading entire file, consume all remaining
bytes at once */
+ input_buf_ptr = copy_buf_len;
+ continue;
}
+ else
I think that the code become much simpler if we do 'continue' at the
end of 'if (cstate->opts.format == COPY_FORMAT_RAW)' branch, instead
of doing 'if (cstate->opts.format == COPY_FORMAT_RAW) {} else ...'.
---
@@ -1461,6 +1536,7 @@ CopyReadLineText(CopyFromState cstate)
return result;
}
+
/*
* Return decimal value for a hexadecimal digit
*/
@@ -1937,7 +2013,6 @@ endfield:
return fieldno;
}
-
/*
* Read a binary attribute
*/
Unnecessary line addition and removal.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Kirill Reshke | 2024-11-01 21:36:51 | Re: Using read_stream in index vacuum |
Previous Message | Masahiko Sawada | 2024-11-01 21:23:30 | Re: UUID v7 |