Re: Make COPY format extendable: Extract COPY TO format implementations

From: Sutou Kouhei <kou(at)clear-code(dot)com>
To: michael(at)paquier(dot)xyz
Cc: andres(at)anarazel(dot)de, sawada(dot)mshk(at)gmail(dot)com, zhjwpku(at)gmail(dot)com, andrew(at)dunslane(dot)net, nathandbossart(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2024-02-22 09:39:48
Message-ID: 20240222.183948.518018047578925034.kou@clear-code.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In <ZdbtQJ-p5H1_EDwE(at)paquier(dot)xyz>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 22 Feb 2024 15:44:16 +0900,
Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> I was comparing what you have here, and what's been attached by Andres
> at [1] and the top of the changes on my development branch at [2]
> (v3-0008, mostly). And, it strikes me that there is no need to do any
> major changes in any of the callbacks proposed up to v13 and v14 in
> this thread, as all the changes proposed want to plug in more data
> into each StateData for COPY FROM and COPY TO, the best part being
> that v3-0008 can just reuse the proposed callbacks as-is. v1-0001
> from Sutou-san would need one slight tweak in the per-row callback,
> still that's minor.

I think so too. But I thought that some minor conflicts will
be happen with this and the v15. So I worked on this before
the v15.

We agreed that this optimization doesn't block v15: [1]
So we can work on the v15 without this optimization for now.

[1] https://www.postgresql.org/message-id/flat/20240219195351.5vy7cdl3wxia66kg%40awork3.anarazel.de#20f9677e074fb0f8c5bb3994ef059a15

> I have been spending more time on the patch to introduce the COPY
> APIs, leading me to the v15 attached, where I have replaced the
> previous attribute callbacks for the output representation and the
> reads with hardcoded routines that should be optimized by compilers,
> and I have done more profiling with -O2.

Thanks! I wanted to work on it but I didn't have enough time
for it in a few days...

I've reviewed the v15.

----
> @@ -751,8 +751,9 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
> *
> * NOTE: force_not_null option are not applied to the returned fields.
> */
> -bool
> -NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
> +static bool

"inline" is missing here.

> +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields,
> + bool is_csv)
> {
> int fldct;
----

How about adding "is_csv" to CopyReadline() and
CopyReadLineText() too?

----
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 25b8d4bc52..79fabecc69 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -150,8 +150,8 @@ static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";


/* non-export function prototypes */
-static bool CopyReadLine(CopyFromState cstate);
-static bool CopyReadLineText(CopyFromState cstate);
+static inline bool CopyReadLine(CopyFromState cstate, bool is_csv);
+static inline bool CopyReadLineText(CopyFromState cstate, bool is_csv);
static inline int CopyReadAttributesText(CopyFromState cstate);
static inline int CopyReadAttributesCSV(CopyFromState cstate);
static Datum CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
@@ -770,7 +770,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields,
tupDesc = RelationGetDescr(cstate->rel);

cstate->cur_lineno++;
- done = CopyReadLine(cstate);
+ done = CopyReadLine(cstate, is_csv);

if (cstate->opts.header_line == COPY_HEADER_MATCH)
{
@@ -823,7 +823,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields,
cstate->cur_lineno++;

/* Actually read the line into memory here */
- done = CopyReadLine(cstate);
+ done = CopyReadLine(cstate, is_csv);

/*
* EOF at start of line means we're done. If we see EOF after some
@@ -1133,8 +1133,8 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
* by newline. The terminating newline or EOF marker is not included
* in the final value of line_buf.
*/
-static bool
-CopyReadLine(CopyFromState cstate)
+static inline bool
+CopyReadLine(CopyFromState cstate, bool is_csv)
{
bool result;

@@ -1142,7 +1142,7 @@ CopyReadLine(CopyFromState cstate)
cstate->line_buf_valid = false;

/* Parse data and transfer into line_buf */
- result = CopyReadLineText(cstate);
+ result = CopyReadLineText(cstate, is_csv);

if (result)
{
@@ -1209,8 +1209,8 @@ CopyReadLine(CopyFromState cstate)
/*
* CopyReadLineText - inner loop of CopyReadLine for text mode
*/
-static bool
-CopyReadLineText(CopyFromState cstate)
+static inline bool
+CopyReadLineText(CopyFromState cstate, bool is_csv)
{
char *copy_input_buf;
int input_buf_ptr;
@@ -1226,7 +1226,7 @@ CopyReadLineText(CopyFromState cstate)
char quotec = '\0';
char escapec = '\0';

- if (cstate->opts.csv_mode)
+ if (is_csv)
{
quotec = cstate->opts.quote[0];
escapec = cstate->opts.escape[0];
@@ -1306,7 +1306,7 @@ CopyReadLineText(CopyFromState cstate)
prev_raw_ptr = input_buf_ptr;
c = copy_input_buf[input_buf_ptr++];

- if (cstate->opts.csv_mode)
+ if (is_csv)
{
/*
* If character is '\\' or '\r', we may need to look ahead below.
@@ -1345,7 +1345,7 @@ CopyReadLineText(CopyFromState cstate)
}

/* Process \r */
- if (c == '\r' && (!cstate->opts.csv_mode || !in_quote))
+ if (c == '\r' && (!is_csv || !in_quote))
{
/* Check for \r\n on first line, _and_ handle \r\n. */
if (cstate->eol_type == EOL_UNKNOWN ||
@@ -1373,10 +1373,10 @@ CopyReadLineText(CopyFromState cstate)
if (cstate->eol_type == EOL_CRNL)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
- !cstate->opts.csv_mode ?
+ !is_csv ?
errmsg("literal carriage return found in data") :
errmsg("unquoted carriage return found in data"),
- !cstate->opts.csv_mode ?
+ !is_csv ?
errhint("Use \"\\r\" to represent carriage return.") :
errhint("Use quoted CSV field to represent carriage return.")));

@@ -1390,10 +1390,10 @@ CopyReadLineText(CopyFromState cstate)
else if (cstate->eol_type == EOL_NL)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
- !cstate->opts.csv_mode ?
+ !is_csv ?
errmsg("literal carriage return found in data") :
errmsg("unquoted carriage return found in data"),
- !cstate->opts.csv_mode ?
+ !is_csv ?
errhint("Use \"\\r\" to represent carriage return.") :
errhint("Use quoted CSV field to represent carriage return.")));
/* If reach here, we have found the line terminator */
@@ -1401,15 +1401,15 @@ CopyReadLineText(CopyFromState cstate)
}

/* Process \n */
- if (c == '\n' && (!cstate->opts.csv_mode || !in_quote))
+ if (c == '\n' && (!is_csv || !in_quote))
{
if (cstate->eol_type == EOL_CR || cstate->eol_type == EOL_CRNL)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
- !cstate->opts.csv_mode ?
+ !is_csv ?
errmsg("literal newline found in data") :
errmsg("unquoted newline found in data"),
- !cstate->opts.csv_mode ?
+ !is_csv ?
errhint("Use \"\\n\" to represent newline.") :
errhint("Use quoted CSV field to represent newline.")));
cstate->eol_type = EOL_NL; /* in case not set yet */
@@ -1421,7 +1421,7 @@ CopyReadLineText(CopyFromState cstate)
* In CSV mode, we only recognize \. alone on a line. This is because
* \. is a valid CSV data value.
*/
- if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+ if (c == '\\' && (!is_csv || first_char_in_line))
{
char c2;

@@ -1454,7 +1454,7 @@ CopyReadLineText(CopyFromState cstate)

if (c2 == '\n')
{
- if (!cstate->opts.csv_mode)
+ if (!is_csv)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
errmsg("end-of-copy marker does not match previous newline style")));
@@ -1463,7 +1463,7 @@ CopyReadLineText(CopyFromState cstate)
}
else if (c2 != '\r')
{
- if (!cstate->opts.csv_mode)
+ if (!is_csv)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
errmsg("end-of-copy marker corrupt")));
@@ -1479,7 +1479,7 @@ CopyReadLineText(CopyFromState cstate)

if (c2 != '\r' && c2 != '\n')
{
- if (!cstate->opts.csv_mode)
+ if (!is_csv)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
errmsg("end-of-copy marker corrupt")));
@@ -1508,7 +1508,7 @@ CopyReadLineText(CopyFromState cstate)
result = true; /* report EOF */
break;
}
- else if (!cstate->opts.csv_mode)
+ else if (!is_csv)
{
/*
* If we are here, it means we found a backslash followed by
----

> In this case, I have been able to limit the effects of the per-row
> callback by making NextCopyFromRawFields() local to copyfromparse.c
> while applying some inlining to it. This brings me to a different
> point, why don't we do this change independently on HEAD?

Does this mean that changing

bool NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)

to (adding "static")

static bool NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)

not (adding "static" and "bool is_csv")

static bool NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)

improves performance?

If so, adding the change independently on HEAD makes
sense. But I don't know why that improves
performance... Inlining?

> It's not
> really complicated to make NextCopyFromRawFields show high in the
> profiles. I was looking at external projects, and noticed that
> there's nothing calling NextCopyFromRawFields() directly.

It means that we can hide NextCopyFromRawFields() without
breaking compatibility (because nobody uses it), right?

If so, I also think that we can change
NextCopyFromRawFields() directly.

If we assume that someone (not public code) may use it, we
can create a new internal function and use it something
like:

----
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 7cacd0b752..b1515ead82 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -751,8 +751,8 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
*
* NOTE: force_not_null option are not applied to the returned fields.
*/
-bool
-NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
+static bool
+NextCopyFromRawFieldsInternal(CopyFromState cstate, char ***fields, int *nfields)
{
int fldct;
bool done;
@@ -840,6 +840,12 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
return true;
}

+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
+{
+ return NextCopyFromRawFieldsInternal(cstate, fields, nfields);
+}
+
/*
* Read next tuple from file for COPY FROM. Return false if no more tuples.
*
----

Thanks,
--
kou

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2024-02-22 09:55:42 Re: Test to dump and restore objects left behind by regression
Previous Message Andrew Dunstan 2024-02-22 09:38:32 Re: WIP Incremental JSON Parser