Re: Make COPY format extendable: Extract COPY TO format implementations

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Sutou Kouhei <kou(at)clear-code(dot)com>, sawada(dot)mshk(at)gmail(dot)com, zhjwpku(at)gmail(dot)com, andrew(at)dunslane(dot)net, nathandbossart(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2024-02-09 19:27:05
Message-ID: 20240209192705.5qdilvviq3py2voq@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2024-02-09 13:21:34 +0900, Michael Paquier wrote:
> +static void
> +CopyFromTextInFunc(CopyFromState cstate, Oid atttypid,
> + FmgrInfo *finfo, Oid *typioparam)
> +{
> + Oid func_oid;
> +
> + getTypeInputInfo(atttypid, &func_oid, typioparam);
> + fmgr_info(func_oid, finfo);
> +}

FWIW, we should really change the copy code to initialize FunctionCallInfoData
instead of re-initializing that on every call, realy makes a difference
performance wise.

> +/*
> + * CopyFromTextStart
> + *
> + * Start of COPY FROM for text/CSV format.
> + */
> +static void
> +CopyFromTextStart(CopyFromState cstate, TupleDesc tupDesc)
> +{
> + AttrNumber attr_count;
> +
> + /*
> + * If encoding conversion is needed, we need another buffer to hold the
> + * converted input data. Otherwise, we can just point input_buf to the
> + * same buffer as raw_buf.
> + */
> + if (cstate->need_transcoding)
> + {
> + cstate->input_buf = (char *) palloc(INPUT_BUF_SIZE + 1);
> + cstate->input_buf_index = cstate->input_buf_len = 0;
> + }
> + else
> + cstate->input_buf = cstate->raw_buf;
> + cstate->input_reached_eof = false;
> +
> + initStringInfo(&cstate->line_buf);

Seems kinda odd that we have a supposedly extensible API that then stores all
this stuff in the non-extensible CopyFromState.

> + /* create workspace for CopyReadAttributes results */
> + attr_count = list_length(cstate->attnumlist);
> + cstate->max_fields = attr_count;

Why is this here? This seems like generic code, not text format specific.

> + cstate->raw_fields = (char **) palloc(attr_count * sizeof(char *));
> + /* Set read attribute callback */
> + if (cstate->opts.csv_mode)
> + cstate->copy_read_attributes = CopyReadAttributesCSV;
> + else
> + cstate->copy_read_attributes = CopyReadAttributesText;
> +}

Isn't this precisely repeating the mistake of 2889fd23be56?

And, why is this done here? Shouldn't this decision have been made prior to
even calling CopyFromTextStart()?

> +/*
> + * CopyFromTextOneRow
> + *
> + * Copy one row to a set of `values` and `nulls` for the text and CSV
> + * formats.
> + */

I'm very doubtful it's a good idea to combine text and CSV here. They have
basically no shared parsing code, so what's the point in sending them through
one input routine?

> +bool
> +CopyFromTextOneRow(CopyFromState cstate,
> + ExprContext *econtext,
> + Datum *values,
> + bool *nulls)
> +{
> + TupleDesc tupDesc;
> + AttrNumber attr_count;
> + FmgrInfo *in_functions = cstate->in_functions;
> + Oid *typioparams = cstate->typioparams;
> + ExprState **defexprs = cstate->defexprs;
> + char **field_strings;
> + ListCell *cur;
> + int fldct;
> + int fieldno;
> + char *string;
> +
> + tupDesc = RelationGetDescr(cstate->rel);
> + attr_count = list_length(cstate->attnumlist);
> +
> + /* read raw fields in the next line */
> + if (!NextCopyFromRawFields(cstate, &field_strings, &fldct))
> + return false;
> +
> + /* check for overflowing fields */
> + if (attr_count > 0 && fldct > attr_count)
> + ereport(ERROR,
> + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> + errmsg("extra data after last expected column")));

It bothers me that we look to be ending up with different error handling
across the various output formats, particularly if they're ending up in
extensions. That'll make it harder to evolve this code in the future.

> + fieldno = 0;
> +
> + /* Loop to read the user attributes on the line. */
> + foreach(cur, cstate->attnumlist)
> + {
> + int attnum = lfirst_int(cur);
> + int m = attnum - 1;
> + Form_pg_attribute att = TupleDescAttr(tupDesc, m);
> +
> + if (fieldno >= fldct)
> + ereport(ERROR,
> + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> + errmsg("missing data for column \"%s\"",
> + NameStr(att->attname))));
> + string = field_strings[fieldno++];
> +
> + if (cstate->convert_select_flags &&
> + !cstate->convert_select_flags[m])
> + {
> + /* ignore input field, leaving column as NULL */
> + continue;
> + }
> +
> + cstate->cur_attname = NameStr(att->attname);
> + cstate->cur_attval = string;
> +
> + if (cstate->opts.csv_mode)
> + {

More unfortunate intermingling of multiple formats in a single routine.

> +
> + if (cstate->defaults[m])
> + {
> + /*
> + * The caller must supply econtext and have switched into the
> + * per-tuple memory context in it.
> + */
> + Assert(econtext != NULL);
> + Assert(CurrentMemoryContext == econtext->ecxt_per_tuple_memory);
> +
> + values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
> + }

I don't think it's good that we end up with this code in different copy
implementations.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mats Kindahl 2024-02-09 19:35:57 Re: glibc qsort() vulnerability
Previous Message Mats Kindahl 2024-02-09 19:26:48 Re: glibc qsort() vulnerability