Quick Links

Re: Parallel copy

From:	vignesh C <vignesh21(at)gmail(dot)com>
To:	Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Parallel copy
Date:	2020-10-27 15:33:22
Message-ID:	CALDaNm3vatWC0cVzV9UXKiRdEjoBO9hGam66TTasuiqk-Cx5Ew@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Thanks Ashutosh for reviewing and providing your comments.

On Fri, Oct 23, 2020 at 5:43 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi Vignesh,
>
> Thanks for the updated patches. Here are some more comments that I can
> find after reviewing your latest patches:
>
> +/*
> + * This structure helps in storing the common data from CopyStateData that are
> + * required by the workers. This information will then be allocated and stored
> + * into the DSM for the worker to retrieve and copy it to CopyStateData.
> + */
> +typedef struct SerializedParallelCopyState
> +{
> + /* low-level state data */
> + CopyDest copy_dest; /* type of copy source/destination */
> + int file_encoding; /* file or remote side's character encoding */
> + bool need_transcoding; /* file encoding diff from server? */
> + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */
> +
> ...
> ...
> +
> + /* Working state for COPY FROM */
> + AttrNumber num_defaults;
> + Oid relid;
> +} SerializedParallelCopyState;
>
> Can the above structure not be part of the CopyStateData structure? I
> am just asking this question because all the fields present in the
> above structure are also present in the CopyStateData structure. So,
> including it in the CopyStateData structure will reduce the code
> duplication and will also make CopyStateData a bit shorter.
>

I have removed the common members from the structure, now there are no
common members between CopyStateData & the new structure. I'm using
CopyStateData to copy to/from directly in the new patch.

> --
>
> + pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
> + relid);
>
> Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
> function when we are already passing cstate structure, using which
> both of these information can be retrieved ?
>

nworkers need not be passed as you have suggested but relid need to be
passed as we will be setting it to pcdata, modified nworkers as
suggested.

> --
>
> +/* DSM keys for parallel copy. */
> +#define PARALLEL_COPY_KEY_SHARED_INFO 1
> +#define PARALLEL_COPY_KEY_CSTATE 2
> +#define PARALLEL_COPY_WAL_USAGE 3
> +#define PARALLEL_COPY_BUFFER_USAGE 4
>
> DSM key names do not appear to be consistent. For shared info and
> cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
> but for WalUsage and BufferUsage structures, it is prefixed with
> "PARALLEL_COPY". I think it would be better to make them consistent.
>

Modified as suggested

> --
>
> if (resultRelInfo->ri_TrigDesc != NULL &&
> (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
> resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
> {
> /*
> * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
> * triggers on the table. Such triggers might query the table we're
> * inserting into and act differently if the tuples that have already
> * been processed and prepared for insertion are not there.
> */
> insertMethod = CIM_SINGLE;
> }
> else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
> resultRelInfo->ri_TrigDesc->trig_insert_new_table)
> {
> /*
> * For partitioned tables we can't support multi-inserts when there
> * are any statement level insert triggers. It might be possible to
> * allow partitioned tables with such triggers in the future, but for
> * now, CopyMultiInsertInfoFlush expects that any before row insert
> * and statement level insert triggers are on the same relation.
> */
> insertMethod = CIM_SINGLE;
> }
> else if (resultRelInfo->ri_FdwRoutine != NULL ||
> cstate->volatile_defexprs)
> {
> ...
> ...
>
> I think, if possible, all these if-else checks in CopyFrom() can be
> moved to a single function which can probably be named as
> IdentifyCopyInsertMethod() and this function can be called in
> IsParallelCopyAllowed(). This will ensure that in case of Parallel
> Copy when the leader has performed all these checks, the worker won't
> do it again. I also feel that it will make the code look a bit
> cleaner.
>

In the recent patch posted we have changed it to simplify the check
for parallel copy, it is not an exact match. I feel this comment is
not applicable on the latest patch

> --
>
> +void
> +ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
> +{
> ...
> ...
> + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> + &walusage[ParallelWorkerNumber]);
> +
> + MemoryContextSwitchTo(oldcontext);
> + pfree(cstate);
> + return;
> +}
>
> It seems like you also need to delete the memory context
> (cstate->copycontext) here.
>

Added it.

> --
>
> +void
> +ExecBeforeStmtTrigger(CopyState cstate)
> +{
> + EState *estate = CreateExecutorState();
> + ResultRelInfo *resultRelInfo;
>
> This function has a lot of comments which have been copied as it is
> from the CopyFrom function, I think it would be good to remove those
> comments from here and mention that this code changes done in this
> function has been taken from the CopyFrom function. If any queries
> people may refer to the CopyFrom function. This will again avoid the
> unnecessary code in the patch.
>

Changed as suggested.

> --
>
> As Heikki rightly pointed out in his previous email, we need some high
> level description of how Parallel Copy works somewhere in
> copyparallel.c file. For reference, please see how a brief description
> about parallel vacuum has been added in the vacuumlazy.c file.
>
> * Lazy vacuum supports parallel execution with parallel worker processes. In
> * a parallel vacuum, we perform both index vacuum and index cleanup with
> * parallel worker processes. Individual indexes are processed by one vacuum
> ...

Added it in copyparallel.c

This is addressed in v9 patch shared at [1].
[1] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

In response to

Re: Parallel copy at 2020-10-23 12:12:49 from Ashutosh Sharma

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Stephen Frost	2020-10-27 15:35:26	Re: automatic analyze: readahead - add "IO read time" log message
Previous Message	vignesh C	2020-10-27 15:26:41	Re: Parallel copy