Re: Parallel copy

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-10-27 13:36:15
Message-ID: CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the comments, please find my thoughts below.
On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> Hi Vignesh,
>
> I took a look at the v8 patch set. Here are some comments:
>
> 1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo()
> or PopulateCopyStateInfo()? And also EstimateCstateSize() --
> EstimateCStateSize(), PopulateCstateCatalogInfo() --
> PopulateCStateCatalogInfo()?
>

Changed as suggested.

> 2. Instead of mentioning numbers like 1024, 64K, 10240 in the
> comments, can we represent them in terms of macros?
> /* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */
> #define MAX_BLOCKS_COUNT 1024
> /*
> * It can hold upto 10240 record information for worker to process. RINGSIZE
>

Changed as suggested.

> 3. How about
> "
> Each worker at once will pick the WORKER_CHUNK_COUNT records from the
> DSM data blocks and store them in it's local memory.
> This is to make workers not contend much while getting record
> information from the DSM. Read RINGSIZE comments before
> changing this value.
> "
> instead of
> /*
> * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
> * block to process to avoid lock contention. Read RINGSIZE comments before
> * changing this value.
> */
>

Rephrased it.

> 4. How about one line gap before and after for comments: "Leader
> should operate in the following order:" and "Worker should operate in
> the following order:"
>

Changed it.

> 5. Can we move RAW_BUF_BYTES macro definition to the beginning of the
> copy.h where all the macro are defined?
>

Change was done as part of another commit & we are using as it is. I
preferred it to be as it is.

> 6. I don't think we need the change in toast_internals.c with the
> temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed));
> in GetCurrentCommandId()
>

Modified it.

> 7. I think
> /* Can't perform copy in parallel */
> if (parallel_workers <= 0)
> return NULL;
> can be
> /* Can't perform copy in parallel */
> if (parallel_workers == 0)
> return NULL;
> as parallel_workers can never be < 0 since we enter BeginParallelCopy
> only if cstate->nworkers > 0 and also we are not allowed to have
> negative values for max_worker_processes.
>

Modified it.

> 8. Do we want to pfree(cstate->pcdata) in case we failed to start any
> parallel workers, we would have allocated a good
> else
> {
> /*
> * Reset nworkers to -1 here. This is useful in cases where user
> * specifies parallel workers, but, no worker is picked up, so go
> * back to non parallel mode value of nworkers.
> */
> cstate->nworkers = -1;
> *processed = CopyFrom(cstate); /* copy from file to database */
> }
>

Added pfree.

> 9. Instead of calling CopyStringToSharedMemory() for each string
> variable, can't we just create a linked list of all the strings that
> need to be copied into shm and call CopyStringToSharedMemory() only
> once? We could avoid 5 function calls?
>

I feel keeping it this way makes the code more readable, and also this
is not in a performance intensive tight loop. I'm retaining the
change as is unless we feel this will make an impact.

> 10. Similar to above comment: can we fill all the required
> cstate->variables inside the function CopyNodeFromSharedMemory() and
> call it only once? In each worker we could save overhead of 5 function
> calls.
>

same as above.

> 11. Looks like CopyStringFromSharedMemory() and
> CopyNodeFromSharedMemory() do almost the same things except
> stringToNode() and pfree(destptr);. Can we have a generic function
> CopyFromSharedMemory() or something else and handle with flag "bool
> isnode" to differentiate the two use cases?
>

Removed CopyStringFromSharedMemory & used CopyNodeFromSharedMemory
appropriately. CopyNodeFromSharedMemory is renamed to
RestoreNodeFromSharedMemory keep the name consistent.

> 12. Can we move below check to the end in IsParallelCopyAllowed()?
> /* Check parallel safety of the trigger functions. */
> if (cstate->rel->trigdesc != NULL &&
> !CheckRelTrigFunParallelSafety(cstate->rel->trigdesc))
> return false;
>

Modified.

> 13. CacheLineInfo(): Instead of goto empty_data_line_update; how about
> having this directly inside the if block as it's being used only once?
>

Have removed the goto by using a macro.

> 14. GetWorkerLine(): How about avoiding goto statements and replacing
> the common code with a always static inline function or a macro?
>

Have removed the goto by using a macro.

> 15. UpdateSharedLineInfo(): Below line is misaligned.
> lineInfo->first_block = blk_pos;
> lineInfo->start_offset = offset;
>

Changed it.

> 16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the
> start of for (;;)?
>

Added it.

> 17. Remove extra lines after #define IsHeaderLine()
> (cstate->header_line && cstate->cur_lineno == 1) in copy.h
>

Modified it.

Attached v9 patches have the fixes for the above comments.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
v9-0001-Copy-code-readjustment-to-support-parallel-copy.patch text/x-patch 30.9 KB
v9-0002-Allow-copy-from-command-to-process-data-from-file.patch text/x-patch 80.8 KB
v9-0003-Documentation-for-parallel-copy.patch text/x-patch 2.7 KB
v9-0004-Tests-for-parallel-copy.patch text/x-patch 34.5 KB
v9-0005-Parallel-Copy-For-Binary-Format-Files.patch text/x-patch 23.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2020-10-27 13:45:13 Re: SQL-standard function body
Previous Message Heikki Linnakangas 2020-10-27 13:23:00 Re: partition routing layering in nodeModifyTable.c