From: | Alex K <kondratov(dot)aleksey(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Anastasia Lubennikova <lubennikovaAV(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Subject: | Re: Parallel COPY FROM execution |
Date: | 2017-08-11 13:55:56 |
Message-ID: | CADfU8WxY4s73j8QSVXuk7qwLzK_M5ANp6kftB-HMfQHBAeBd=Q@mail.gmail.com |
Lists: | pgsql-hackers |
Greetings pgsql-hackers,
I have completed the general infrastructure for parallel COPY FROM execution.
It consists of a main (master) process and multiple BGWorkers, each connected
to the master through its own message queue (shm_mq).
Master process does:
- Allocates dynamic shared memory holding the parallel state shared by
  the BGWorkers and the master
- Attaches every worker to its own message queue (shm_mq)
- Waits for worker initialization using Latch
- Reads raw text lines using CopyReadLine and puts them into the shm_mq's
  in round-robin order to balance the load across queues
- When EOF is reached, sends a zero-length message to each worker; a worker
  shuts down safely when it receives it
- Waits until the workers complete their jobs using ConditionVariable
Each BGWorker does:
- Signals the master on initialization via Latch
- Receives raw text lines over its own shm_mq and puts them into
  the log (for now)
- Reinitializes the db connection using the same db_id and user_id as the
  main process
- Signals the master via ConditionVariable when its job is done
All parallel state modifications are done under LWLocks.
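
To make the message flow above concrete, here is a minimal model of it. This is not PostgreSQL code and not the code from the patch: it is a Python sketch in which threads stand in for BGWorkers, queue.Queue stands in for each worker's shm_mq, and thread join stands in for the final wait. The names worker, distribute and EOF_MESSAGE are hypothetical. It assumes every input line keeps its trailing newline, so a real line is never zero-length and cannot be confused with the EOF message.

```python
import queue
import threading

EOF_MESSAGE = b""  # zero-length message tells a worker to shut down


def worker(inbox, received):
    """Drain the worker's own queue until the zero-length message arrives."""
    while True:
        message = inbox.get()
        if message == EOF_MESSAGE:
            break
        received.append(message)  # stand-in for "put the line into the log"


def distribute(lines, n_workers):
    """Send lines to per-worker queues round-robin; return what each got."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    received = [[] for _ in range(n_workers)]
    threads = [
        threading.Thread(target=worker, args=(inboxes[i], received[i]))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    # Round-robin: line i goes to worker i mod n_workers.
    for i, line in enumerate(lines):
        inboxes[i % n_workers].put(line)
    # End of input: one zero-length message per worker.
    for inbox in inboxes:
        inbox.put(EOF_MESSAGE)
    # Wait until every worker has finished its job.
    for t in threads:
        t.join()
    return received
```

Calling distribute(lines, 2) on five lines hands lines 1, 3 and 5 to the first worker and lines 2 and 4 to the second, in order.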
The current code is here: https://github.com/ololobus/postgres/pull/2/files
(it is still in progress, so it contains a lot of duplication and comments
that are to be deleted).
To move forward I have to overcome some obstacles:
- Currently all copy.c functions are designed to work with one giant
  structure, CopyState. It includes buffers, many initial parameters that
  stay unchanged, and a few variables that vary during COPY FROM execution.
  Since I need all these parameters, I have to obtain them somehow inside
  each BGWorker process. I see two possible solutions here:
  1) Perform BeginCopyFrom initialization inside the master and put the
     required parameters into shared memory. However, many of them are
     arrays of variable size (e.g. partition_tupconv_maps,
     force_notnull_flags), so I cannot put them into shmem inside one
     single struct. The best idea I have is to put each parameter under
     its own shmem TOC key, which seems quite ugly.
  2) Perform BeginCopyFrom initialization inside each BGWorker. However,
     it also opens the input file/pipe for reading, which is not necessary
     for workers and may cause some interference with the master, but I
     can modify BeginCopyFrom.
- I have used both Latch and ConditionVariable for the same purpose
  (waiting until some signal occurs), and to me as an end user they behave
  quite similarly. I looked into the condition_variable.c code, and it uses
  Latch and SpinLock under the hood. So what are the differences and the
  advantages/disadvantages of Latch versus ConditionVariable?
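
As an illustration of the difficulty in option 1 of the first obstacle: a fixed-size struct cannot hold variable-size arrays, so any single-buffer layout needs a header that records how long each array is. The sketch below shows that idea only. It is not PostgreSQL code and not something the patch does: Python bytes stand in for a shared memory segment, pack_params and unpack_params are hypothetical names, and column_ids is a made-up second array used next to force_notnull_flags purely for illustration.

```python
import struct

HEADER = "<II"  # assumed layout: element counts of the two arrays


def pack_params(force_notnull_flags, column_ids):
    """Pack two variable-size arrays into one contiguous buffer."""
    header = struct.pack(HEADER, len(force_notnull_flags), len(column_ids))
    # One byte per boolean flag, then one 4-byte signed int per id.
    flags = struct.pack(f"<{len(force_notnull_flags)}?", *force_notnull_flags)
    ids = struct.pack(f"<{len(column_ids)}i", *column_ids)
    return header + flags + ids


def unpack_params(buf):
    """Recover both arrays using the counts stored in the header."""
    n_flags, n_ids = struct.unpack_from(HEADER, buf, 0)
    offset = struct.calcsize(HEADER)
    flags = list(struct.unpack_from(f"<{n_flags}?", buf, offset))
    offset += n_flags  # each flag occupies exactly one byte
    ids = list(struct.unpack_from(f"<{n_ids}i", buf, offset))
    return flags, ids
```

The master would write such a buffer once and each worker would read it back. Whether this is less ugly than one TOC key per parameter is exactly the open question.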
I would be glad if someone could help me find an answer to my question; any
comments and remarks on the overall COPY FROM processing architecture are
also very welcome.
Alexey