From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-04-08 11:12:13
Message-ID: CAA4eK1+u6Mk9RpkeMOMqAt3EKiJ=MvBArHPvzD1uAWx3n+sCeQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 7, 2020 at 7:08 PM Ants Aasma <ants(at)cybertec(dot)at> wrote:
>
> On Tue, 7 Apr 2020 at 08:24, vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > The leader will create a circular queue and share it with the
> > workers. The circular queue will be present in DSM. The leader will
> > use a fixed-size queue to share the contents between the leader and
> > the workers; currently the queue will have 100 elements. It will be
> > created before the workers are started and shared with the workers.
> > The data structures required by the parallel workers will be
> > initialized by the leader, the size required in the DSM will be
> > calculated, and the necessary keys will be loaded into the DSM. The
> > specified number of workers will then be launched. The leader will
> > read the table data from the file and copy the contents into the
> > queue element by element. Each element in the queue will have a 64K
> > DSA chunk, which will be used to store tuple contents from the file.
> > The leader will try to copy as much content as possible into one 64K
> > DSA queue element; we intend to store at least one tuple in each
> > queue element. There are some cases where the 64K space may not be
> > enough to store a single tuple, mostly where the table has toast
> > data and a single tuple can be more than 64K in size. In these
> > scenarios we will extend the DSA space accordingly. We cannot change
> > the size of the DSM once the workers are launched, whereas with DSA
> > we can free the DSA pointer and reallocate it based on the memory
> > size required. This is the very reason for choosing DSA over DSM for
> > storing the data that must be inserted into the relation.
>
> I think the element-based approach and the requirement that all
> tuples fit into the queue make things unnecessarily complex. The
> approach I detailed earlier allows tuples to be bigger than the
> buffer. In that case a worker will claim the long tuple from the ring
> queue of tuple start positions and start copying it into its local
> line_buf. This can wrap around the buffer multiple times until the
> next start position shows up. At that point this worker can proceed
> with inserting the tuple, and the next worker will claim the next
> tuple.
>
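
To make sure I am reading your approach correctly, here is a rough
plain-C sketch of the claim-and-copy step as I understand it. All
names, sizes, and the sentinel convention are hypothetical stand-ins;
in reality this would live in DSM, use proper atomics, and wait for
the leader instead of failing, so treat it only as a sketch:

#include <stdint.h>

#define DATA_BUF_SIZE   (64 * 1024)     /* shared ring of raw input bytes */
#define MAX_TUPLES      1024            /* shared ring of tuple start offsets */

typedef struct ParallelCopyShared
{
    char        data[DATA_BUF_SIZE];    /* bytes read from the input file */
    uint64_t    tuple_start[MAX_TUPLES + 1];    /* start offset of each tuple,
                                                 * plus one sentinel marking the
                                                 * end of the last tuple */
    uint64_t    ntuples_found;          /* published by the leader */
    uint64_t    next_tuple;             /* next index to claim (atomic in reality) */
} ParallelCopyShared;

/*
 * Worker side: claim one tuple and copy it into the local line_buf, wrapping
 * around the data ring as often as needed.  Flow control against the leader
 * overwriting not-yet-copied bytes is omitted here.  Returns the tuple
 * length, or -1 if there is nothing to claim or line_buf is too small.
 */
static int64_t
claim_and_copy_tuple(ParallelCopyShared *shared, char *line_buf,
                     uint64_t line_buf_size)
{
    uint64_t    idx;
    uint64_t    start;
    uint64_t    end;
    uint64_t    len;
    uint64_t    i;

    if (shared->next_tuple >= shared->ntuples_found)
        return -1;                      /* nothing left to claim */
    idx = shared->next_tuple++;         /* atomic fetch-and-add in reality */

    start = shared->tuple_start[idx];
    end = shared->tuple_start[idx + 1]; /* next start marks this tuple's end */
    len = end - start;
    if (len > line_buf_size)
        return -1;                      /* caller must enlarge line_buf */

    /* Copy byte by byte so the ring can wrap around any number of times. */
    for (i = 0; i < len; i++)
        line_buf[i] = shared->data[(start + i) % DATA_BUF_SIZE];

    return (int64_t) len;               /* tuple is now local; go insert it */
}
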
IIUC, with the fixed-size buffer, parallelism might take a bit of a
hit because the reader process won't be able to continue until the
worker has copied the data from the shared buffer to its local buffer.
I think somewhat more leader-worker coordination will be required with
the fixed buffer size. However, as you pointed out, we can't allow the
buffer to grow to the maximum possible size for all tuples as that
might require a lot of memory. One idea could be that we allow it for
the first such tuple, and then if any other element/chunk in the queue
requires more memory than the default 64KB, we always fall back to
using the memory we have allocated for the first chunk. This will
allow us to avoid using extra memory except for one tuple and won't
hurt parallelism much, as in many cases not all tuples will be so
large.
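
A minimal sketch of that fallback idea, again in plain C with
hypothetical names (in the patch the 64KB chunks would be DSA
allocations in shared memory and the waiting would use proper
synchronization rather than a bare retry):

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEFAULT_CHUNK_SIZE  (64 * 1024)
#define QUEUE_SIZE          100

typedef struct CopyQueueElement
{
    char        fixed[DEFAULT_CHUNK_SIZE];  /* the normal 64KB data area */
    size_t      used;                       /* bytes of tuple data stored so far */
    bool        uses_big_chunk;             /* this element spilled into the big chunk */
} CopyQueueElement;

typedef struct CopyQueue
{
    CopyQueueElement    elems[QUEUE_SIZE];
    char               *big_chunk;          /* allocated once, for the first oversized tuple */
    size_t              big_chunk_size;
    bool                big_chunk_in_use;   /* at most one oversized tuple in flight */
} CopyQueue;

/*
 * Leader side: reserve space for one tuple of 'len' bytes in the given queue
 * element.  The first oversized tuple gets a single enlarged allocation; any
 * later oversized tuple falls back to reusing that one allocation, so the
 * caller must retry (i.e. wait) while it is busy.
 */
static char *
reserve_space_for_tuple(CopyQueue *q, CopyQueueElement *el, size_t len)
{
    if (len <= DEFAULT_CHUNK_SIZE - el->used)
        return el->fixed + el->used;        /* common case: fits in the 64KB chunk */

    if (q->big_chunk == NULL)
    {
        /* First oversized tuple: allocate the one big chunk we allow. */
        q->big_chunk = malloc(len);
        if (q->big_chunk == NULL)
            return NULL;
        q->big_chunk_size = len;
    }
    else if (q->big_chunk_in_use || len > q->big_chunk_size)
    {
        /* Busy or too small: caller waits and retries once it is free. */
        return NULL;
    }

    q->big_chunk_in_use = true;
    el->uses_big_chunk = true;
    return q->big_chunk;
}

/* Worker side: done with an oversized tuple, let the leader reuse the chunk. */
static void
release_big_chunk(CopyQueue *q, CopyQueueElement *el)
{
    el->uses_big_chunk = false;
    q->big_chunk_in_use = false;
}
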
I think in the proposed approach a queue element is nothing but a way
to divide the work among workers based on size rather than based on
the number of tuples. If we instead try to divide the work among
workers based on start offsets, it can be more tricky, because it
could lead either to a lot of contention if we choose, say, one offset
per worker (basically copy the data for one tuple, process it, and
then pick the next tuple), or to an unequal division of work because
some tuples can be smaller and others can be bigger. I guess division
based on size would be a better idea. OTOH, I see the advantage of
your approach as well and will think more on it.
>
> > We had a couple of options for the way in which queue elements can
> > be stored. Option 1: each element (DSA chunk) will contain tuples
> > such that each tuple is preceded by its length, so the tuples will
> > be arranged like (length of tuple-1, tuple-1), (length of tuple-2,
> > tuple-2), .... Option 2: each element (DSA chunk) will contain only
> > the tuples (tuple-1), (tuple-2), ....., and we will have a second
> > ring buffer which contains the start offset or length of each tuple.
> > The old design used to generate one tuple of data and process it
> > tuple by tuple. In the new design, the server will generate multiple
> > tuples of data per queue element, and the worker will then process
> > the data tuple by tuple. As we are processing the data tuple by
> > tuple, I felt both of the options are almost the same. However,
> > Option 1 was chosen over Option 2 as we can save some space that
> > would be required by another variable in each element of the queue.
>
> With option 1 it's not possible to read input data directly into
> shared memory, and there needs to be an extra memcpy in the
> time-critical sequential flow of the leader. With option 2 the data
> could be read directly into the shared memory buffer. With future
> async I/O support, reading and looking for tuple boundaries could be
> performed concurrently.
>
Yeah, option-2 sounds better.
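
Just to be sure we mean the same layout, a rough plain-C sketch of
option 2 as I picture it (names and sizes are hypothetical; in the
patch both buffers would live in DSM, wraparound and flow control are
omitted, and the read could eventually be done asynchronously):

#include <stdint.h>
#include <unistd.h>

#define DATA_RING_SIZE      (64 * 1024)     /* raw bytes, tuples stored back to back */
#define OFFSET_RING_SIZE    8192            /* one entry per tuple */

typedef struct CopyDataRing
{
    char        data[DATA_RING_SIZE];   /* input is read directly into this buffer */
    uint64_t    write_pos;              /* leader's append position */
} CopyDataRing;

typedef struct CopyOffsetRing
{
    uint64_t    tuple_start[OFFSET_RING_SIZE];  /* start offset of each tuple in data[] */
    uint64_t    ntuples;                        /* published tuple count */
} CopyOffsetRing;

/*
 * Leader: read() straight into the shared data ring (no intermediate memcpy),
 * then scan the new bytes for tuple boundaries and publish their start
 * offsets in the second ring.  The very first tuple is assumed to have been
 * published at offset 0 during setup, and the boundary test is deliberately
 * simplistic.
 */
static void
leader_fill(int fd, CopyDataRing *dr, CopyOffsetRing *offs)
{
    ssize_t     nread;
    ssize_t     i;

    nread = read(fd, dr->data + dr->write_pos,
                 DATA_RING_SIZE - dr->write_pos);

    for (i = 0; i < nread; i++)
    {
        if (dr->data[dr->write_pos + i] == '\n')
            offs->tuple_start[offs->ntuples++ % OFFSET_RING_SIZE] =
                dr->write_pos + i + 1;
    }

    if (nread > 0)
        dr->write_pos += nread;
}
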
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com