From: | Alastair Turner <minion(at)decodable(dot)me> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Parallel copy |
Date: | 2020-02-14 13:45:55 |
Message-ID: | CAC0Gmyxf8xV9bbvPaJMEepDGC3cUoe=SQObzr4sMU8Ps8rptsg@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, 14 Feb 2020 at 11:57, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >
> > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
...
> > > Another approach that came up during an offlist discussion with Robert
> > > is that we have one dedicated worker for reading the chunks from file
> > > and it copies the complete tuples of one chunk in the shared memory
> > > and once that is done, it hands over that chunk to another worker which
> > > can process tuples in that area. We can imagine that the reader
> > > worker is responsible to form some sort of work queue that can be
> > > processed by the other workers. In this idea, we won't be able to get
> > > the benefit of initial tokenization (forming tuple boundaries) via
> > > parallel workers and might need some additional memory processing as
> > > after reader worker has handed the initial shared memory segment, we
> > > need to somehow identify tuple boundaries and then process them.
>
Parsing rows from the raw input (the work done by CopyReadLine()) in a
single process would accommodate line breaks inside quoted fields. I don't
think there's a way for parallel workers to track the
in-quote/out-of-quote state this requires. A single worker could also
process a stream without having to reread or rewind it, so it could handle
input from STDIN or PROGRAM sources, making the improvements applicable to
load operations done by third-party tools and scripted \copy in psql.
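
To illustrate the quoting problem, here is a minimal sketch in Python (not
PostgreSQL code, and not how CopyReadLine() is written). It assumes CSV
input where the quote character is `"` and an embedded quote is escaped by
doubling it; split_rows() is a hypothetical name. The in-quote state is
carried forward from the start of the stream, which is exactly what a
worker starting in the middle of a file does not have.

```python
import io


def split_rows(stream, quote='"'):
    """Yield complete CSV rows from a text stream in one forward pass.

    Assumes the quote character doubles as its own escape, so a doubled
    quote inside a field adds two to the count and leaves the state as is.
    """
    in_quote = False
    pending = []
    for line in stream:
        # An odd number of quote characters on a line flips the state.
        if line.count(quote) % 2 == 1:
            in_quote = not in_quote
        pending.append(line)
        if not in_quote:
            # The newline ending this line really terminates a row.
            yield "".join(pending)
            pending = []
    if pending:
        # Input ended inside a quoted field; pass the remainder through.
        yield "".join(pending)


data = 'id,note\n1,"first line\nsecond line"\n2,plain\n'

# Read from the start, the embedded newline stays inside its row.
rows = list(split_rows(io.StringIO(data)))

# A reader dropped in after the embedded newline sees the closing quote
# as an opening one and glues the following row onto it.
offset = data.index("second line")
misparsed = list(split_rows(io.StringIO(data[offset:])))
```

Here `rows` holds three rows, while `misparsed` collapses two physical rows
into one because the starting state was guessed rather than known.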
> >
...
>
> > > Another thing we need to figure out is how many workers to use for
> > > the copy command. I think we can base it on the file size, which
> > > needs some experiments, or maybe on user input.
> >
> > It seems like we don't even really have a general model for that sort
> > of thing in the rest of the system yet, and I guess some kind of
> > fairly dumb explicit system would make sense in the early days...
> >
>
> makes sense.
>
The right ratio of chunking or line-parsing processes to parallel workers
would vary with the width of the table, the complexity of the data or file
(dates, encoding conversions), the complexity of the constraints and the
acceptable impact of the load. Being able to control it through user input
would be great.
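
As a rough sketch of the shape being discussed (one reader feeding a pool
whose size the user picks), something like the following Python would do.
This is purely illustrative and uses threads and an in-process queue
rather than PostgreSQL background workers and shared memory;
parallel_load(), process_row and num_workers are hypothetical names, and
error handling is left out.

```python
import queue
import threading


def parallel_load(rows, process_row, num_workers):
    """One reader enqueues complete rows; num_workers threads consume them."""
    work = queue.Queue(maxsize=1024)  # bounded, so the reader cannot run far ahead
    results = []
    results_lock = threading.Lock()
    done = object()  # sentinel telling a worker to stop

    def worker():
        while True:
            row = work.get()
            if row is done:
                break
            out = process_row(row)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # The single reader: the only place that touches the input stream.
    for row in rows:
        work.put(row)
    for _ in threads:
        work.put(done)
    for t in threads:
        t.join()
    return results  # completion order, not input order


loaded = parallel_load(["1,a", "2,b", "3,c"], lambda r: r.split(","), num_workers=2)
```

With num_workers exposed like this, a wide table with expensive conversions
could be given more workers per reader than a narrow one, without the
reader side changing at all.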
--
Alastair