From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Luc Vlaming <luc(at)swarm64(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Zhihong Yu <zyu(at)yugabyte(dot)com> |
Subject: | Re: Parallel Inserts in CREATE TABLE AS |
Date: | 2020-12-30 05:01:53 |
Message-ID: | CAFiTN-tp5LRn2o74neTG+75+mLJt9H5nvPs6iUoTmA_F2C3zZw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Dec 28, 2020 at 10:45 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sun, Dec 27, 2020 at 2:20 PM Bharath Rupireddy
> <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> >
> > On Sat, Dec 26, 2020 at 11:11 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > I have reviewed part of v15-0001 patch, I have a few comments, I will
> > > continue to review this.
> >
> > Thanks a lot.
> >
> > > 1.
> > > Why is this temporary hack? and what is the plan for removing this hack?
> >
> > The changes in xact.c, xact.h and heapam.c are common to all the
> > parallel insert patches - COPY, INSERT INTO SELECT. That was the
> > initial comment, I forgot to keep it in sync with the other patches.
> > Now, I used the comment from INSERT INTO SELECT patch. IIRC, the plan
> > was to have these code in all the parallel inserts patch, whichever
> > gets to review and commit first, others will update their patches
> > accordingly.
> >
> > > 2.
> > > +/*
> > > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel
> > > + * insertion is possible, if yes set the parallel insert state i.e. push down
> > > + * the dest receiver to the Gather nodes.
> > > + */
> > > +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
> > > +{
> > > + if (!IS_CTAS(into))
> > > + return;
> > >
> > > When will this hit? The functtion name suggest that it is from CTAS
> > > but now you have a check that if it is
> > > not for CTAS then return, can you add the comment that when do you
> > > expect this case?
> >
> > Yes it will hit for explain cases, but I choose to remove this and
> > check outside in the explain something like:
> > if (into)
> > ChooseParallelInsertsInCTAS()
> >
> > > Also the function name should start in a new line
> > > i.e
> > > void
> > > ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
> >
> > Ah, missed that. Modified now.
> >
> > > 3.
> > > +/*
> > > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel
> > > + * insertion is possible, if yes set the parallel insert state i.e. push down
> > > + * the dest receiver to the Gather nodes.
> > > + */
> > >
> > > Push down to the Gather nodes? I think the right statement will be
> > > push down below the Gather node.
> >
> > Modified.
> >
> > > 4.
> > > intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
> > > {
> > > DR_intorel *myState = (DR_intorel *) self;
> > >
> > > -- Comment ->in parallel worker we don't need to crease dest recv blah blah
> > > + if (myState->is_parallel_worker)
> > > {
> > > --parallel worker handling--
> > > return;
> > > }
> > >
> > > --non-parallel worker code stay right there, instead of moving to else
> >
> > Done.
> >
> > > 5.
> > > +/*
> > > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel
> > > + * insertion is possible, if yes set the parallel insert state i.e. push down
> > > + * the dest receiver to the Gather nodes.
> > > + */
> > > +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
> > > +{
> > >
> > > From function name and comments it appeared that this function will
> > > return boolean saying whether
> > > Parallel insert should be selected or not. I think name/comment
> > > should be better for this
> >
> > Yeah that function can still return void because no point in returning
> > bool there, since the intention is to see if parallel inserts can be
> > performed, if yes, set the state otherwise exit. I changed the
> > function name to TryParallelizingInsertsInCTAS(). Let me know your
> > suggestions if that doesn't work out.
> >
> > > 6.
> > > /*
> > > + * For parallelizing inserts in CTAS i.e. making each parallel worker
> > > + * insert the tuples, we must send information such as into clause (for
> > > + * each worker to build separate dest receiver), object id (for each
> > > + * worker to open the created table).
> > >
> > > Comment is saying we need to pass object id but the code under this
> > > comment is not doing so.
> >
> > Improved the comment.
> >
> > > 7.
> > > + /*
> > > + * Since there are no rows that are transferred from workers to Gather
> > > + * node, so we set it to 0 to be visible in estimated row count of
> > > + * explain plans.
> > > + */
> > > + queryDesc->planstate->plan->plan_rows = 0;
> > >
> > > This seems a bit hackies Why it is done after the planning, I mean
> > > plan must know that it is returning a 0 rows?
> >
> > This exists to show up the estimated row count(in case of EXPLAIN CTAS
> > without ANALYZE) in the output. For EXPLAIN ANALYZE CTAS actual tuples
> > are shown correctly as 0 because Gather doesn't receive any tuples.
> > if (es->costs)
> > {
> > if (es->format == EXPLAIN_FORMAT_TEXT)
> > {
> > appendStringInfo(es->str, " (cost=%.2f..%.2f rows=%.0f width=%d)",
> > plan->startup_cost, plan->total_cost,
> > plan->plan_rows, plan->plan_width);
> >
> > Since it's an estimated row count(which may not be always correct), we
> > will let the EXPLAIN plan show that and I think we can remove that
> > part. Thoughts?
> >
> > I removed it in v6 patch set.
> >
> > > 8.
> > > + char *intoclause_space = shm_toc_allocate(pcxt->toc,
> > > + intoclause_len);
> > > + memcpy(intoclause_space, intoclausestr, intoclause_len);
> > > + shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space);
> > >
> > > One blank line between variable declaration and next code segment,
> > > take care at other places as well.
> >
> > Done.
> >
> > I'm attaching the v16 patch set. Please note that I added the
> > documentation saying that parallel insertions can happen and a sample
> > output of the explain to 0003 patch as discussed in [1]. But I didn't
> > move the explain output related code to a separate patch because it's
> > a small snippet in explain.c. I hope that's okay.
> >
> > [1] - https://www.postgresql.org/message-id/CAA4eK1JqwXGYoGa1%2B3-f0T50dBGufvKaKQOee_AfFhygZ6QKtA%40mail.gmail.com
> >
>
> Thanks for working on this, I will have a look at the updated patches soon.
I have completed reviewing 0001, I don't have more comments, just one
question. Soon I will review the remaining patches.
+ /* If parallel inserts are to be allowed, set a few extra information. */
+ if (myState->is_parallel)
+ {
+ myState->object_id = intoRelationAddr.objectId;
+
+ /*
+ * We don't need to skip contacting FSM while inserting tuples for
+ * parallel mode, while extending the relations, workers instead of
+ * blocking on a page while another worker is inserting, can check the
+ * FSM for another page that can accommodate the tuples. This results
+ * in major benefit for parallel inserts.
+ */
+ myState->ti_options = 0;
Is there any performance data for this or just theoretical analysis?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Bharath Rupireddy | 2020-12-30 05:17:13 | Re: Parallel Inserts in CREATE TABLE AS |
Previous Message | Thomas Munro | 2020-12-30 04:53:57 | Re: Re: Re: Cache relation sizes? |