Re: Determine parallel-safety of partition relations for Inserts

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)fujitsu(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Determine parallel-safety of partition relations for Inserts
Date: 2021-01-18 03:04:38
Message-ID: CAA4eK1JjT2AOMNUHHBVTeMBbvFObSveZWe5jE8bcvV5cd7yzoQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 17, 2021 at 4:45 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
>
> On Sat, Jan 16, 2021 at 2:02 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Fri, Jan 15, 2021 at 6:45 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > > On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > We want to do this for Inserts where only Select can be parallel and
> > > > Inserts will always be done by the leader backend. This is actually
> > > > the case we first want to implement.
> > >
> > > Sorry, I haven't looked at the linked threads and the latest patches
> > > there closely enough yet, so I may be misreading this, but if the
> > > inserts will always be done by the leader backend as you say, then why
> > > does the planner need to be checking the parallel safety of the
> > > *target* table's expressions?
> > >
> >
> > The reason is that once we enter parallel-mode we can't allow
> > parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We
> > enter the parallel-mode at the beginning of the statement execution,
> > see ExecutePlan(). So, the Insert will be performed in parallel-mode
> > even though it happens in the leader backend. It is not possible that
> > we finish getting all the tuples from the gather node first and then
> > start inserting. Even if we somehow find a way to make this work,
> > the checks being discussed will still be required to make inserts
> > parallel (where inserts will be performed by workers), which is
> > actually the next patch in the thread I mentioned in the previous
> > email.
> >
> > Does this answer your question?
>
> Yes, thanks for the explanation. I kind of figured that doing the
> insert part itself in parallel using workers would be a part of the
> end goal of this work, although that didn't come across immediately.
>
> It's a bit unfortunate that the parallel safety check of the
> individual partitions cannot be deferred until it's known that a given
> partition will be affected by the command at all. Will we need
> fundamental changes to how parallel query works to make that possible?
> If so, have such options been considered in these projects?
>

I think it is quite fundamental to how parallel query works, and we
might not be able to change it, for reasons such as: (a) the optimizer
would end up generating a lot of parallel paths that are not safe to
use and that we would discard in the end; (b) if, after inserting into
a few partitions, we find that the next partition to be inserted into
has some parallel-unsafe constraint, we would need to give up the
execution and restart the statement, planning it again with only
non-parallel paths. We could optimize this by retaining both the
parallel and the non-parallel plan, so that if the parallel plan fails
we can fall back to the non-parallel one, but that still doesn't seem
like an advisable approach.
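
To make point (b) concrete, here is a toy model in Python. It is not
PostgreSQL code, and every name in it (ParallelUnsafe,
run_with_lazy_checks, run_with_eager_checks, the partition labels) is
hypothetical. It only contrasts checking a partition when it is first
touched with deciding up front.

```python
class ParallelUnsafe(Exception):
    """Raised when a partition turns out to be parallel-unsafe mid-execution."""


def run_with_lazy_checks(rows, unsafe_partitions):
    """Insert rows one by one, checking a partition only when first touched.

    `rows` is a list of (partition_name, value) pairs; `unsafe_partitions`
    is the set of partitions that carry a parallel-unsafe constraint.
    """
    inserted = []
    checked = set()
    for part, value in rows:
        if part not in checked:
            checked.add(part)
            if part in unsafe_partitions:
                # Rows already in `inserted` were written in parallel mode;
                # the statement has to be abandoned and re-planned.
                raise ParallelUnsafe(
                    f"{part} found unsafe after {len(inserted)} rows"
                )
        inserted.append((part, value))
    return inserted


def run_with_eager_checks(rows, all_partitions, unsafe_partitions):
    """Check every partition before execution and pick the plan type once."""
    if any(p in unsafe_partitions for p in all_partitions):
        mode = "serial"
    else:
        mode = "parallel"
    return mode, list(rows)
```

In the lazy variant the problem surfaces only after some rows have
been inserted, which is the give-up-and-restart situation described
above. The eager variant pays for checking every partition, but it
never has to undo work.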

The extra time spent in the optimizer will pay off well through
parallel execution. As pointed out earlier, you can see one of the
results shared on the other thread [1]. Cases that cannot benefit
(say, when the underlying plan is non-parallel) may see some impact,
but we have not tested that in detail yet. The ideas we have
discussed so far to address that are: (a) postpone the parallel-safety
checks for partitions until there are some partial paths underneath
(from which parallel paths can be generated), but that has the
downside that we will end up generating partial paths when they are
not really required; (b) have a rel option like parallel_dml_workers,
or use the existing option parallel_workers, to allow considering
parallel insert for a relation. Any better ideas?

> If such
> changes are not possible in the short term, like for v14, we should at
> least try to make sure that the eager checking of all partitions is
> only performed if using parallelism is possible at all.
>

As of now, we first check whether it is safe to generate a parallel
plan for the underlying select (in Insert into ... Select ...) and
only then perform the parallel-safety checks for the DML. We can
postpone it further as suggested above in (a).
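
The ordering can be sketched with a small Python model. Again, this is
an illustration under assumptions rather than the planner code: the
Partition and InsertSelect classes, the parallel_safe flag, and
can_use_parallel_plan are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Partition:
    name: str
    # Hypothetical flag standing in for "all expressions attached to this
    # partition are parallel-safe".
    parallel_safe: bool


@dataclass
class InsertSelect:
    select_parallel_safe: bool
    partitions: List[Partition]
    # Records which partitions were actually inspected, for illustration.
    checked: List[str] = field(default_factory=list)


def can_use_parallel_plan(stmt: InsertSelect) -> bool:
    # Step 1: the check on the underlying SELECT comes first. If it
    # fails, no partition is inspected at all.
    if not stmt.select_parallel_safe:
        return False
    # Step 2: only now walk the target table's partitions, stopping at
    # the first parallel-unsafe one.
    for part in stmt.partitions:
        stmt.checked.append(part.name)
        if not part.parallel_safe:
            return False
    return True
```

With this ordering, a statement whose select is not parallel-safe
never pays for the per-partition checks; idea (a) would add one more
gate in front of step 2.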

> I will try to take a look at the patches themselves to see if there's
> something I know that will help.
>

Thank you. It will be really helpful if you can do that.

[1] - https://www.postgresql.org/message-id/b54f2e306780449093c311118cd8a04e%40G08CNEXMBPEKD05.g08.fujitsu.local

--
With Regards,
Amit Kapila.
