RE: Parallel INSERT SELECT take 2

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Parallel INSERT SELECT take 2
Date: 2021-04-26 11:26:29
Message-ID: OS0PR01MB57160FF6DF8C1FEC2FB85AE094429@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> > The parallel attributes in table means the parallel safety when user does some
> data-modification operations on it.
> > So, It only limit the use of parallel plan when using INSERT/UPDATE/DELETE.
>
> In that case, isn't it better to use the terminology "PARALLEL DML
> SAFE/UNSAFE/RESTRICTED" in the code and docs? This way, it will be clear that
> these tags don't affect parallel selects.

Makes sense, I recalled I have heart similar suggestion before.
If there are no other objections, I plan to change command to
PARALLEL DML in my next version patches.

> > > I may be wrong here: IIUC, the main problem we are trying to solve
> > > with the declarative approach is to let the user decide parallel
> > > safety for partition tables as it may be costlier for postgres to
> > > determine it. And for the normal tables we can perform parallel
> > > safety checks without incurring much cost. So, I think we should restrict the
> declarative approach to only partitioned tables?
> >
> > Yes, we are tring to avoid overhead when checking parallel safety.
> > The cost to check all the partition's parallel safety is the biggest one.
> > Another is the safety check of index's expression.
> > Currently, for INSERT, the planner does not open the target table's
> > indexinfo and does not parse the expression of the index. We need to
> > parse the expression in planner if we want to do parallel safety check for it
> which can bring some overhead(it will open the index the do the parse in
> executor again).
> > So, we plan to skip all of the extra check and let user take responsibility for
> the safety.
> > Of course, maybe we can try to pass the indexinfo to the executor but it need
> some further refactor and I will take a look into it.
>
> Will the planner parse and check parallel safety of index((where
> clause) expressions in case of SELECTs? I'm not sure of this. But if it does, maybe
> we could do the same thing for parallel DML as well for normal tables?

The planner does not check the index expression, because the expression will not be used in SELECT.
I think the expression is only used when a tuple inserted into the index.

> the overhead of parsing index expressions? If the cost is heavy for checking
> index expressions parallel safety in case of normal tables, then the current
> design i.e. attributing parallel safety tag to all the tables makes sense.

Currently, index expression and predicate are stored in text format.
We need to use stringToNode(expression/predicate) to parse it.
Some committers think doing this twice does not look good,
unless we found some ways to pass parsed info to the executor to avoid the second parse.

> I was actually thinking that we will have the declarative approach only for
> partitioned tables as it is the main problem we are trying to solve with this
> design. Something like: users will run pg_get_parallel_safety to see the parallel
> unsafe objects associated with a partitioned table by looking at all of its
> partitions and be able to set a parallel dml safety tag to only partitioned tables.

We originally want to make the declarative approach for both normal and partitioned table.
In this way, it will not bring any overhead to planner and looks consistency.
But, do you think we should put some really cheap safety check to the planner ?

>
> >0001: provide a utility function pg_get_parallel_safety('table_name').
> >
> >The function returns records of (objid, classid, parallel_safety) that
> >represent the parallel safety of objects that determine the parallel safety of
> the specified table.
> >Note: The function only outputs objects that are not parallel safe.
>
> If it returns only parallel "unsafe" objects and not "safe" or "restricted" objects,
> how about naming it to pg_get_table_parallel_unsafe_objects("table_name")?

Currently, the function returns both unsafe and restricted objects(I thought restricted is also not safe),
because we thought users only care about the objects that affect the use of parallel plan.

> This way we could get rid of parallel_safety in the output record? If at all users
> want to see parallel restricted or safe objects, we can also have the
> counterparts pg_get_table_parallel_safe_objects and
> pg_get_table_parallel_restricted_objects. Of course, we can caution the user
> that execution of these functions might take longer and "might consume
> excessive memory while accumulating the output".
>
> Otherwise, we can have a single function pg_get_parallel_safety("table_name"
> IN, "parallel_safety" IN, "objid"
> OUT, "classid" OUT)? If required, we could name it
> pg_get_parallel_safety_of_table_objects.
>
> Thoughts?

I am sure will user want to get safe objects, do you have some usecases ?
If users do not care the safe objects, I think they can use
"SELECT * FROM pg_get_parallel_safety() where parallel_safety = 'specified safety' " to get the specified objects.

> Although, I have not looked at the patches, few questions on
> pg_get_parallel_safety function:
> 1) Will it parse all the expressions for the objects that are listed under "The
> objects that relate to the parallel safety of a DML target table are as follows:" in
> the upthread?

Yes.
But some parsed expression(such as domain type's expression) can be found in typecache,
we just check the safety for these already parsed expression.

> 2) How will it behave if a partitioned table is passed to it? Will it recurse for all
> the partitions?

Yes, because both parent table and child table will be inserted and the parallel
related objects on them will be executed. If users want to make sure the parallel insert succeed,
they need to check all the objects.

> 3) How will it behave if a foreign table is passed to it? Will it error out?

It currently does not error out.
It will also check the objects on it and return not safe objects.
Note: I consider foreign table itself as a parallel restricted object, because it does not support parallel insert fdw api for now.

> In general:
> 1) Is ALTER SET PARALLEL SAFETY on a partitioned table allowed?

Yes.

> If yes, will it be
> set based on all the partitions parallel safety?

Yes,
But the ALTER PARALLEL command itself does not check all the partition's safety flag.
The function pg_get_parallel_safety can return all the partition's not safe objects,
user should set the parallel safety based the function's result.

> 2) How will users have to decide on parallel safety of a foreign table or a
> partitioned table with foreign partitions? Or is it that we set these tables
> parallel unsafe and don't do parallel inserts?

Foreign table itself is considered as parallel restricted,
because we do not support parallel insert fdw api for now.

Best regards,
houzj

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2021-04-26 11:58:55 Enhanced error message to include hint messages for redundant options error
Previous Message vignesh C 2021-04-26 11:22:24 Re: [HACKERS] logical decoding of two-phase transactions