From: | Amit Langote <amitlangote09(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: generic plans and "initial" pruning |
Date: | 2022-02-10 08:13:52 |
Message-ID: | CA+HiwqFOsJ21nmvoKtPkzxKfYzACajj-mAOoJ+-y_O6g+-v-aA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Jan 13, 2022 at 3:20 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jan 12, 2022 at 9:32 AM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > Or, maybe this won't be a concern if performing ExecutorStart() is
> > made a part of CheckCachedPlan() somehow, which would then take locks
> > on the relation as the PlanState tree is built capturing any plan
> > invalidations, instead of AcquireExecutorLocks(). That does sound like
> > an ambitious undertaking though.
>
> On the surface that would seem to involve abstraction violations, but
> maybe that could be finessed somehow. The plancache shouldn't know too
> much about what the executor is going to do with the plan, but it
> could ask the executor to perform a step that has been designed for
> use by the plancache. I guess the core problem here is how to pass
> around information that is node-specific before we've stood up the
> executor state tree. Maybe the executor could have a function that
> does the pruning and returns some kind of array of results that can be
> used both to decide what to lock and also what to consider as pruned
> at the start of execution. (I'm hand-waving about the details because
> I don't know.)
The attached patch implements this idea. Sorry for the delay in
getting this out and thanks to Robert for the off-list discussions on
this.
So the new executor "step" you mention is the function ExecutorPrep in
the patch, which calls a recursive function ExecPrepNode on the plan
tree's top node, much as ExecutorStart calls (via InitPlan)
ExecInitNode to construct a PlanState tree for actual execution
paralleling the plan tree.
For now, ExecutorPrep() / ExecPrepNode() does mainly two things if and
as it walks the plan tree: 1) Extract the RT indexes of RTE_RELATION
entries and add them to a bitmapset in the result struct, 2) If the
node contains a PartitionPruneInfo, perform its "initial pruning
steps" and store the result of doing so in a per-plan-node node called
PlanPrepOutput. The bitmapset and the array containing per-plan-node
PlanPrepOutput nodes are returned in a node called ExecPrepOutput,
which is the result of ExecutorPrep, to its calling module (say,
plancache.c), which, after it's done using that information, must pass
it forward to subsequent execution steps. That is done by passing it,
via the module's callers, to CreateQueryDesc() which remembers the
ExecPrepOutput in QueryDesc that is eventually passed to
ExecutorStart().
A bunch of other details are mentioned in the patch's commit message,
which I'm pasting below for anyone reading to spot any obvious flaws
(no-go's) of this approach:
Invent a new executor "prep" phase
The new phase, implemented by execMain.c:ExecutorPrep() and its
recursive underling execProcnode.c:ExecPrepNode(), takes a query's
PlannedStmt and processes the plan tree contained in it to produce
a ExecPrepOutput node as result.
As the plan tree is walked, each node must add the RT index(es) of
any relation(s) that it directly manipulates to a bitmapset member of
ExecPrepOutput (for example, an IndexScan node must add the Scan's
scanrelid). Also, each node may want to make a PlanPrepOutput node
containing additional information that may be of interest to the
calling module or to the later execution phases, if the node can
provide one (for example, an Append node may perform initial pruning
and add a set of "initially valid subplans" to the PlanPrepOutput).
The PlanPrepOutput nodess of all the plan nodes are added to an array
in the ExecPrepOutput, which is indexed using the individual nodes'
plan_node_id; a NULL is stored in the array slots of nodes that
don't have anything interesting to add to the PlanPrepOutput.
The ExecPrepOutput thus produced is passed to CreateQueryDesc()
and subsequently to ExecutorStart() via QueryDesc, which then makes
it available to the executor routines via the query's EState.
The main goal of adding this new phase is, for now, to allow cached
cached generic plans containing scans of partitioned tables using
Append/MergeAppend to be executed more efficiently by the prep phase
doing any initial pruning, instead of deferring that to
ExecutorStart(). That may allow AcquireExecutorLocks() on the plan
to lock only only the minimal set of relations/partitions, that is
those whose subplans survive the initial pruning.
Implementation notes:
* To allow initial pruning to be done as part of the pre-execution
prep phase as opposed to as part of ExecutorStart(), this refactors
ExecCreatePartitionPruneState() and ExecFindInitialMatchingSubPlans()
to pass the information needed to do initial pruning directly as
parameters instead of getting that from the EState and the PlanState
of the parent Append/MergeAppend, both of which would not be
available in ExecutorPrep(). Another, sort of non-essential-to-this-
goal, refactoring this does is moving the partition pruning
initialization stanzas in ExecInitAppend() and ExecInitMergeAppend()
both of which contain the same cod into its own function
ExecInitPartitionPruning().
* To pass the ExecPrepOutput(s) created by the plancache module's
invocation of ExecutorPrep() to the callers of the module, which in
turn would pass them down to ExecutorStart(), CachedPlan gets a new
List field that stores those ExecPrepOutputs, containing one element
for each PlannedStmt also contained in the CachedPlan. The new list
is stored in a child context of the context containing the
PlannedStmts, though unlike the latter, it is reset on every
invocation of CheckCachedPlan(), which in turn calls ExecutorPrep()
with a new set of bound Params.
* AcquireExecutorLocks() is now made to loop over a bitmapset of RT
indexes, those of relations returned in ExecPrepOutput, instead of
over the whole range table. With initial pruning that is also done
as part of ExcecutorPrep(), only relations from non-pruned nodes of
the plan tree would get locked as a result of this new arrangement.
* PlannedStmt gets a new field usesPrepExecPruning that indicates
whether any of the nodes of the plan tree contain "initial" (or
"pre-execution") pruning steps, which saves ExecutorPrep() the
trouble of walking the plan tree only to find out whether that's
the case.
* PartitionPruneInfo nodes now explicitly stores whether the steps
contained in any of the individual PartitionedRelPruneInfos embedded
in it contain initial pruning steps (those that can be performed
during ExecutorPrep) and execution pruning steps (those that can only
be performed during ExecutorRun), as flags contains_initial_steps and
contains_exec_steps, respectively. In fact, the aforementioned
PlannedStmt field's value is a logical OR of the values of the former
across all PartitionPruneInfo nodes embedded in the plan tree.
* PlannedStmt also gets a bitmapset field to store the RT indexes of
all relation RTEs referenced in the query that is populated when
contructing the flat range table in setrefs.c, which effectively
contains all the relations that the planner must have locked. In the
case of a cached plan, AcquireExecutorLocks() must lock all of those
relations, except those whose subnodes get pruned as result of
ExecutorPrep().
* PlannedStmt gets yet another field numPlanNodes that records the
highest plan_node_id assigned to any of the node contained in the
tree, which serves as the size to use when allocating the
PlanPrepOutput array.
Maybe this should be more than one patch? Say:
0001 to add ExecutorPrep and the boilerplate,
0002 to teach plancache.c to use the new facility
Thoughts?
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v4-0001-Invent-a-new-executor-prep-phase.patch | application/octet-stream | 169.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2022-02-10 08:57:59 | Re: Database-level collation version tracking |
Previous Message | Dilip Kumar | 2022-02-10 08:09:48 | Re: decoupling table and index vacuum |