Re: Partitioned tables and relfilenode

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partitioned tables and relfilenode
Date: 2017-03-13 10:24:32
Message-ID: a361f6c9-7262-1ad7-02a8-81014e36dde6@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017/03/10 17:57, Amit Langote wrote:
> On 2017/03/08 22:36, Robert Haas wrote:
>> On Wed, Mar 8, 2017 at 5:36 AM, Amit Langote wrote:
>>>> - rel = mtstate->resultRelInfo->ri_RelationDesc;
>>>> + nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table);
>>>> + nominalRel = heap_open(nominalRTE->relid, NoLock);
>>>>
>>>> No lock?
>>>
>>> Hmm, I think I missed that a partitioned parent table would not be locked
>>> by the *executor* at all after applying this patch. As of now, InitPlan()
>>> takes care of locking all the result relations in the
>>> PlannedStmt.resultRelations list. This patch removes partitioned
>>> parent(s) from appearing in this list. But that makes me wonder - aren't
>>> the locks taken by expand_inherited_rtentry() kept long enough? Why does
>>> InitPlan need to acquire them again? I see this comment in
>>> expand_inherited_rtentry:
>>
>> Parse-time locks, plan-time locks, and execution-time locks are all
>> separate. See the header comments for AcquireRewriteLocks in
>> rewriteHandler.c; think about a view, where the parsed query has been
>> stored and is later rewritten and planned. Similarly, a plan can be
>> reused even after the transaction that created that plan has
>> committed; see AcquireExecutorLocks in plancache.c.
>
> Thanks for those references.
>
> I took a step back here and thought a bit more about the implications this
> patch. It occurred to me that the complete absence of partitioned table
> RT entries in the plan-tree has more consequences than I originally
> imagined. I will post an updated patch by Monday latest.

Here is the updated patch.

Since this patch proposes to avoid creating scan nodes for non-leaf tables
in a partition tree, they won't be referenced anywhere in the resulting
plan tree. So the executor will not lock those tables in the
select/update/delete cases. Insert is different since we lock all tables
in the partition tree when setting up tuple-routing in
ExecInitModifyTable. Not taking executor locks on the tables means that
the cached plans that should be invalidated upon adding/removing a
partition somewhere in the partition tree won't be.

First I thought that we could remember just the root table RT index using
a new Index field partitionRoot in Append, MergeAppend, and ModifyTable
nodes and use it to locate and lock the root table during executor node
initialization. But soon realized that that's not enough, because it
ignores the fact that adding/removing partitions at lower levels does not
require taking a lock on the root table; only the immediate parent. So
any cached select/update/delete plans won't be invalidated when a new
lower-level partition is added/removed, because the immediate parent would
not have been added to the query's range table and hence the
PlannedStmt.relationOids list. Remember that the latter is used by
plancache.c to remember the relations that a given cached plan depends on
remaining unchanged. So the patch now adds a list member called
partitioned_rels to Append, MergeAppend, and ModifyTable nodes and stores
the RT indexes of all the non-leaf tables in a partition tree with root
table RT index at the head (note that these RT indexes are of the RTEs
added by expand_inherited_rtenrty; also see below). Since the
partitioned_rels list is constructed when building paths and must be
propagated to the plan nodes, the same field is also present in the
corresponding Path nodes. ExecInit* routines for the aforementioned nodes
now locate and lock the non-leaf tables using the RT indexes in
partitioned_rels. Leaf tables are locked, as before, either by InitPlan
(update/delete result relations case) or by ExecInitAppend or
ExecInitMergeAppend when initializing the appendplans/mergeplans (select
case).

The previous proposal was for expand_inherited_rtentry to not create RT
entries and AppendRelInfo's for the non-leaf tables, but I think that
doesn't work, as I tried to explain above. We need RTEs because that
seems to be the only way currently for informing the executor of the
non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
RTEs for the latter planning steps to collect them in partitioned_rels
mentioned above. So with the latest patch, we do create the RT entry and
AppendRelInfo for non-leaf tables. AppendRelInfo created in this case is
a minimal one; only parent_relid and child_relid are valid. To make the
latter planning steps ignore these minimal AppendRelInfo's, every
AppendRelInfo is now marked with child_relkind. Only
set_append_rel_pathlist() and inheritance_planner() process them to
collect the child_relid into the partitioned_rels list to be stored in
AppendPath/MergeAppendPath and ModifyTablePath, respectively.

Thoughts?

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2017-03-13 10:30:21 Re: Partitioned tables and relfilenode
Previous Message Anastasia Lubennikova 2017-03-13 10:01:08 Re: IF NOT EXISTS option for CREATE SERVER and CREATE USER MAPPING statements