Re: Declarative partitioning

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning
Date: 2016-02-18 05:41:07
Message-ID: 56C55973.80309@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2016/02/16 21:57, Robert Haas wrote:
> On Fri, Jan 15, 2016 at 5:48 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> If we have a CREATE statement for each partition, how do we generalize
>> that to partitions at different levels? For example, if we use something
>> like the following to create a partition of parent_name:
>>
>> CREATE PARTITION partition_name OF parent_name FOR VALUES ...
>> WITH ... TABLESPACE ...
>>
>> Do we then say:
>>
>> CREATE PARTITION subpartition_name OF partition_name ...
>>
>> to create a level 2 partition (sub-partition) of parent_name?
>
> Yes, exactly.
>
> Personally, I would be more inclined to make this a CREATE TABLE statement, like
>
> CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ...
> CREATE TABLE subpartition_name PARTITION OF partition_name FOR VALUES ...

I guess the first of which would actually be:

CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ...
PARTITION BY ...

Some might think that writing potentially the same PARTITION BY clause 100
times for 100 level-1 partitions could be cumbersome. That is what
SUBPARTITION BY notation may be good as a shorthand for.

> I think if you've got SUBPARTITION as a keyword in the syntax
> anywhere, you're doing it wrong. The toplevel object shouldn't really
> care whether its children are themselves partitioned or not.

This is fine for the internals. SUBPARTITION BY is mere syntax sugar. It
might as well be just cascaded PARTITION BYs. The point is to specify as
many of those with CREATE TABLE toplevel as the number of levels of
partitioning we want. That does however fix the number of levels in advance.

In the patch I have posted, here are some details of the tuple routing
implementation for instance - the top-level parent knows only of its
immediate partitions. Once a level-1 partition from that list is
determined using a tuple's level-1 key, the tuple is passed down to choose
one of its own partitions using the level-2 key. The key descriptor list
is associated with the top-level parent alone and the recursive code knows
to iterate the key list to apply nth key for nth level. The recursion
happens in the partition module.

Now there are also functions to let obtain, say *all* or only *leaf*
partitions (OIDs) of a top-level parent but they are used for certain DDL
scenarios. As far as DML is concerned, the level-at-a-time recursive
approach as described above is used. Queries not yet because the plan is a
flattened append of leaf partitions anyway.

If such notation convenience at the expense of loss of generality is not
worth it, I'm willing to go ahead and implement SUBPARTITION-less syntax.

CREATE TABLE toplevel() PARTITION BY
CREATE TABLE partition PARTITION OF toplevel FOR VALUES ... PARTITION BY
CREATE TABLE subpartition PARTITION OF partition FOR VALUES
ALTER TABLE partitioned ATTACH PARTITION name FOR VALUES ... USING TABLE
ALTER TABLE partitioned DETACH PARTITION name [ USING newname ]

While we are on the syntax story, how about FOR VALUES syntax for range
partitions (sorry for piggybacking it here in this message). At least some
people dislike LESS THAN notation. Corey Huinker says we should be using
range type literals for that. It's not clear though that using range type
literals directly is a way ahead. For one, range type API expects there to
exist a range type with given element type. Whereas all we require for
range partitioning proper is for column type to have a btree operator
class. Should we require it to have an associated range type as well?
Don't think that there exists an API to check that either. All in all,
range types are good to implement things in applications but not so much
within the backend (unless I'm missing something). I know reinventing the
wheel is disliked as well but perhaps we could offer something like the
following because Corey offered some examples which would help from the
flexibility:

START [ EXCL ] (startval) END [ INCL ] (endval)

That is, in range type notation, '[startval, endval)' is the default
behavior. So for each partition, there is at least the following pieces of
metadata:

Datum *startval;
bool startexcl;
Datum *endval;
bool endincl;

That requires quite some gymnastics during DDL (with no help from range
type infrastructure) to maintain the invariant that no two range
partitions overlap. Even gaps can result which are considered undesirable,
so maybe, the invariant to maintain would include no gaps in addition to
no overlap. That would have us looking for a definition of a "gap" for all
sorts of btree supporting data types.

Some people said something akin to interval partitioning would be good like:

PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) START
WITH value;

But that could be just a UI addition to the design where each partition
has [startval, endval) bounds. In any case, not a version 1 material I'd
think.

>> Do we want this at all? It seems difficult to generalize this to
>> multi-level hierarchy of more than 2 levels.
>
> It doesn't do anything for me. There may be somebody who wants it,
> but I don't see much value myself.

OK.

>> After thinking some more on this - I think that identical tuple
>> descriptors may not just be a nice-to-have but critical in some cases. For
>> example, consider built-in/trigger-less tuple routing. I'd imagine that
>> the partition to insert a tuple into would be determined just before
>> calling heap_insert() in ExecInsert() and CopyFrom(). That means the
>> HeapTuple that is passed to heap_insert() to insert into the partition
>> would be based on the root table's tuple descriptor. Note also that the
>> tuple would have passed through BR, IR triggers, constraints of the root
>> table. When the data is eventually queried from partitions directly, or
>> well even via the root table (considering existing executor capabilities),
>> partition's tuple descriptor at that point had better match the data that
>> went onto the disk. That means we had better keep at least the following
>> things in sync: number of attributes, name, position (attnum), type,
>> notnull-ness of individual attributes. So in order to do that, recursively
>> apply ADD/DROP COLUMN, SET WITH/WITHOUT OIDS, RENAME COLUMN, ALTER COLUMN
>> TYPE, SET/DROP NOT NULL on the root table to all the partitions and
>> prevent those sub-commands to be directly applied to any table
>> (partitions) in the partitioning hierarchy but the root. I further don't
>> see the point of allowing to set (or drop) column defaults in partitions
>> because now INSERT or COPY FROM cannot be directly applied to partitions.
>> Similar argument could be made for BR, IR triggers and CHECK constraints.
>> Am I missing something in all of this?
>
> Well, in the end there are basically two choices. Either tuple
> descriptors have to match exactly, and then you can reuse a tuple
> intended for one partition for some other partition without
> projection; or else they don't, and you need to project. I'm not sure
> that projection is expensive enough to make disallowing mismatched
> tuple descriptors a necessary design choice - and certainly that
> design choice is awkward from a UI standpoint, because we will
> sometimes not be able to attach a partition for a reason that the user
> can neither see in the \d output nor correct. But on the flip side,
> not having to worry about projection is nice.
>
>> An alternative to doing any of that very well may be to design
>> trigger-less tuple routing to be smarter about possible mismatch of the
>> tuple descriptors but I haven't given that a lot of thought. Is that
>> really an alternative worth looking into?
>
> Yes.

I see the trade-off. I agree that considering the significance for attach
partition case is quite important.

So, the tuple routing code should be ready to use projection if there
happens to be a partition with differing tuple descriptor. In the code I
posted, a ResultRelInfo is lazily built afresh for each inserted tuple in
ExecInsert's case and for each tuple where the chosen partition is
different from the previous tuple's in CopyFrom's case. One can feel that
there is a certain overhead to that approach for the bulk-loading case
(almost every CopyFrom invocation). Now if projection enters this picture,
we have to consider that too. Should we initialize ResultRelInfo's and
corresponding ProjectionInfo's for all partitions beforehand? Consider how
ModifyTable currently handles update on inheritance set, for example. That
would incur unnecessary overhead if only a single tuple is inserted. But
it would certainly help bulk-loading case. Am I missing something?

>> On one hand, I think to keep treating "partition hierarchies" as
>> "inheritance hierachies" might have some issues. I am afraid that
>> documented inheritance semantics may not be what we want to keep using for
>> the new partitioned tables. By that, I mean all the user-facing behaviors
>> where inheritance has some bearing. Should it also affect new partitioned
>> tables? Consider whether inheritance semantics would render infeasible
>> some of the things that we'd like to introduce for the new partitioned
>> tables such as automatic tuple routing, or keep us from improving planner
>> smarts and executor capabilities for partitioned tables over what we
>> already have.
>>
>> OTOH, I may be looking at it wrongly. We would not be required to enforce
>> user-facing inheritance behaviors on the new partitioned tables after all.
>> That is to say - it's just that new partitioned tables could still use
>> relevant inheritance infrastructure behind-the-scenes for planning,
>> execution and a few other things and not care about abiding by regular
>> inheritance semantics. I should just go ahead and add special cases in all
>> places where existing inheritance handling code stands to cause trouble
>> down the line for partitioned tables. We might want to mention that we do
>> so somewhere in documentation and also note that regular inheritance
>> semantics does not apply. While it sounds probably fine as implementation
>> for the feature released initially, a day will inevitably come when this
>> behind-the-scenes implementation will be changed to something more
>> amenable to better optimization. But that's for future...
>
> This seems pretty vague to me. I don't know that I have an opinion
> without a more specific list of questions.

Sorry for rather incoherent articulation in the previous message. Let me
try to say something that's perhaps more specific.

Consider that we create partitions as inheritance children, that is,
creating a partition using:

CREATE TABLE partition PARTITION OF parent FOR VALUES ...

is equivalent to saying

CREATE TABLE partition (...) INHERITS (parent)

except that the latter allows partition to have its own column
definitions, table constraints and multiple parents as long as things are
conflict-free. The former requires specifying partition bounding values.
The common thing between the two then is StoreCatalogInheritance(partOid,
parentOid) that will mark partition as inheritance child of parent in
pg_inherits. So, our new partitions are really inheritance children but
that's not apparent to users (where this last bit is important).

Then consider ALTER TABLE partition - should we need to handle it in way
different from existing inheritance code would do -

* Prevent something whereas regular inheritance wouldn't?
* Do something instead of/in addition to whatever regular inheritance does?

Consider adding a column to partition - regular inheritance wouldn't
prevent it because adding a column to child table doesn't get in the way
of how inheritance works. Now we can make it so that if table is a
partition created with CRATE TABLE PARTITION OF (which we can tell because
the table has pg_partition entry), we should *actively* prevent adding a
column. Inheritance already prevents dropping an inherited attribute in
child but says "cannot drop inherited column". IMO in a partition's case,
it should say "cannot drop columns of a partition".

OTOH, adding a column to parent propagates to all partitions by way of
ATSimpleRecursion(). But not everything that ATSimpleRecursion() does is
necessary for partitioned tables (which we can tell from relkind or maybe
because table has pg_partitioned_rel entry) - especially no need to
propagate column default and check constraint, etc. Because one cannot
apply insert/update on partitions anyway. Then because defaults and check
constraints on partitions are really useless, when a user adds them to a
partition directly, we should say, "cannot alter column default for a
partition" and "cannot add check constraint to a partition" with "instead
do that operation on the root parent" hint. But what is a "root parent"?
How to determine one for a given partition? Perhaps, it's not much of a
POLA violation a user must have done something to have ended up in the
error occurring, that is, created the table as partition of (partition of
...) something.

Thanks a lot for your time!

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-02-18 07:27:13 Re: WAL logging problem in 9.4.3?
Previous Message Michael Paquier 2016-02-18 04:14:55 Re: ALTER ROLE SET/RESET for multiple options