From: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Claudio Freire <klaussfreire(at)gmail(dot)com> |
Subject: | Partitioning: issues/ideas (Was: Re: On partitioning) |
Date: | 2015-01-15 02:07:43 |
Message-ID: | 54B720EF.7040503@lab.ntt.co.jp |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 06-01-2015 PM 03:40, Amit Langote wrote:
>
> I agree that while we are discussing these points, we could also be
> discussing how we solve problems of existing partitioning implementation
> using whatever the above things end up being. Proposed approaches to
> solve those problems might be useful to drive the first step as well or
> perhaps that's how it should be done anyway.
>
I realize the discussion has not quite brought us to *conclusions* so
far though surely there has been valuable input from people. Anyway,
starting a new thread with the summary of what has been (please note
that the order of listing the points does not necessarily connote the
priority):
* It has been repeatedly pointed out that we may want to decouple
partitioning from inheritance because implementing partitioning as an
extension of inheritance mechanism means that we have to keep all the
existing semantics which might limit what we want to do with the special
case of it which is partitioning; in other words, we would find
ourselves in difficult position where we have to inject a special case
code into a very generalized mechanism that is inheritance.
Specifically, do we regard a partitions as pg_inherits children of its
partitioning parent?
* Syntax: do we want to make it similar to one of the many other
databases out there? Or we could invent our own? I like the syntax that
Robert suggested that covers the cases of RANGE and LIST partitioning
without actually having to use those keywords explicitly; something like
the following:
CREATE TABLE parent PARTITION ON (column [ USING opclass ] [, ... ]);
CREATE TABLE child PARTITION OF parent_name
FOR VALUES { (value, ...) [ TO (value, ...) ] }
So instead of making a hard distinction between range and list
partitioning, you can say:
CREATE TABLE child_name PARTITION OF parent_name FOR VALUES (3, 5, 7);
wherein, child is effectively a LIST partition
CREATE TABLE child PARTITION OF parent_name FOR VALUES (8) TO (12);
wherein, child is effectively a RANGE partition on one column
CREATE TABLE child PARTITION OF parent_name FOR VALUES(20, 120) TO (30,
130);
wherein, child is effectively a RANGE partition on two columns
I wonder if we could add a clause like DISTRIBUTED BY to complement
PARTITION ON that represents a hash distributed/partitioned table (that
could be a syntax to support sharded tables maybe; we would definitely
want to move ahead in that direction I guess)
* Catalog: We would like to have a catalog structure suitable to
implement capabilities like multi-column partitioning, sub-partitioning
(with arbitrary number of levels in the hierarchy). I had suggested
that we create two new catalogs viz. pg_partitioned_rel,
pg_partition_def to store metadata about a partition key of a
partitioned relation and partition bound info of a partition,
respectively. Also, see the point about on-disk representation of
partition bounds
* It is desirable to treat partitions as pg_class relations with perhaps
a new relkind(s). We may want to choose an implementation where only the
lowest level relations in a partitioning hierarchy have storage; those
at the upper layers are mere placeholder relations though of course with
associated constraints determined by partitioning criteria (with
appropriate metadata entered into the additional catalogs). I am not
quite sure if each kind of the relations involved in the partitioning
scheme have separate namespaces and, if they are, how we implement that
* In the initial implementation, we could just live with partitioning on
a set of columns (and not arbitrary expressions of them)
* We perhaps do not need multi-column LIST partitions as they are not
very widely used and may complicate the implementation
* There are a number of suggestions about how we represent partition
bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
associated with the relation itself), etc. Important point to consider
here may be that partition key may contain more than one column
* How we represent partition definition in memory (for a given
partitioned relation) - important point to remember is that such a
representation should be efficient to iterate through or
binary-searchable. Also see the points about tuple-routing and
partition-pruning
* Overflow/catchall partition: it seems we do not want/need them. It
might seem desirable for example in cases where a big transaction enters
a large number of tuples all but one of which find a defined partition;
we may not want to error out in such case but instead enter that erring
tuple into the overflow partition instead. If we choose to implement
that, we would like to also implement the capability to move the tuples
into the appropriate partition once it's defined. Related is the notion
of automatically creating partitions if one is not already defined for a
just entered tuple; but there may be locking troubles if many concurrent
sessions try to do that
* Tuple-routing: based on the internal representation of partition
bounds for the partitions of a given partitioned table, there should be
a way to map a just entered tuple to partition id it belongs to. Below
mentioned BRIN-like machinery could be made to work
* Partition-pruning: again, based on the internal representation of
partition bounds for the partitions of a given partitioned table, there
should be a way to prune partitions deemed unnecessary per scan quals.
One notable suggestion is to consider BRIN (-like) machinery. For
example, it is able to tell from the scan quals whether a particular
block range of a given heap needs to be scanned or not based on summary
info index tuple for the block range. Though, the interface is currently
suitable to cover a single heap with blocks in range 0 to N-1 of that
heap. What we are looking for here is a hypothetical PartitionMemTuple
(or PartitionBound) that is a summary of a whole relation (in this case,
the partition) NOT a block range. But I guess the infrastructure is
generalized enough that we could make that work. Related then would be
an equivalent of ScanKey for the partitioning case. Just as ScanKeyData
has correspondence with the index being used, the hypothetical
PartitionScanKeyData (which may be an entirely bad/half-baked idea!)
would represent the application of comparison operator between table
column (partitioning key column) and a constant (as per quals).
Please help bridge the gap in my understanding of these points. I hope
we can put the discussion on a concrete footing so that it leads to a
way towards implementation sooner than later. Some points need more
immediate attention as we would like to first tackle the issue of
partition metadata. Reusing existing infrastructure should be encouraged
with obvious enhancements as we find fit. I am beginning to feel there
is a need to prototype a good enough solution that incorporates the
suggestions that have been already provided or will be provided. It may
be the only way forward though I think it definitely worthwhile to spend
some time to arrive at such a set of good enough ideas on various aspects.
Thanks,
Amit
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2015-01-15 02:27:01 | Re: Out-of-bounds write and incorrect detection of trigger file in pg_standby |
Previous Message | Andres Freund | 2015-01-15 02:03:35 | Overhauling our interrupt handling (was Escaping from blocked send() reprised.) |