Re: Table Partitioning Feature

From: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-02-10 13:11:31
Message-ID: 8d79a95c0902100511m4c5fab4fg76342b8cd5c4bb9b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for your feedback, Emmanuel.
Here are my comments:

On 2/10/09, Emmanuel Cecchet <manu(at)frogthinker(dot)org> wrote:
> Hi Amit,
>
> I will be traveling until next Tuesday and will have no access to email so
> don't be surprised if I don't follow up this week.
> The overall approach seems sound. The metadata table should help also for
> DDL to find out overlapping ranges or duplicate list entries.

We are checking for overlaps in the partition keys before creating partitions.

> So far, I have not tried to use the SPI interface from a C trigger so I
> don't see any disadvantage yet. We would have to assess the performance to
> make sure it's not going to be a show stopper.
> I think that the main issue of the trigger approach is that other triggers
> might interfere. The 'partition trigger' must be the last of the 'before
> insert' triggers and if the destination (child) table has a trigger, we must
> ensure that this trigger is not going to require a new routing.
> Another issue is the result that will be returned by insert/copy commands
> if all tuples are moved to other tables, the result will be 0. We might want
> to have stats that would collect where tuples where moved for a particular
> command (I don't know yet what would be the best place to collect these
> stats but they could probably be updated by the trigger).

Row movements will be done by firing deletes and inserts. We will
investigte on how these stats can be maintained and displayed.

> Also would the trigger be attached to all tables in the hierarchy or only
> to the top parent?
> What kind of query would you use with more than 1 level of inheritance
> (e.g. parent=year, child=month, grand-child=day)? It looks like we have to
> parse the leaves of the graph but intermediate nodes would help accelerating
> the search.
>

We haven't yet planned for supporting multi-level partitioning.
However, the pg_partition table can be extented to store "partlevel"
column (to represent depth of partition from the root), and we should
just select the leaf level partitions in the SQL that finds target
partition.
(This is with the assumption that only leaf level partitions will have
the data.)

> An alternative approach (I haven't assessed the feasibility yet) would be
> to try to call the query planner. If we ask to select the partition value of
> the tuple, the query planner should return the table it is going to scan (as
> in EXPLAIN SELECT * FROM t WHERE key=$1).
>

That's a good idea. We will have to anyway write this code for planner
module to find relevant partitions for 'SELECT' queries.

Another question i have is - should we create a separate C file and
shared library for the partition trigger functions, or can we bundle
it with one of the existing libraries?

Thanks,
Amit

> Let me know what you think,
>
> Emmanuel
>
>
> > We are considering to following approach:
> > 1. metadata table pg_partitions is defined as follows:
> > CATALOG(pg_partitions,2336) BKI_WITHOUT_OIDS
> > {
> > Oid partrelid; // partition table Oid
> > Oid parentrelid; // Parent table Oid
> > int4 parttype; // Type of partition, list, hash, range
> > Oid partkey; // partition key Oid
> > Oid keytype; /// type of partition key.
> > int4 keyorder /// order of the key in multi-key partitions.
> > text min;
> > text max; // min and max for range parti
> > text[] list;
> > int hash; // hash value
> > } FormData_pg_partitions;
> >
> >
> > 2. C triggers will fire a query on this table to get the relevant
> > partition of the inserted/updated data using SPI interface. The query
> > will look something like (for range partitioning)
> >
> > select min(partrelid)
> > from pg_partitions
> > where parentrelid = 2934 // we know this value
> > and (
> > ( $1 between to_int(min ) and to_int(max) and
> > keyorder = 1) OR
> > ($2 between to_date (min) and to_date (max) and
> > keyorder =2 )
> > ....
> > )
> > group by
> > parentrelid
> > having
> > count(*) = <number of partition keys>
> >
> > $1, $2, ... are the placeholders of the actual partition key values of
> > trigger tuple.
> >
> > Since we know the type of partition keys, and the parentrelid, this
> > kind of query string can be saved in another table say, pg_part_map.
> > And its plan can be parsed once and saved in cache to be reused.
> > Do you see any issue with using SPI interface within triggers?
> >
> > The advantage of this kind of approah is that trigger code can be made
> > genric for any kind of partition table.
> >
> > Thanks,
> > Amit
> > Persistent Systems,
> > www.persistentsys.com
> >
> >
> >
> >
> >
> > On 1/23/09, Emmanuel Cecchet <manu(at)frogthinker(dot)org> wrote:
> >
> >
> > > Amit,
> > >
> > > You might want to put this on the
> > > http://wiki.postgresql.org/wiki/Table_partitioning wiki
> > > page.
> > > How does your timeline look like for this implementation?
> > > I would be happy to contribute C triggers to your implementation. From
> what
> > > I understood in
> > >
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00269.php,
> > > you already have an implementation that parses the grammar and generates
> > > rules as if someone had written them. Is this code available?
> > >
> > > Regarding the use of triggers to push/move data to partitions, what if
> > > someone declares triggers on partitions? Especially if you have
> > > subpartitions, let's consider the case where there is a trigger on the
> > > parent, child and grandchild. If I do an insert in the parent, the user
> > > trigger on the parent will be executed, then the partition trigger that
> > > decides to move to the grandchild. Are we going to bypass the child
> trigger?
> > > If we also want fast COPY operations on partitioned table, we could
> have an
> > > optimized implementation that could bypass triggers and move the tuple
> > > directly to the appropriate child table.
> > >
> > > Thanks for this big contribution,
> > > Emmanuel
> > >
> > >
> > >
> > >
> > > > Hi,
> > > >
> > > > We are implementing table partitioning feature to support
> > > > - the attached commands. The syntax conforms to most of the suggestion
> > > >
> > > >
> > > mentioned in
> > >
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php,
> > > barring the following:
> > >
> > >
> > > > -- Specification of partition names is optional. System will be able
> to
> > > >
> > > >
> > > generate partition names in such cases.
> > >
> > >
> > > > -- sub partitioning
> > > > We are using pgsql triggers to push/move data to appropriate
> partitions,
> > > >
> > > >
> > > but we will definitely consider moving to C language triggers as
> suggested
> > > by manu.
> > >
> > >
> > > > - Global non-partitioned indexes (that will extend all the
> partitions).
> > > > - Foreign key support for tables referring to partitioned tables.
> > > >
> > > > Please feel free to post your comments and suggestions.
> > > >
> > > > Thanks,
> > > > Amit
> > > > Persistent Systems
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> ------------------------------------------------------------------------
> > >
> > >
> > > >
> > > >
> > >
> >
> >
> >
>
>
> --
> Emmanuel Cecchet
> FTO @ Frog Thinker Open Source Development & Consulting
> --
> Web: http://www.frogthinker.org
> email: manu(at)frogthinker(dot)org
> Skype: emmanuel_cecchet
>
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2009-02-10 13:53:35 Re: temporarily stop autovacuum
Previous Message KaiGai Kohei 2009-02-10 12:08:27 Re: SE-PostgreSQL and row level security