From: | amul sul <sulamul(at)gmail(dot)com> |
---|---|
To: | Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [POC] hash partitioning |
Date: | 2017-03-02 13:03:42 |
Message-ID: | CAAJ_b94a=sVPBRDSQ6tKjFJEoRQpBdZZFBacZU0Q2+Z+xQuA_A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Mar 1, 2017 at 3:50 PM, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:
> [....]
>
> I Agree that it is unavoidable partitions number in modulo hashing,
> > but we can do in other hashing technique. Have you had thought about
> > Linear hashing[1] or Consistent hashing[2]? This will allow us to
> > add/drop
> > partition with minimal row moment.
>
> Thank you for your information of hash technique. I'll see them
> and try to allowing the number of partitions to be changed.
>
>
Thanks for showing interest, I was also talking about this with Robert Haas
and
hacking on this, here is what we came up with this.
If we want to introduce hash partitioning without syntax contort and minimal
movement while changing hash partitions (ADD-DROP/ATTACH-DETACH operation),
at start I thought we could pick up linear hashing, because of in both the
hashing we might need to move approx tot_num_of_tuple/tot_num_of_partitions
tuples at adding new partition and no row moment required at dropping
partitioning.
With further thinking and talking through the idea of using linear hashing
with my team, we realized that has some problems specially during pg_dump
and pg_upgrade. Both a regular pg_dump and the binary-upgrade version of
pg_dump which is used by pg_restore need to maintain the identity of the
partitions. We can't rely on things like OID order which may be unstable, or
name order which might not match the order in which partitions were added.
So
somehow the partition position would need to be specified explicitly.
So later we came up with some syntax like this (just fyi, this doesn't add
any new keywords):
create table foo (a integer, b text) partition by hash (a);
create table foo1 partition of foo with (modulus 4, remainder 0);
create table foo2 partition of foo with (modulus 8, remainder 1); --
legal, modulus doesn't need to match
create table foo3 partition of foo with (modulus 8, remainder 4); --
illegal, overlaps foo1
Here we need to enforce a rule that every modulus must be a factor of the
next
larger modulus. So, for example, if you have a bunch of partitions that all
have
modulus 5, you can add a new partition with modulus 10 or a new partition
with
modulus 15, but you cannot add both a partition with modulus 10 and a
partition
with modulus 15, because 10 is not a factor of 15. However, you could
simultaneously use modulus 4, modulus 8, modulus 16, and modulus 32 if you
wished, because each modulus is a factor of the next larger one. You could
also use modulus 10, modulus 20, and modulus 60. But you could not use
modulus
10, modulus 15, and modulus 60, because while both of the smaller module are
factors of 60, it is not true that each is a factor of the next.
Other advantages with this rule are:
1. Dropping (or detaching) and adding (or attaching) a partition can never
cause the rule to be violated.
2. We can easily build a tuple-routing data structure based on the largest
modulus.
For example: If the user has
partition 1 with (modulus 2, remainder 1),
partition 2 with (modulus 4, remainder 2),
partition 3 with (modulus 8, remainder 0) and
partition 4 with (modulus 8, remainder 4),
then we can build the following tuple routing array in the relcache:
== lookup table for hashvalue % 8 ==
0 => p3
1 => p1
2 => p2
3 => p1
4 => p4
5 => p1
6 => p2
7 => p1
3. It's also quite easy to test with a proposed new partition overlaps with
any
existing partition. Just build the mapping array and see if you ever end up
trying to assign a partition to a slot that's already been assigned to some
other partition.
We can still work on the proposed syntax - and I am open for suggestions.
One
more thought is to use FOR VALUES HAVING like:
CREATE TABLE foo1 PARTITION OF foo FOR VALUES HAVING (modulus 2, remainder
1);
But still more thoughts/inputs welcome here.
Attached patch implements former syntax, here is quick demonstration:
1.CREATE :
create table foo (a integer, b text) partition by hash (a);
create table foo1 partition of foo with (modulus 2, remainder 1);
create table foo2 partition of foo with (modulus 4, remainder 2);
create table foo3 partition of foo with (modulus 8, remainder 0);
create table foo4 partition of foo with (modulus 8, remainder 4);
2. Display parent table info:
postgres=# \d+ foo
Table "public.foo"
Column | Type | Collation | Nullable | Default | Storage | Stats
target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain |
|
b | text | | | | extended |
|
Partition key: HASH (a)
Partitions: foo1 WITH (modulus 2, remainder 1),
foo2 WITH (modulus 4, remainder 2),
foo3 WITH (modulus 8, remainder 0),
foo4 WITH (modulus 8, remainder 4)
3. Display child table info:
postgres=# \d+ foo1
Table "public.foo1"
Column | Type | Collation | Nullable | Default | Storage | Stats
target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain |
|
b | text | | | | extended |
|
Partition of: foo WITH (modulus 2, remainder 1)
4. INSERT:
postgres=# insert into foo select i, 'abc' from generate_series(1,10) i;
INSERT 0 10
postgres=# select tableoid::regclass as part, * from foo;
part | a | b
------+----+-----
foo1 | 3 | abc
foo1 | 4 | abc
foo1 | 7 | abc
foo1 | 10 | abc
foo2 | 1 | abc
foo2 | 2 | abc
foo2 | 9 | abc
foo3 | 6 | abc
foo4 | 5 | abc
foo4 | 8 | abc
(10 rows)
TODOs.
1. Maybe need some work in the CREATE TABLE .. PARTITION OF .. syntax.
2. Trim regression tests (if require).
3. Documentation
Thoughts/Comments?
Attachment | Content-Type | Size |
---|---|---|
hash-partitioning_another_design-v1.patch | application/octet-stream | 77.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Ashutosh Bapat | 2017-03-02 13:18:30 | Re: Print correct startup cost for the group aggregate. |
Previous Message | Robert Haas | 2017-03-02 12:48:20 | Re: Partitioned tables and relfilenode |