Re: Many thousands of partitions

From: "Gabriel E(dot) Sánchez Martínez" <gabrielesanchez(at)gmail(dot)com>
To: Grzegorz Tańczyk <goliatus(at)polzone(dot)pl>, pgsql-general(at)postgresql(dot)org
Subject: Re: Many thousands of partitions
Date: 2013-10-08 22:47:16
Message-ID: 52548B74.7030801@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


On 10/08/2013 11:23 AM, Grzegorz Tańczyk wrote:
> Hello,
>
> I have question regarding one of caveats from docs:
> http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html
>
> "Partitioning using these techniques will work well with up to perhaps
> a hundred partitions; don't try to use many thousands of partitions."
>
> What's the alternative? Nested partitioning could do the trick? I have
> milions of rows(numbers, timestamps and text(<4kb), which are
> frequently updated and there are also frequent inserts. Partitioning
> was my first thought about solution of this problem. I want to avoid
> long lasting locks, index rebuild problems and neverending vacuum.
> Write performance may be low if at the same time I will have no
> problem selecting single rows using primary key(bigint).Partitioning
> seems to be the solution, but I'm sure I will end up with several
> thousands of automatically generated partitions.
I had the same question a short while back and actually tested some of
it. My test data set was on an unpartitioned table having about 432
million rows and taking about 46 GB. In my case the data partitions
nicely by date. The table had about a year of data, spread over two
years, and I was curious to see how much partioning could improve query
speeds vs the unpartitioned table with an index on a timestamp field. I
tested partitioning by month (24 partitions) and also by day (over 700
partitions), in both cases using non-overlapping CHECK constraints to
enable constraint exclusion. Both partitioning schemes worked
considerably faster than the single table (on the order of 200 ms vs
2500 ms), and were not significantly far from each other in
performance. Since more partitions translates to more management hasle,
I concluded that partitioning by month was the best option.

I also tested the partitioning idea with a script that did a partition
per day for 20 years. Most of the tables were empty because I used the
same data for this test. I found that CREATE and DROP took very long,
while SELECT queries (on over 7000 thousand of partitions) were still
faster than on the single table but a bit slower than the schemes with
only a few partitions. With more complex SELECT queries the performance
might be much worse.

I also tested the nested partitioning idea on partitions by day over two
years, the idea being that since CHECK constraints are inherited the
query planner should be able to exclude sets of tables at once, making
thousands of partitions feasible. I was not able to detect a
significant change in the time it took to SELECT on the tables for two
years, so I tested it with tables for 20 and 30 years. If nested
partitioning worked, having thousands of partitions would not
appreciably slow down the query, but it did a bit, so it is clear that
the query planner still scans all partitions down the inheritance tree.
Perhaps this is because one can override the default behavior of
inheriting CHECK constraints, so a child table does not necesarilly
observe a parent's CHECK. I suppose that with a bit of bookkeeping the
query planner could exploit nested partitioning, but that is not how it
currently behaves. Maybe this could be considered for a future
performance feature of PostgreSQL, but it would only be useful in big
data applications with several TB of data archival.

I was not using INSERT triggers to distribute data across partitions.
Instead I left that logic to a Java application that relied on the
naming convention for tables. I imagine that things can slow down
considerably as the number of partitions increases if you are inserting
data with triggers.

Partioning seems to be a good idea if a single table would be too big to
fit in your server's file cache, and also for management, since you can
drop partitions of old data and create new ones without having to
reindex and lock. Does your data partition nicely by date, for
example? If most of the inserts are new data and old data is
read-mostly, then partitioning may make sense because you would not have
to reindex old partitions. In fact, you could very well not have an
index on the hot, write-mostly partition of, say, the current month,
until the write activity on that table diminishes, which would make
inserts faster. If, on the other hand, your writes are scattered across
many partitions, a single large table with an index may be a better
solution.

>
> Thanks
>
> --
> Regards,
> Grzegorz

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Steve Crawford 2013-10-08 23:10:50 Re: ERROR: invalid value "????" for "YYYY"
Previous Message Rowan Collins 2013-10-08 22:07:25 Re: ERROR: invalid value "????" for "YYYY"