From: | Oleksandr Shulgin <oleksandr(dot)shulgin(at)zalando(dot)de> |
---|---|
To: | Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com> |
Cc: | pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org> |
Subject: | Re: When to use PARTITION BY HASH? |
Date: | 2020-06-08 08:40:15 |
Message-ID: | CACACo5SSz1YVDQSw1nVZ5AJmi23NXVaK4rammnzKMdN=j2rQZQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general pgsql-performance |
On Sat, Jun 6, 2020 at 6:14 PM Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
wrote:
>
> Well lets take a step back here and look at the question, hash
> partitioning exists in Postgres, is it useful? While I appreciate the need
> to see a fact demonstrated, and generally avoiding argument by authority,
> it is true that many of the very smartest database people in the world
> conceived of, discussed, implemented and documented this feature for us.
> It stands to reason that it is useful, or it wouldn't exist. So maybe this
> is more about finding or needing better partitioning documentation.
>
Fair point.
I've found the original commit adding this feature in version 11:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1aba8e651ac3e37e1d2d875842de1e0ed22a651e
It says:
"Hash partitioning is useful when you want to partition a growing data
set evenly. This can be useful to keep table sizes reasonable, which
makes maintenance operations such as VACUUM faster, or to enable
partition-wise join."
It also includes a link to discussion, though that starts in the middle of
a long thread.
The original thread is here:
https://www.postgresql.org/message-id/flat/20170228233313.fc14d8b6.nagata%40sraoss.co.jp
However, these threads only argue about implementation details and it's not
easy to find a discussion of motivation for this particular partitioning
scheme support.
I guess it was quite obvious to the participants at that point already.
With hash partitioning you are not expected, in general, to end up with a
>> small number of partitions being accessed more heavily than the rest. So
>> your indexes will also not fit into memory.
>>
>
> Indexes are not (usually) constant time structures, they take more time
> the bigger they get. So partitioned indexes will be smaller, quicker to
> insert into, and quicker to vacuum, and also gain possible pruning
> advantages on query when you split them up. If the planner can, knowing
> the key, exclude all but one partition, it won't even look at the other
> tables, so if you hash partition by primary key, you reduce the search
> space to 1/N immediately.
>
> Indexes with high update activity also suffer from a problem called "index
> bloat" where spares "holes" get punched in the buckets of btree indexes
> from updates and delete deletes. These holes are minimized by vacuuming
> but the bigger the index gets, the harder that process is to maintain.
> Smaller indexes suffer less from index bloat, and remedying the situation
> is easier because you can reindex partitions independently of each other.
> Your not just reducing the query load to an Nth, you're reducing the
> maintenance load.
>
Thanks for taking your time to explain it in detail. Though I do not tend
to believe the insert/scan performance benefit is measurable without trying
it, I do see the benefits for maintenance.
I have the feeling that using a hash function to distribute rows simply
>> contradicts the basic assumption of when you would think of partitioning
>> your table at all: that is to make sure the most active part of the table
>> and indexes is small enough to be cached in memory.
>>
>
> I think you might be framing this with a specific data pattern in mind,
> not all data distributions have a "most active" or power law distribution
> of data.
>
I'm just referring to the first bullet-point in the docs:
"Query performance can be improved dramatically in certain situations,
particularly when most of the heavily accessed rows of the table are in a
single partition or a small number of partitions. The partitioning
substitutes for leading columns of indexes, reducing index size and making
it more likely that the heavily-used parts of the indexes fit in memory."
I think it does not apply to hash partitioning in the general case.
--
Alex
From | Date | Subject | |
---|---|---|---|
Next Message | Daniel Verite | 2020-06-08 09:05:02 | Re: Concurrenctly running CREATE TEMP TABLE IF NOT EXISTS [...] AS [...] |
Previous Message | Thomas Kellerer | 2020-06-08 08:18:41 | Re: checking existence of a table before updating its SERIAL |
From | Date | Subject | |
---|---|---|---|
Next Message | Michaeldba@sqlexec.com | 2020-06-08 09:50:35 | Re: When to use PARTITION BY HASH? |
Previous Message | David Rowley | 2020-06-07 21:23:10 | Re: When to use PARTITION BY HASH? |