Re: When to use PARTITION BY HASH?

From: Oleksandr Shulgin <oleksandr(dot)shulgin(at)zalando(dot)de>
To: Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: When to use PARTITION BY HASH?
Date: 2020-06-08 08:40:15
Message-ID: CACACo5SSz1YVDQSw1nVZ5AJmi23NXVaK4rammnzKMdN=j2rQZQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-performance

On Sat, Jun 6, 2020 at 6:14 PM Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
wrote:

>
> Well lets take a step back here and look at the question, hash
> partitioning exists in Postgres, is it useful? While I appreciate the need
> to see a fact demonstrated, and generally avoiding argument by authority,
> it is true that many of the very smartest database people in the world
> conceived of, discussed, implemented and documented this feature for us.
> It stands to reason that it is useful, or it wouldn't exist. So maybe this
> is more about finding or needing better partitioning documentation.
>

Fair point.

I've found the original commit adding this feature in version 11:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1aba8e651ac3e37e1d2d875842de1e0ed22a651e
It says:

"Hash partitioning is useful when you want to partition a growing data
set evenly. This can be useful to keep table sizes reasonable, which
makes maintenance operations such as VACUUM faster, or to enable
partition-wise join."

It also includes a link to discussion, though that starts in the middle of
a long thread.
The original thread is here:
https://www.postgresql.org/message-id/flat/20170228233313.fc14d8b6.nagata%40sraoss.co.jp

However, these threads only argue about implementation details and it's not
easy to find a discussion of motivation for this particular partitioning
scheme support.
I guess it was quite obvious to the participants at that point already.

With hash partitioning you are not expected, in general, to end up with a
>> small number of partitions being accessed more heavily than the rest. So
>> your indexes will also not fit into memory.
>>
>
> Indexes are not (usually) constant time structures, they take more time
> the bigger they get. So partitioned indexes will be smaller, quicker to
> insert into, and quicker to vacuum, and also gain possible pruning
> advantages on query when you split them up. If the planner can, knowing
> the key, exclude all but one partition, it won't even look at the other
> tables, so if you hash partition by primary key, you reduce the search
> space to 1/N immediately.
>
> Indexes with high update activity also suffer from a problem called "index
> bloat" where spares "holes" get punched in the buckets of btree indexes
> from updates and delete deletes. These holes are minimized by vacuuming
> but the bigger the index gets, the harder that process is to maintain.
> Smaller indexes suffer less from index bloat, and remedying the situation
> is easier because you can reindex partitions independently of each other.
> Your not just reducing the query load to an Nth, you're reducing the
> maintenance load.
>

Thanks for taking your time to explain it in detail. Though I do not tend
to believe the insert/scan performance benefit is measurable without trying
it, I do see the benefits for maintenance.

I have the feeling that using a hash function to distribute rows simply
>> contradicts the basic assumption of when you would think of partitioning
>> your table at all: that is to make sure the most active part of the table
>> and indexes is small enough to be cached in memory.
>>
>
> I think you might be framing this with a specific data pattern in mind,
> not all data distributions have a "most active" or power law distribution
> of data.
>

I'm just referring to the first bullet-point in the docs:

"Query performance can be improved dramatically in certain situations,
particularly when most of the heavily accessed rows of the table are in a
single partition or a small number of partitions. The partitioning
substitutes for leading columns of indexes, reducing index size and making
it more likely that the heavily-used parts of the indexes fit in memory."

I think it does not apply to hash partitioning in the general case.

--
Alex

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Daniel Verite 2020-06-08 09:05:02 Re: Concurrenctly running CREATE TEMP TABLE IF NOT EXISTS [...] AS [...]
Previous Message Thomas Kellerer 2020-06-08 08:18:41 Re: checking existence of a table before updating its SERIAL

Browse pgsql-performance by date

  From Date Subject
Next Message Michaeldba@sqlexec.com 2020-06-08 09:50:35 Re: When to use PARTITION BY HASH?
Previous Message David Rowley 2020-06-07 21:23:10 Re: When to use PARTITION BY HASH?