Quick Links

Re: When to use PARTITION BY HASH?

From:	Oleksandr Shulgin <oleksandr(dot)shulgin(at)zalando(dot)de>
To:	Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
Cc:	pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject:	Re: When to use PARTITION BY HASH?
Date:	2020-06-08 08:40:15
Message-ID:	CACACo5SSz1YVDQSw1nVZ5AJmi23NXVaK4rammnzKMdN=j2rQZQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general pgsql-performance

On Sat, Jun 6, 2020 at 6:14 PM Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
wrote:

>
> Well lets take a step back here and look at the question, hash
> partitioning exists in Postgres, is it useful? While I appreciate the need
> to see a fact demonstrated, and generally avoiding argument by authority,
> it is true that many of the very smartest database people in the world
> conceived of, discussed, implemented and documented this feature for us.
> It stands to reason that it is useful, or it wouldn't exist. So maybe this
> is more about finding or needing better partitioning documentation.
>

Fair point.

I've found the original commit adding this feature in version 11:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1aba8e651ac3e37e1d2d875842de1e0ed22a651e
It says:

"Hash partitioning is useful when you want to partition a growing data
set evenly. This can be useful to keep table sizes reasonable, which
makes maintenance operations such as VACUUM faster, or to enable
partition-wise join."

It also includes a link to discussion, though that starts in the middle of
a long thread.
The original thread is here:
https://www.postgresql.org/message-id/flat/20170228233313.fc14d8b6.nagata%40sraoss.co.jp

However, these threads only argue about implementation details and it's not
easy to find a discussion of motivation for this particular partitioning
scheme support.
I guess it was quite obvious to the participants at that point already.

With hash partitioning you are not expected, in general, to end up with a
>> small number of partitions being accessed more heavily than the rest. So
>> your indexes will also not fit into memory.
>>
>
> Indexes are not (usually) constant time structures, they take more time
> the bigger they get. So partitioned indexes will be smaller, quicker to
> insert into, and quicker to vacuum, and also gain possible pruning
> advantages on query when you split them up. If the planner can, knowing
> the key, exclude all but one partition, it won't even look at the other
> tables, so if you hash partition by primary key, you reduce the search
> space to 1/N immediately.
>
> Indexes with high update activity also suffer from a problem called "index
> bloat" where spares "holes" get punched in the buckets of btree indexes
> from updates and delete deletes. These holes are minimized by vacuuming
> but the bigger the index gets, the harder that process is to maintain.
> Smaller indexes suffer less from index bloat, and remedying the situation
> is easier because you can reindex partitions independently of each other.
> Your not just reducing the query load to an Nth, you're reducing the
> maintenance load.
>

Thanks for taking your time to explain it in detail. Though I do not tend
to believe the insert/scan performance benefit is measurable without trying
it, I do see the benefits for maintenance.

I have the feeling that using a hash function to distribute rows simply
>> contradicts the basic assumption of when you would think of partitioning
>> your table at all: that is to make sure the most active part of the table
>> and indexes is small enough to be cached in memory.
>>
>
> I think you might be framing this with a specific data pattern in mind,
> not all data distributions have a "most active" or power law distribution
> of data.
>

I'm just referring to the first bullet-point in the docs:

"Query performance can be improved dramatically in certain situations,
particularly when most of the heavily accessed rows of the table are in a
single partition or a small number of partitions. The partitioning
substitutes for leading columns of indexes, reducing index size and making
it more likely that the heavily-used parts of the indexes fit in memory."

I think it does not apply to hash partitioning in the general case.

--
Alex

In response to

Re: When to use PARTITION BY HASH? at 2020-06-06 16:13:25 from Michel Pelletier

Responses

Re: When to use PARTITION BY HASH? at 2020-06-08 13:07:04 from Ron

Browse pgsql-general by date

	From	Date	Subject
Next Message	Daniel Verite	2020-06-08 09:05:02	Re: Concurrenctly running CREATE TEMP TABLE IF NOT EXISTS [...] AS [...]
Previous Message	Thomas Kellerer	2020-06-08 08:18:41	Re: checking existence of a table before updating its SERIAL

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Michaeldba@sqlexec.com	2020-06-08 09:50:35	Re: When to use PARTITION BY HASH?
Previous Message	David Rowley	2020-06-07 21:23:10	Re: When to use PARTITION BY HASH?