Re: Changed behavior in rewriteheap

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Erik Nordström <erik(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Changed behavior in rewriteheap
Date: 2024-11-21 23:30:14
Message-ID: CAEze2Wj3t_N_i=vNm3yHB0TLtgsieYx0_uSMqLqN2Y8qAAbJkw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 21 Nov 2024, 17:18 Erik Nordström, <erik(at)timescale(dot)com> wrote:

> Hello,
>
> I've noticed a change in behavior of the heap rewrite functionality in
> PostgreSQL 17, used by, e.g., CLUSTER. I've been experimenting with the
> functionality to implement a way to merge partitions in TimescaleDB. I am
> using table_relation_copy_for_cluster() to write the data of several tables
> to a single merged table, and then I do a heap swap on one of the original
> tables while dropping the others. So, if you have 3 partitions and want to
> merge them to one, then I write all three partitions to a temporary heap,
> swap the new heap on partition 1 and then drop partitions 2 and 3.
>
> Now, this approach worked fine for PostgreSQL 15 and 16, but 17 introduced
> some changes that altered the behavior so that I only see data from one of
> the partitions after merge (the last one written).
>
> The commit that I think is responsible is the following:
> https://github.com/postgres/postgres/commit/8af256524893987a3e534c6578dd60edfb782a77
>
> Now, I realize that this is not a problem for PostgreSQL itself since the
> rewrite functionality isn't used for the purpose I am using it. To my
> defense, the rewrite code seems to imply that it should be possible to
> write more data to an existing heap according to this comment in
> begin_heap_rewrite: /* new_heap needn't be empty, just locked */.
>
> I've also tried recompiling PG17 with the rewriteheap.c file from PG16 and
> then it works again. I haven't yet been able to figure out exactly what is
> different but I will continue to try to narrow it down. In the meantime,
> maybe someone on the mailing list has some insight on what could be the
> issue and whether my approach is viable?
>

I think the origin of the issue is at bulk_write.c, in smgr_bulk_flush
(specifically after line 282), which assumes the bulk write system is used
exclusively on empty relations.

If you use a separate pair of begin/end_heap_rewrite for every relation you
merge into that heap, it'll re-initialize the bulk writer, which will thus
overwrite the previous rewrites' pages. The pre-PG17 rewriteheap.c doesn't
use that API, and thus isn't affected.

I've CC-ed Heikki as author of that patch; maybe a new API to indicate bulk
writes into an existing fork would make sense, to solve Timescale's issue
and fix rewriteheap's behavior?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Paul Jungwirth 2024-11-22 00:30:06 Re: SQL:2011 application time
Previous Message Giacchino, Luca 2024-11-21 23:27:33 SIMD optimization for list_sort