Re: Changed behavior in rewriteheap

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Erik Nordström <erik(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Changed behavior in rewriteheap
Date: 2024-11-22 13:02:08
Message-ID: CAEze2Wg6kz3AypbuvMbx5j5RgasvB6+ykgzRVfrhxsR_m6cCkQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 22 Nov 2024 at 09:11, Erik Nordström <erik(at)timescale(dot)com> wrote:
>
>
>
> On Fri, Nov 22, 2024 at 12:30 AM Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> wrote:
>>
>> On Thu, 21 Nov 2024, 17:18 Erik Nordström, <erik(at)timescale(dot)com> wrote:
>>>
>>> Hello,
>>>
>>> I've noticed a change in behavior of the heap rewrite functionality in PostgreSQL 17, used by, e.g., CLUSTER. I've been experimenting with the functionality to implement a way to merge partitions in TimescaleDB. I am using table_relation_copy_for_cluster() to write the data of several tables to a single merged table, and then I do a heap swap on one of the original tables while dropping the others. So, if you have 3 partitions and want to merge them to one, then I write all three partitions to a temporary heap, swap the new heap on partition 1 and then drop partitions 2 and 3.
>>>
>>> Now, this approach worked fine for PostgreSQL 15 and 16, but 17 introduced some changes that altered the behavior so that I only see data from one of the partitions after merge (the last one written).
[...]
>>> I've also tried recompiling PG17 with the rewriteheap.c file from PG16 and then it works again. I haven't yet been able to figure out exactly what is different but I will continue to try to narrow it down. In the meantime, maybe someone on the mailing list has some insight on what could be the issue and whether my approach is viable?
>>
>>
>> I think the origin of the issue is at bulk_write.c, in smgr_bulk_flush (specifically after line 282), which assumes the bulk write system is used exclusively on empty relations.
>>
>> If you use a separate pair of begin/end_heap_rewrite for every relation you merge into that heap, it'll re-initialize the bulk writer, which will thus overwrite the previous rewrites' pages. The pre-PG17 rewriteheap.c doesn't use that API, and thus isn't affected.
>>
>
>
> Yes, looks like it is that code that zeros out the initial pages if the write doesn't start at block 0. And it looks like I don't have access to the bulk write state to set pages_written to trick it to believe those pages have already been written. Maybe it is possible to add an option to the bulkwriter to skip zero-initialization? The comment for that code seems to indicate it isn't strictly necessary anyway.

I think I'd go with a patch like attached, where the bulk writer
registers that it started with .relsize pages in the relfork, and use
that for smgrextend() decisions. It now also records pages_written as
a separate but accurate value.

Side note: Interestingly, on a replica you would see all relation's
data in that table, or at least until the primary writes new FPIs for
those pages, as the smgrextend() which clears the pages isn't
WAL-logged and thus doesn't get transfered to a replica.

Kind regards,

Matthias van de Meent

Attachment Content-Type Size
v0-0001-Fix-data-loss-when-restarting-the-bulk_write-faci.patch application/octet-stream 3.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rahila Syed 2024-11-22 13:03:36 Re: Enhancing Memory Context Statistics Reporting
Previous Message vignesh C 2024-11-22 12:13:24 Re: Introduce XID age and inactive timeout based replication slot invalidation