Re: post-freeze damage control

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: David Steele <david(at)pgmasters(dot)net>, Tom Kincaid <tomjohnkincaid(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Stefan Fercot <stefan(dot)fercot(at)protonmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: post-freeze damage control
Date: 2024-04-13 11:02:03
Message-ID: b897e444-5896-4cc7-86e0-49bf41547996@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 4/13/24 01:23, David Steele wrote:
> On 4/12/24 22:27, Tomas Vondra wrote:
>>
>>
>> On 4/12/24 08:42, David Steele wrote:
>>> On 4/11/24 20:26, Tomas Vondra wrote:
>>>> On 4/11/24 03:52, David Steele wrote:
>>>>> On 4/11/24 10:23, Tom Kincaid wrote:
>>>>>>
>>>>>> The extensive Beta process we have can be used to build confidence we
>>>>>> need in a feature that has extensive review and currently has no
>>>>>> known
>>>>>> issues or outstanding objections.
>>>>>
>>>>> I did have objections, here [1] and here [2]. I think the complexity,
>>>>> space requirements, and likely performance issues involved in restores
>>>>> are going to be a real problem for users. Some of these can be
>>>>> addressed
>>>>> in future releases, but I can't escape the feeling that what we are
>>>>> releasing here is half-baked.
>>>>>
>>>> I do not think it's half-baked. I certainly agree there are
>>>> limitations,
>>>> and there's all kinds of bells and whistles we could add, but I think
>>>> the fundamental infrastructure is corrent and a meaningful step
>>>> forward.
>>>> Would I wish it to handle .tar for example? Sure I would. But I think
>>>> it's something we can add in the future - if we require all of this to
>>>> happen in a single release, it'll never happen.
>>>
>>> I'm not sure that I really buy this argument, anyway. It is not uncommon
>>> for significant features to spend years in development before they are
>>> committed. This feature went from first introduction to commit in just
>>> over six months. Obviously Robert had been working on it for a while,
>>> but for a feature this large six months is a sprint.
>>>
>>
>> Sure, but it's also not uncommon for significant features to be
>> developed incrementally, over multiple releases, introducing the basic
>> infrastructure first, and then expanding the capabilities later. I'd
>> cite logical decoding/replication and parallel query as examples of this
>> approach.
>>
>> It's possible there's some fundamental flaw in the WAL summarization?
>> Sure, I can't rule that out, although I find it unlikely. Could there be
>> bugs? Sure, that's possible, but that applies to all code.
>>
>> But it seems to me all the comments are about the client side, not about
>> the infrastructure. Which is fair, I certainly agree it'd be nice to
>> handle more use cases with less effort, but I still think the patch is a
>> meaningful step forward.
>
> Yes, my comments are all about the client code. I like the
> implementation of the WAL summarizer a lot. I don't think there is a
> fundamental flaw in the design, either, but I wouldn't be surprised if
> there are bugs. That's life in software development biz.
>

Agreed.

> Even for the summarizer, though, I do worry about the complexity of
> maintaining it over time. It seems like it would be very easy to
> introduce a bug and have it go unnoticed until it causes problems in the
> field. A lot of testing was done outside of the test suite for this
> feature and I'm not sure if we can rely on that focus with every release.
>

I'm not sure there's a simpler way to implement this. I haven't really
worked on that part (not until the CoW changes a couple weeks ago), but
I think Robert was very conscious of the complexity.

I don't think expect this code to change very often, but I agree it's
not great to rely on testing outside the regular regression test suite.
But I'm not sure how much more we can do, really - for example my
testing was very much "randomized stress testing" with a lot of data and
long runs, looking for unexpected stuff. That's not something we could
do in the usual regression tests, I think.

But if you have suggestions how to extend the testing ...

> For me an incremental approach would be to introduce the WAL summarizer
> first. There are already plenty of projects that do page-level
> incremental (WAL-G, pg_probackup, pgBackRest) and could help shake out
> the bugs. Then introduce the client tools later when they are more
> robust. Or, release the client tools now but mark them as experimental
> or something so people know that changes are coming and they don't get
> blindsided by that in the next release. Or, at the very least, make the
> caveats very clear so users can make an informed choice.
>

I don't think introducing just the summarizer, without any client tools,
would really work. How would we even test the summarizer, for example?
If the only users of that code are external tools, we'd do only some
very rudimentary tests. But the more complex tests would happen in the
external tools, which means it wouldn't be covered by cfbot, buildfarm
and so on. Considering the external tools are likely a bit behind, It's
not clear to me how I would do the stress testing, for example.

IMHO we should aim to have in-tree clients when possible, even if some
external tools can do more advanced stuff etc.

This however reminds me my question is the summarizer provides the right
interface(s) for the external tools. One option is to do pg_basebackup
and then parse the incremental files, but is that suitable for the
external tools, or should there be a more convenient way?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2024-04-13 11:07:14 Re: Add missing ConditionVariableCancelSleep() in slot.c
Previous Message Tomas Vondra 2024-04-13 10:18:08 Re: post-freeze damage control