Re: Add LSN <-> time conversion functionality

From: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add LSN <-> time conversion functionality
Date: 2024-08-01 10:37:05
Message-ID: 334BD6DF-35D6-40E8-BF37-45360955E2CD@yandex-team.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

This is a copy of my message for pgsql-hackers mailing list. Unfortunately original message was rejected due to one of recipients addresses is blocked.

> On 1 Aug 2024, at 10:54, Andrey M. Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
>
>
>> On 1 Aug 2024, at 05:44, Melanie Plageman <melanieplageman(at)gmail(dot)com> wrote:
>>
>> Thanks for the review! v6 attached.
>>
>> On Sat, Jul 6, 2024 at 1:36 PM Andrey M. Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>>>
>>> PgStartLSN = GetXLogInsertRecPtr();
>>> Should this be kind of RecoveryEndPtr? How is it expected to behave on Standby in HA cluster, which was doing a crash recovery of 1y WALs in a day, then is in startup for a year as a Hot Standby, and then is promoted?
>>
>> So, I don't think we will allow use of the LSNTimeStream on a standby,
>> since it is unclear what it would mean on a standby. For example, do
>> you want to know the time the LSN was generated or the time it was
>> replayed? Note that bgwriter won't insert values to the time stream on
>> a standby (it explicitly checks).
>
> Yes, I mentioned Standby because PgStartLSN is not what it says it is.
>
>>
>> But, you bring up an issue that I don't quite know what to do about.
>> If the standby doesn't have an LSNTimeStream, then when it is
>> promoted, LSN <-> time conversions of LSNs and times before the
>> promotion seem impossible. Maybe if the stats file is getting written
>> out at checkpoints, we could restore from that previous primary's file
>> after promotion?
>
> I’m afraid that clocks of a Primary from previous timeline might be not in sync with ours.
> It’s OK if it causes error, we just need to be prepared when they indicate values from future. Perhaps, by shifting their last point to our “PgStartLSN”.
>
>>
>> This brings up a second issue, which is that, currently, bgwriter
>> won't insert into the time stream when wal_level is minimal. I'm not
>> sure exactly how to resolve it because I am relying on the "last
>> important rec pointer" and the LOG_SNAPSHOT_INTERVAL_MS to throttle
>> when the bgwriter actually inserts new records into the LSNTimeStream.
>> I could come up with a different timeout interval for updating the
>> time stream. Perhaps I should do that?
>
> IDK. My knowledge of bgwriter is not enough to give a meaningful advise here.
>
>>
>>> lsn_ts_calculate_error_area() is prone to overflow. Even int64 does not seem capable to accommodate LSN*time. And the function may return negative result, despite claiming area as a result. It’s intended, but a little misleading.
>>
>> Ah, great point. I've fixed this.
>
> Well, not exactly. Result of lsn_ts_calculate_error_area() is still fabs()’ed upon usage. I’d recommend to fabs in function.
> BTW lsn_ts_calculate_error_area() have no prototype.
>
> Also, I’m not a big fan of using IEEE 754 float in this function. This data type have 24 bits of significand bits.
> Consider that current timestamp has 50 binary digits. Let’s assume realistic LSNs have same 50 bits.
> Then our rounding error is 2^76 byte*microseconds.
> Let’s assume we are interested to measure time on a scale of 1GB WAL records.
> This gives us rounding error of 2^46 microseconds = 2^26 seconds = 64 million seconds = 2 years.
> Seems like a gross error.
>
> If we use IEEE 754 doubles we have 53 significand bytes. And rounding error will be on a scale of 128 microseconds per GB, which is acceptable.
>
> So I think double is better than float here.
>
> Nitpicking, but I’d prefer to sum up (triangle2 + triangle3 + rectangle_part) before subtracting. This can save a bit of precision (smaller figures can have lesser exponent).
>
>
>>>> On 29 Jun 2024, at 03:09, Melanie Plageman <melanieplageman(at)gmail(dot)com> wrote:
>>>> change the user-facing functions for estimating an
>>>> LSN/time conversion to instead return a floor and a ceiling -- instead
>>>> of linearly interpolating a guess. This would be a way to keep users
>>>> from misunderstanding the accuracy of the functions to translate LSN
>>>> <-> time.
>>>
>>> I think this is a good idea. And it covers well “server restart problem”. If API just returns -inf as a boundary, caller can correctly interpret this situation.
>>
>> Providing "ceiling" and "floor" user functions is still a TODO for me,
>> however, I think that the patch mostly does handle server restarts.
>>
>> In the event of a restart, the cumulative stats system will have
>> persisted our time stream, so the LSNTimeStream will just be read back
>> in with the rest of the stats. I've added logic to ensure that if the
>> PgStartLSN is newer than our oldest LSNTimeStream entry, we use the
>> oldest entry instead of PgStartLSN when doing conversions LSN <->
>> time.
>>
>> As for a crash, stats do not persist crashes, but I think Michael's
>> patch will go in to write out the stats file at checkpoints, and then
>> this should be good enough.
>>
>> Is there anything else you think that is an issue with restarts?
>
> Nope, looks good to me.
>
>>
>>> Thanks! Looking forward to more timely freezing.
>>
>> Thanks! I'll be posting a new version of the opportunistic freezing
>> patch that uses the time stream quite soon, so I hope you'll take a
>> look at that as well!
>
> Great! Thank you!
> Besides your TODOs and my nitpicking this patch series looks RfC to me.
>
> I have to address some review comments on my patches, then I hope I’ll switch to reviewing opportunistic freezing.
>
>
> Best regards, Andrey Borodin.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Matthias van de Meent 2024-08-01 10:45:16 Re: PG17beta2: SMGR: inconsistent type for nblocks
Previous Message Peter Eisentraut 2024-08-01 10:32:14 Re: make pg_createsubscriber option names more consistent