Re: Add LSN <-> time conversion functionality

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
Cc: Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Ilya Kosmodemiansky <hydrobiont(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: Add LSN <-> time conversion functionality
Date: 2024-08-07 16:30:53
Message-ID: CAAKRu_ad2DsWjt+h2kV43Nwe1+wm5T__VrvXeuZHwV05f6iPjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached v7 changes the SQL-callable functions to return ranges of
LSNs and times covering the target time or LSN instead of linearly
interpolating an approximate answer.

I also changed the frequency and conditions under which the background
writer updates the global LSNTimeStream. There is now a dedicated
interval at which the LSNTimeStream is updated (instead of reusing the
log standby snapshot interval).

I also found that it is incorrect to set PgStartLSN to the insert LSN
in PostmasterMain(). The XLog buffer cache is not guaranteed to be
initialized in time. Instead of trying to provide an LSN lower bound
for locating times before those recorded on the global LSNTimeStream,
I simply return a lower bound of InvalidXLogRecPtr. Similarly, I
provide a lower bound of -infinity when locating LSNs before those
recorded on the global LSNTimeStream.

On Thu, Aug 1, 2024 at 3:55 AM Andrey M. Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
> > On 1 Aug 2024, at 05:44, Melanie Plageman <melanieplageman(at)gmail(dot)com> wrote:
> >
> > On Sat, Jul 6, 2024 at 1:36 PM Andrey M. Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> >>
> >> PgStartLSN = GetXLogInsertRecPtr();
> >> Should this be kind of RecoveryEndPtr? How is it expected to behave on Standby in HA cluster, which was doing a crash recovery of 1y WALs in a day, then is in startup for a year as a Hot Standby, and then is promoted?
> >
> > So, I don't think we will allow use of the LSNTimeStream on a standby,
> > since it is unclear what it would mean on a standby. For example, do
> > you want to know the time the LSN was generated or the time it was
> > replayed? Note that bgwriter won't insert values to the time stream on
> > a standby (it explicitly checks).
>
> Yes, I mentioned Standby because PgStartLSN is not what it says it is.

Right, I've found another way of dealing with this since PgStartLSN
was incorrect.

> > But, you bring up an issue that I don't quite know what to do about.
> > If the standby doesn't have an LSNTimeStream, then when it is
> > promoted, LSN <-> time conversions of LSNs and times before the
> > promotion seem impossible. Maybe if the stats file is getting written
> > out at checkpoints, we could restore from that previous primary's file
> > after promotion?
>
> I’m afraid that clocks of a Primary from previous timeline might be not in sync with ours.
> It’s OK if it causes error, we just need to be prepared when they indicate values from future. Perhaps, by shifting their last point to our “PgStartLSN”.

Regarding a standby being promoted. I plan to make a version of the
LSNTimeStream functions which works on a standby by using
getRecordTimestamp() and inserts an LSN from the last record replayed
and the associated timestamp. That should mean the LSNTimeStream on
the standby is roughly the same as the one on the primary since those
records were inserted on the primary.

As for time going backwards in general, I've now made it so that
values are only inserted if the times are monotonically increasing and
the LSN is the same or increasing. This should handle time going
backwards, either on the primary itself or after a standby is promoted
if the timeline wasn't a perfect match.

> > This brings up a second issue, which is that, currently, bgwriter
> > won't insert into the time stream when wal_level is minimal. I'm not
> > sure exactly how to resolve it because I am relying on the "last
> > important rec pointer" and the LOG_SNAPSHOT_INTERVAL_MS to throttle
> > when the bgwriter actually inserts new records into the LSNTimeStream.
> > I could come up with a different timeout interval for updating the
> > time stream. Perhaps I should do that?
>
> IDK. My knowledge of bgwriter is not enough to give a meaningful advise here.

See my note at top of the email.

> >> lsn_ts_calculate_error_area() is prone to overflow. Even int64 does not seem capable to accommodate LSN*time. And the function may return negative result, despite claiming area as a result. It’s intended, but a little misleading.
> >
> > Ah, great point. I've fixed this.
>
> Well, not exactly. Result of lsn_ts_calculate_error_area() is still fabs()’ed upon usage. I’d recommend to fabs in function.
> BTW lsn_ts_calculate_error_area() have no prototype.
>
> Also, I’m not a big fan of using IEEE 754 float in this function. This data type have 24 bits of significand bits.
> Consider that current timestamp has 50 binary digits. Let’s assume realistic LSNs have same 50 bits.
> Then our rounding error is 2^76 byte*microseconds.
> Let’s assume we are interested to measure time on a scale of 1GB WAL records.
> This gives us rounding error of 2^46 microseconds = 2^26 seconds = 64 million seconds = 2 years.
> Seems like a gross error.
>
> If we use IEEE 754 doubles we have 53 significand bytes. And rounding error will be on a scale of 128 microseconds per GB, which is acceptable.
>
> So I think double is better than float here.
>
> Nitpicking, but I’d prefer to sum up (triangle2 + triangle3 + rectangle_part) before subtracting. This can save a bit of precision (smaller figures can have lesser exponent).

Okay, thanks for the detail. See what you think about v7.

Some perf testing of bgwriter updates are still a todo. I was thinking
that it might be bad to take an exclusive lock on the WAL stats data
structure for the entire time I am inserting a new value to the
LSNTimeStream. I was thinking maybe I should take a share lock and
calculate which element to drop first and then take the exclusive
lock? Or maybe I should make a separate lock for just the stream
member of PgStat_WalStats. Maybe it isn't worth it? I'm not sure.

- Melanie

Attachment Content-Type Size
v7-0002-Add-global-LSNTimeStream-to-PgStat_WalStats.patch text/x-patch 6.4 KB
v7-0003-Add-time-LSN-translation-range-functions.patch text/x-patch 12.6 KB
v7-0001-Add-LSNTimeStream-API-for-converting-LSN-time.patch text/x-patch 16.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amul Sul 2024-08-07 17:05:00 Re: pg_verifybackup: TAR format backup verification
Previous Message Robert Haas 2024-08-07 16:30:18 Re: pgsql: Introduce hash_search_with_hash_value() function