Re: Add LSN <-> time conversion functionality

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Ilya Kosmodemiansky <hydrobiont(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: Add LSN <-> time conversion functionality
Date: 2024-08-08 18:34:40
Message-ID: 45db8b62-12a6-45e1-90cc-2bf5c141df72@vondra.me
Lists: pgsql-hackers

On 8/7/24 21:39, Melanie Plageman wrote:
> On Wed, Aug 7, 2024 at 1:06 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> As I mentioned to you off-list, I feel like this needs some sort of
>> recency bias. Certainly vacuum, and really almost any conceivable user
>> of this facility, is going to care more about accurate answers for new
>> data than for old data. If there's no recency bias, then I think that
>> eventually answers for more recent LSNs will start to become less
>> accurate, since they've got to share the data structure with more and
>> more time from long ago. I don't think you've done anything about this
>> in this version of the patch, but I might be wrong.
>
> That makes sense. This version of the patch set doesn't have a recency
> bias implementation. I plan to work on it but will need to do the
> testing like you mentioned.
>

I agree we probably want more accurate results for recent data, so some
recency bias makes sense - for the eager vacuuming that's definitely
true, for example.

But this was initially presented as a somewhat universal LSN/timestamp
mapping, and in that case it might make sense to minimize the average
error - which I think is what lsntime_to_drop() currently does, by
calculating the "area" etc.

Maybe it'd be good to approach this from the opposite direction, say
what "accuracy guarantees" we want to provide, and then design the
structure / algorithm to ensure that. Otherwise we may end up with an
infinite discussion about algorithms, with no clear idea which one is
the best choice.

And without reasonably clear guarantees, I'm sure "users" of the
LSN/timestamp mapping may get confused about what to expect.

For example, it seems to me a "good" accuracy guarantee would be:

Given an LSN, the age of the returned timestamp is less than 10% off
the actual age. The timestamp precision is in seconds.

This means that if the LSN was written 100 seconds ago, it would be OK to
get an answer in the 90-110 seconds range. For LSN from 1h ago, the
acceptable range would be 3600s +/- 360s. And so on. The 10% is just
arbitrary, maybe it should be lower - doesn't matter much.
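
Expressed as code, the check might look like this (just a sketch, the
function name and types are made up):

#include <stdint.h>
#include <stdbool.h>

/*
 * Does the estimated age satisfy the "within 10% of actual age"
 * guarantee? Ages are in seconds, matching the 1s precision.
 */
static bool
age_within_guarantee(int64_t actual_age, int64_t estimated_age)
{
    int64_t error = actual_age - estimated_age;

    if (error < 0)
        error = -error;

    /* e.g. actual age 100s allows estimates in the 90-110s range */
    return error * 10 <= actual_age;
}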

How could we do this? We have 1s precision, so we start with buckets for
each second. And we want to allow merging buckets nicely. The smallest
merge steps would be 1s -> 2s -> 4s -> 8s -> ..., but let's say we do
1s -> 10s -> 100s -> 1000s instead.

So we'd have four arrays of 100 buckets each:

[A_0, A_1, ..., A_99] -> 100 x 1s buckets
[B_0, B_1, ..., B_99] -> 100 x 10s buckets
[C_0, C_1, ..., C_99] -> 100 x 100s buckets
[D_0, D_1, ..., D_99] -> 100 x 1000s buckets
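
In C, that might look something like this - just a sketch, the names
are made up, and the patch's actual LSNTimeStream (a single 64-entry
array) is laid out differently:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "datatype/timestamp.h"

#define LSNTIME_NBUCKETS 100

/* one (time, LSN) boundary, 16B like the current patch's entries */
typedef struct LSNTimeBucket
{
    TimestampTz time;
    XLogRecPtr  lsn;
} LSNTimeBucket;

typedef struct LSNTimeStream
{
    LSNTimeBucket a[LSNTIME_NBUCKETS];  /* 1s buckets */
    LSNTimeBucket b[LSNTIME_NBUCKETS];  /* 10s buckets */
    LSNTimeBucket c[LSNTIME_NBUCKETS];  /* 100s buckets */
    LSNTimeBucket d[LSNTIME_NBUCKETS];  /* 1000s buckets */
    int           a_used;
    int           b_used;
    int           c_used;
    int           d_used;
} LSNTimeStream;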

We start by adding data into the A_k buckets. After filling all 100 of
them, we grab the 10 oldest buckets and combine/move them into B_k. And
so on, until B gets full too. Then we grab the 10 oldest B_k entries and
move them into C, and so on. For D the oldest entries would get
discarded, or we could add another layer (E) with each bucket
representing 10k seconds.
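
Continuing the sketch above, the insert/merge logic might look like
this (again illustrative only - e.g. it simply keeps the oldest
boundary when merging 10 buckets into one):

static void
fold_oldest(LSNTimeBucket *src, int *src_used,
            LSNTimeBucket *dst, int *dst_used)
{
    /* the merged bucket keeps the oldest boundary */
    LSNTimeBucket merged = src[0];

    /* drop the 10 oldest source buckets, shifting the rest down */
    memmove(src, src + 10, (*src_used - 10) * sizeof(LSNTimeBucket));
    *src_used -= 10;

    dst[(*dst_used)++] = merged;
}

static void
lsntime_insert(LSNTimeStream *s, TimestampTz time, XLogRecPtr lsn)
{
    /* make room, cascading A -> B -> C -> D as needed */
    if (s->a_used == LSNTIME_NBUCKETS)
    {
        if (s->b_used == LSNTIME_NBUCKETS)
        {
            if (s->c_used == LSNTIME_NBUCKETS)
            {
                /* D full: discard its oldest bucket */
                if (s->d_used == LSNTIME_NBUCKETS)
                {
                    memmove(s->d, s->d + 1,
                            (s->d_used - 1) * sizeof(LSNTimeBucket));
                    s->d_used--;
                }
                fold_oldest(s->c, &s->c_used, s->d, &s->d_used);
            }
            fold_oldest(s->b, &s->b_used, s->c, &s->c_used);
        }
        fold_oldest(s->a, &s->a_used, s->b, &s->b_used);
    }

    s->a[s->a_used].time = time;
    s->a[s->a_used].lsn = lsn;
    s->a_used++;
}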

A-D is already enough to cover 30h, with A-E it'd be ~300h. Do we need
(or want) to keep a longer history?

These arrays are larger than what the current patch uses, of course.
That has 64 x 16B entries, so 1kB. These arrays need ~6kB - but I'm
pretty sure that could be made more compact by growing the buckets more
slowly. With 10x it's just simpler to think about, and also - 6kB seems
pretty decent.

Note: I just realized the patch sets LOG_STREAM_INTERVAL_MS to 30s, so
1s accuracy seems like overkill, and the structure could be much smaller.

>> One way to make the standby more accurately mimic the primary would be
>> to base entries on the timestamp-LSN data that is already present in
>> the WAL, i.e. {COMMIT|ABORT} [PREPARED] records. If you only added or
>> updated entries on the primary when logging those records, the standby
>> could redo exactly what the primary did. A disadvantage of this
>> approach is that if there are no commits for a while then your mapping
>> gets out of date, but that might be something we could just tolerate.
>> Another possible solution is to log the changes you make on the
>> primary and have the standby replay those changes. Perhaps I'm wrong
>> to advocate for such solutions, but it feels error-prone to have one
>> algorithm for the primary and a different algorithm for the standby.
>> You now basically have two things that can break and you have to debug
>> what went wrong instead of just one.
>
> Your point about maintaining two different systems for creating the
> time stream being error prone makes sense. Honestly logging the
> contents of the LSNTimeStream seems like it will be the simplest to
> maintain and understand. I was a bit apprehensive to WAL log one part
> of a single stats structure (since the other stats aren't logged), but
> I think explaining why that's done is easier than explaining separate
> LSNTimeStream creation code for replicas.
>

Isn't this a sign this does not quite fit into pgstats? Even if this
happens to deal with unsafe restarts, replica promotions and so on, what
if the user just does pg_stat_reset? That already causes trouble because
we simply forget deleted/updated/inserted tuples. If we also forget data
used for freezing heuristics, that does not seem great ...

Wouldn't it be better to write this into WAL as part of a checkpoint (or
something like that), and make bgwriter not only add the LSN/timestamp
pair into the stream, but also write it into WAL? It's already waking up
anyway; on idle systems ~32B written to WAL does not matter, and on a
busy system it's just noise.
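
Something like this, maybe (the record type is entirely made up -
XLOG_LSNTIME_STREAM does not exist, and a real version would need a
proper info flag in xlog_internal.h plus a redo routine):

#include "postgres.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "utils/timestamp.h"

/* hypothetical info flag for RM_XLOG_ID, for illustration only */
#define XLOG_LSNTIME_STREAM 0xE0

typedef struct xl_lsntime_entry
{
    TimestampTz time;
    XLogRecPtr  lsn;
} xl_lsntime_entry;

static void
log_lsntime_entry(void)
{
    xl_lsntime_entry entry;

    entry.time = GetCurrentTimestamp();
    entry.lsn = GetXLogInsertRecPtr();

    /* ~16B payload plus record header - noise on a busy system */
    XLogBeginInsert();
    XLogRegisterData((char *) &entry, sizeof(entry));
    (void) XLogInsert(RM_XLOG_ID, XLOG_LSNTIME_STREAM);
}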

regards

--
Tomas Vondra
