Re: Add LSN <-> time conversion functionality

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Ilya Kosmodemiansky <hydrobiont(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: Add LSN <-> time conversion functionality
Date: 2024-08-09 01:02:15
Message-ID: CAAKRu_YM0vLFxFL4EkBGoUvWqOroR7ja=k5A+fB=FZQCb-DvMg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 8/7/24 21:39, Melanie Plageman wrote:
> > On Wed, Aug 7, 2024 at 1:06 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>
> >> As I mentioned to you off-list, I feel like this needs some sort of
> >> recency bias. Certainly vacuum, and really almost any conceivable user
> >> of this facility, is going to care more about accurate answers for new
> >> data than for old data. If there's no recency bias, then I think that
> >> eventually answers for more recent LSNs will start to become less
> >> accurate, since they've got to share the data structure with more and
> >> more time from long ago. I don't think you've done anything about this
> >> in this version of the patch, but I might be wrong.
> >
> > That makes sense. This version of the patch set doesn't have a recency
> > bias implementation. I plan to work on it but will need to do the
> > testing like you mentioned.
> >
>
> I agree that we probably want more accurate results for recent data,
> so some recency bias makes sense - for example for the eager vacuuming
> that's definitely true.
>
> But this was initially presented as a somewhat universal LSN/timestamp
> mapping, and in that case it might make sense to minimize the average
> error - which I think is what lsntime_to_drop() currently does, by
> calculating the "area" etc.
>
> Maybe it'd be good to approach this from the opposite direction, say
> what "accuracy guarantees" we want to provide, and then design the
> structure / algorithm to ensure that. Otherwise we may end up in an
> infinite discussion about algorithms with no clear idea which one is
> the best choice.
>
> And I'm sure "users" of the LSN/Timestamp mapping may get confused about
> what to expect, without reasonably clear guarantees.
>
> For example, it seems to me a "good" accuracy guarantee would be:
>
> Given an LSN, the age implied by the returned timestamp is within 10%
> of the actual age. The timestamp precision is in seconds.
>
> This means that if LSN was written 100 seconds ago, it would be OK to
> get an answer in the 90-110 seconds range. For LSN from 1h ago, the
> acceptable range would be 3600s +/- 360s. And so on. The 10% is just
> arbitrary, maybe it should be lower - doesn't matter much.

I changed this patch a bit so that the SQL-callable functions only
provide ranges with an upper and lower bound. While the size of the
range provided could be part of our "accuracy guarantee", I'm not sure
we have to provide that.
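
For illustration, here is a minimal sketch of how such a range-returning
lookup could work over a sorted array of (LSN, time) samples -- the
names and the array layout here are made up, not taken from the patch:

    #include "postgres.h"
    #include "access/xlogdefs.h"    /* XLogRecPtr */
    #include "utils/timestamp.h"    /* TimestampTz */

    /* One captured (LSN, timestamp) sample; hypothetical layout. */
    typedef struct LSNTimePoint
    {
        XLogRecPtr  lsn;
        TimestampTz time;
    } LSNTimePoint;

    /*
     * Given a target LSN and an array of samples sorted by LSN, return
     * the timestamps of the two samples bracketing the target as a
     * [lower, upper] range instead of a single interpolated estimate.
     * Returns false if the target falls outside the sampled range.
     */
    static bool
    lsntime_range_for_lsn(const LSNTimePoint *points, int npoints,
                          XLogRecPtr target,
                          TimestampTz *lower, TimestampTz *upper)
    {
        for (int i = 0; i < npoints - 1; i++)
        {
            if (points[i].lsn <= target && target <= points[i + 1].lsn)
            {
                *lower = points[i].time;
                *upper = points[i + 1].time;
                return true;
            }
        }

        return false;
    }

A SQL-callable wrapper would then return the two timestamps as the lower
and upper bounds of the range rather than a single interpolated value.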

> How could we do this? We have 1s precision, so we start with buckets for
> each second. And we want to allow merging stuff nicely. The smallest
> merges we could do would be 1s -> 2s -> 4s -> 8s -> ..., but let's say
> we do 1s -> 10s -> 100s -> 1000s instead.
>
> So we start with 100x one-second buckets
>
> [A_0, A_1, ..., A_99] -> 100 x 1s buckets
> [B_0, B_1, ..., B_99] -> 100 x 10s buckets
> [C_0, C_1, ..., C_99] -> 100 x 100s buckets
> [D_0, D_1, ..., D_99] -> 100 x 1000s buckets
>
> We start by adding data into A_k buckets. After filling all 100 of them,
> we grab the oldest 10 buckets, and combine/move them into B_k. And so
> on, until B gets full too. Then we grab the 10 oldest B_k entries, and
> move them into C, and so on. For D the oldest entries would get
> discarded, or we could add another layer with each bucket representing
> 10k seconds.

I originally had an algorithm that stored old values somewhat like
this (each element stored 2x the logical members of the preceding
element). When I was testing algorithms, I abandoned this method
because it was less accurate than the method which calculates the
interpolation error "area". But that is to be expected -- it would be
less accurate for older values.
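
For concreteness, a rough sketch of the cascading bucket merge quoted
above -- the constants and struct names are placeholders, not code from
any patch version:

    #include "postgres.h"
    #include "access/xlogdefs.h"    /* XLogRecPtr */
    #include "utils/timestamp.h"    /* TimestampTz */

    #define BUCKETS_PER_LEVEL 100
    #define NUM_LEVELS        4     /* 1s, 10s, 100s, 1000s buckets */
    #define MERGE_FACTOR      10    /* buckets combined per promotion */

    typedef struct LSNTimeBucket
    {
        XLogRecPtr  start_lsn;      /* first LSN covered by this bucket */
        TimestampTz start_time;     /* time of that first sample */
    } LSNTimeBucket;

    typedef struct LSNTimeLevel
    {
        int           nused;
        LSNTimeBucket buckets[BUCKETS_PER_LEVEL];   /* oldest at index 0 */
    } LSNTimeLevel;

    /*
     * When a level fills up, fold its MERGE_FACTOR oldest buckets into
     * one coarser bucket and push that into the next level, recursing if
     * that level is full as well.  Data promoted off the last level is
     * discarded.
     */
    static void
    lsntime_promote(LSNTimeLevel *levels, int level)
    {
        LSNTimeLevel *cur = &levels[level];
        LSNTimeBucket merged;

        if (cur->nused < BUCKETS_PER_LEVEL)
            return;

        /* Make room in the next level first if it is also full. */
        if (level + 1 < NUM_LEVELS &&
            levels[level + 1].nused == BUCKETS_PER_LEVEL)
            lsntime_promote(levels, level + 1);

        /* The merged bucket keeps the oldest bucket's starting LSN/time. */
        merged = cur->buckets[0];

        /* Slide surviving buckets down to free space for new samples. */
        memmove(&cur->buckets[0], &cur->buckets[MERGE_FACTOR],
                (BUCKETS_PER_LEVEL - MERGE_FACTOR) * sizeof(LSNTimeBucket));
        cur->nused -= MERGE_FACTOR;

        if (level + 1 < NUM_LEVELS)
            levels[level + 1].buckets[levels[level + 1].nused++] = merged;
    }

The caller would append new samples to level 0 and call
lsntime_promote(levels, 0) whenever that level is full.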

I'm currently considering an algorithm that uses a combination of the
interpolation error and the age of the point. I'm thinking of adding
to or dividing the error of each point by "now - that point's time (or
lsn)". This would make it more likely that older points are dropped.

This is a bit different from "combining" buckets, but it seems like it
might allow us to drop unneeded recent points when they are very
regular.
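
A sketch of what such a drop cost could look like, reusing the
LSNTimePoint struct from the earlier sketch; the exact age weighting is
purely illustrative:

    #include "postgres.h"
    #include <math.h>

    /*
     * Cost of dropping point B from the sequence A -> B -> C: the area of
     * the triangle ABC in (lsn, time) space approximates how much
     * interpolation accuracy is lost if B is removed and lookups
     * interpolate between A and C.  Dividing by B's age makes older
     * points cheaper to drop than recent points with the same error.
     */
    static double
    lsntime_drop_cost(LSNTimePoint a, LSNTimePoint b, LSNTimePoint c,
                      TimestampTz now)
    {
        double      area;
        double      age_sec;

        /* The cross product gives twice the triangle area; halve it. */
        area = fabs((double) (b.lsn - a.lsn) * (double) (c.time - a.time) -
                    (double) (c.lsn - a.lsn) * (double) (b.time - a.time)) / 2.0;

        /* Age of B in seconds; TimestampTz is microseconds since epoch. */
        age_sec = (double) (now - b.time) / 1000000.0;

        /* Older points get a smaller cost and are therefore dropped first. */
        return area / Max(age_sec, 1.0);
    }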

> Isn't this a sign this does not quite fit into pgstats? Even if this
> happens to deal with unsafe restarts, replica promotions and so on, what
> if the user just does pg_stat_reset? That already causes trouble because
> we simply forget deleted/updated/inserted tuples. If we also forget data
> used for freezing heuristics, that does not seem great ...
>
> Wouldn't it be better to write this into WAL as part of a checkpoint (or
> something like that?), and make bgwriter not only add the LSN/timestamp
> into the stream, but also write it into WAL. It's already waking up; on
> idle systems ~32B written to WAL does not matter, and on a busy system
> it's just noise.

I was imagining adding a new type of WAL record that contains just the
LSN and time and writing it out in bgwriter. Is that not what you are
thinking?
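
To make that concrete, a minimal sketch of what logging such a record
from bgwriter might look like -- RM_LSNTIME_ID and XLOG_LSN_TIME are
placeholders for a resource manager and record type that do not exist
today:

    #include "postgres.h"
    #include "access/xloginsert.h"
    #include "utils/timestamp.h"

    /* Placeholder info value for the new record type. */
    #define XLOG_LSN_TIME   0x00

    /*
     * Hypothetical record payload: just the capture time.  The LSN half
     * of the pair is the record's own insert location.
     */
    typedef struct xl_lsn_time
    {
        TimestampTz time;
    } xl_lsn_time;

    /* Something bgwriter could call from its main loop. */
    static XLogRecPtr
    LogLSNTime(void)
    {
        xl_lsn_time rec;

        rec.time = GetCurrentTimestamp();

        XLogBeginInsert();
        XLogRegisterData((char *) &rec, sizeof(rec));

        /* RM_LSNTIME_ID is a placeholder; a real patch needs a real rmgr. */
        return XLogInsert(RM_LSNTIME_ID, XLOG_LSN_TIME);
    }

The LSN half of the pair would simply be the record's own insert
location returned by XLogInsert().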

- Melanie
