Re: Add LSN <-> time conversion functionality

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Ilya Kosmodemiansky <hydrobiont(at)gmail(dot)com>
Subject: Re: Add LSN <-> time conversion functionality
Date: 2024-08-09 00:39:24
Message-ID: 0c899f80-6c12-4ac8-a0d9-86e8aa6a0914@vondra.me
Lists: pgsql-hackers

On 8/8/24 20:59, Robert Haas wrote:
> On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> How could we do this? We have 1s precision, so we start with buckets for
>> each second. And we want to allow merging stuff nicely. The smallest
>> merges we could do would be 1s -> 2s -> 4s -> 8s -> ..., but let's say we do
>> 1s -> 10s -> 100s -> 1000s instead.
>>
>> So we start with 100x one-second buckets
>>
>> [A_0, A_1, ..., A_99] -> 100 x 1s buckets
>> [B_0, B_1, ..., B_99] -> 100 x 10s buckets
>> [C_0, C_1, ..., C_99] -> 100 x 100s buckets
>> [D_0, D_1, ..., D_99] -> 100 x 1000s buckets
>>
>> We start by adding data into A_k buckets. After filling all 100 of them,
>> we grab the oldest 10 buckets, and combine/move them into B_k. And so
>> on, until B gets full too. Then we grab the 10 oldest B_k entries,
>> and move them into C, and so on. For D the oldest entries would get
>> discarded, or we could add another layer with each bucket representing
>> 10k seconds.
>
> Yeah, this kind of algorithm makes sense to me, although as you say
> later, I don't think we need this amount of precision. I also think
> you're right to point out that this provides certain guaranteed
> behavior.
>
>> A-D is already enough to cover 30h, with A-E it'd be ~300h. Do we need
>> (or want) to keep a longer history?
>
> I think there is a difference of opinion about this between Melanie
> and me. I feel like we should be designing something that does the
> exact job we need done for the freezing stuff, and if anyone else can
> use it, that's a bonus. For that, I feel that 300h is more than
> plenty. The goal of the freezing stuff, roughly speaking, is to answer
> the question "will this be unfrozen real soon?". "Real soon" could
> arguably mean a minute or an hour, but I don't think it makes sense
> for it to be a week. If we're freezing data now that has a good chance
> of being unfrozen again within 7 days, we should just freeze it
> anyway. The cost of freezing isn't really all that high. If we keep
> freezing pages that are going to be unfrozen again within seconds or
> minutes, we pay those freezing costs enough times that they become
> material, but I have difficulty imagining that it ever matters if we
> re-freeze the same page every week. It's OK to be wrong as long as we
> aren't wrong too often, and I think that being wrong once per page per
> week isn't too often.
>
> But I think Melanie was hoping to create something more general, which
> on one level is understandable, but on the other hand it's unclear
> what the goals are exactly. If we limit our scope to specifically
> VACUUM, we can make reasonable guesses about how much precision we
> need and for how long. But a hypothetical other client of this
> infrastructure could need anything at all, which makes it very unclear
> what the best design is, IMHO.
>

I don't have a strong opinion on this. I agree with you that it's better
to have a good solution for the problem at hand than a poor solution for
hypothetical use cases. I don't have a clear idea what the other use
cases would be, which makes it hard to say what precision/history they
would need. But I also understand the wish to make this useful for a
wider set of use cases where possible; I'd try to do the same thing.

But I think a clear description of the precision guarantees would help to
achieve that (even if the algorithm ends up being different).

If the only argument ends up being about how precise it needs to be and
how much history we need to cover, I think that's fine because that's
just a matter of setting a couple config parameters.
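
Just to make that a bit more concrete, here's a rough sketch of the
bucket cascade I described above (in C, all names invented, untested,
just to pin down the guarantees):

#include "postgres.h"

#include "access/xlogdefs.h"        /* XLogRecPtr */
#include "datatype/timestamp.h"     /* TimestampTz */

#define LSNTIME_NBUCKETS    100     /* buckets per level */
#define LSNTIME_NLEVELS     4       /* 1s, 10s, 100s, 1000s */
#define LSNTIME_MERGE_STEP  10      /* buckets merged into the next level */

typedef struct LsnTimeBucket
{
    TimestampTz start_time;     /* timestamp at the start of the interval */
    XLogRecPtr  start_lsn;      /* LSN at the start of the interval */
} LsnTimeBucket;

typedef struct LsnTimeLevel
{
    int             nused;                      /* filled buckets */
    LsnTimeBucket   buckets[LSNTIME_NBUCKETS];  /* ordered oldest first */
} LsnTimeLevel;

/*
 * When a level fills up, fold its 10 oldest buckets into one bucket of the
 * next (10x coarser) level.  At the topmost level the oldest buckets are
 * simply discarded.  Levels A..D cover 100 + 1,000 + 10,000 + 100,000
 * seconds, i.e. about 30 hours; a fifth level gets us to roughly 300 hours.
 */
static void
lsntime_cascade(LsnTimeLevel *levels, int level)
{
    LsnTimeLevel *cur = &levels[level];

    if (cur->nused < LSNTIME_NBUCKETS)
        return;                     /* still has room, nothing to do */

    if (level + 1 < LSNTIME_NLEVELS)
    {
        LsnTimeLevel *next = &levels[level + 1];

        /* make room in the next level first (may recurse) */
        lsntime_cascade(levels, level + 1);

        /*
         * The merged bucket inherits the start LSN/time of the oldest
         * fine-grained bucket; it's newer than everything already in the
         * next level, so it goes at the end.
         */
        next->buckets[next->nused++] = cur->buckets[0];
    }

    /* drop the 10 oldest buckets from this level */
    memmove(&cur->buckets[0], &cur->buckets[LSNTIME_MERGE_STEP],
            sizeof(LsnTimeBucket) * (LSNTIME_NBUCKETS - LSNTIME_MERGE_STEP));
    cur->nused -= LSNTIME_MERGE_STEP;
}

With something like this, the precision/history trade-off comes down to
the two constants at the top, which is what I mean by it being just a
couple of config parameters.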

>> Isn't this a sign this does not quite fit into pgstats? Even if this
>> happens to deal with unsafe restarts, replica promotions and so on, what
>> if the user just does pg_stat_reset? That already causes trouble because
>> we simply forget deleted/updated/inserted tuples. If we also forget data
>> used for freezing heuristics, that does not seem great ...
>
> +1.
>
>> Wouldn't it be better to write this into WAL as part of a checkpoint (or
>> something like that?), and make bgwriter not only add the LSN/timestamp
>> into the stream, but also write it into WAL. It's already waking up; on
>> idle systems ~32B written to WAL does not matter, and on a busy system
>> it's just noise.
>
> I am not really sure of the best place to put this data. I agree that
> pgstat doesn't feel like quite the right place. But I'm not quite sure
> that putting it into every checkpoint is the right idea either.
>

Is there a reason not to make this another SLRU, just like we do for
commit_ts? I'm not saying it's perfect, but it's an approach we already
use to solve these issues.
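
To be clear, I'm imagining fixed-size records in SLRU pages, pretty much
like commit_ts stores XID -> timestamp. A trivial sketch of what an entry
might look like (names invented, definitely not a complete design):

#include "postgres.h"

#include "access/xlogdefs.h"        /* XLogRecPtr */
#include "datatype/timestamp.h"     /* TimestampTz */

/* one LSN/timestamp mapping, written (say) by bgwriter once per interval */
typedef struct LsnTimeMapEntry
{
    XLogRecPtr  lsn;            /* LSN at the end of the interval */
    TimestampTz time;           /* time the interval ended */
} LsnTimeMapEntry;

/*
 * Fixed-size entries per SLRU page, analogous to COMMIT_TS_XACTS_PER_PAGE
 * in commit_ts.c.  The slru.c machinery would take care of paging and
 * truncating old segments.
 */
#define LSNTIMEMAP_ENTRIES_PER_PAGE (BLCKSZ / sizeof(LsnTimeMapEntry))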

regards

--
Tomas Vondra
