Re: Commit Timestamp and LSN Inversion issue

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Jan Wieck <jan(at)wi3ck(dot)info>
Subject: Re: Commit Timestamp and LSN Inversion issue
Date: 2024-11-12 12:00:08
Message-ID: 2c754a6c-478f-4a9e-9f3f-4ea6d8db5188@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/12/24 04:05, Amit Kapila wrote:
> On Mon, Nov 11, 2024 at 9:05 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> On 11/11/24 09:19, Amit Kapila wrote:
>>>
>>> I can't think of a solution other than the current proposal where we
>>> do both the operations (reserve WAL space for commit and adjust
>>> commit_timestamp, if required) together to resolve LSN<->Timestamp
>>> invertion issue. Please let us know if you have any other ideas for
>>> solving this. Even, if we want to change ReserveXLogInsertLocation to
>>> make the locked portion an atomic operation, we can still do it for
>>> records other than commit. Now, if we don't have any other solution
>>> for LSN<->Timestamp inversion issue, changing the entire locked
>>> portion to atomic will close the door to solving this problem forever
>>> unless we have some other way to solve it which can make it difficult
>>> to rely on commit_timestamps for certain scenarios.
>>>
>>
>> I don't know what the solution is, isn't the problem that
>>
>> (a) we record both values (LSN and timestamp) during commit
>>
>> (b) reading timestamp from system clock can be quite expensive
>>
>> It seems to me that if either of those two issues disappeared, we
>> wouldn't have such an issue.
>>
>> For example, imagine getting a timestamp is super cheap - just reading
>> and updating a simple counter from shmem, just like we do for the LSN.
>> Wouldn't that solve the problem?
>>
>
> Yeah, this is exactly what I thought.
>
>> For example, let's say XLogCtlInsert has two fields
>>
>> int64 CommitTimestamp;
>>
>> and that ReserveXLogInsertLocation() also does this for each commit:
>>
>> commit_timestamp = Insert->commit_timestamp++;
>>
>> while holding the insertpos_lock. Now we have the LSN and timestamp
>> perfectly correlated.
>>
>
> Right, and the patch sent by Hou-San [1] (based on the original patch
> by Jan) is somewhat on these lines. The idea you have shared or
> implemented by the patch is a logical clock stored in shared memory.
> So, what the patch does is that if the time recorded by the current
> commit record is lesser than or equal to the logical clock (which
> means after we record time in the commit code path and before we
> reserve the LSN, there happens a concurrent transaction), we shall
> increment the value of logical clock by one and use that as commit
> time.
>
> So, in most cases, we need to perform one additional "if check" and
> "an assignment" under spinlock, and that too only for the commit
> record. In some cases, when inversion happens, there will be "one
> increment" and "two assignments."
>

In a way, yes. Except that each backend first sets the commit timestamp
the usual way too. And then it also has to recalculate the checksum of
the record - which happens under lwlock and not the main spinlock, but
it still seems it might be quite expensive.

I wonder how often we'd need to do this - it seems to me that for high
throughput systems we might easily get into a situation where we have to
"correct" the timestamp and recalculate the checksum (so calculating it
twice) for most of the commits.

Would be good to see some numbers from experiments - how often this can
happen, what's the impact on throughput etc.

Ofc, you may argue that this would only affect people with the hook, and
those people may be OK with the impact. And in a way I agree. But I also
understand the concerns expressed by Andres, that adding a hook here may
be problematic / easy to get wrong, and then we'd be the ones holding
the brown bag ...

>> Of course, this timestamp would be completely
>> detached from "normal" timestamps, it'd just be a sequence. But we could
>> also once in a while "set" the timestamp to GetCurrentTimestamp(), or
>> something like that. For example, there could be a "ticker" process,
>> doing that every 1000ms, or 100ms, or whatever we think is enough.
>>
>> Or maybe it could be driven by the commit itself, say when some timeout
>> expires, we assign too many items since the last commit, ...
>>
>
> If we follow the patch, we don't need this additional stuff. Isn't
> what is proposed [1] quite similar to your idea? If so, we can save
> this extra maintenance of commit timestamps.
>

Not sure. But I think we already have some of the necessary parts in
commit_ts. It'd need some improvements, but it might also be a good
place to adjust the timestamps. The inversion can only happens within a
small group of recent transactions (due to NUM_XLOGINSERT_LOCKS), so it
shouldn't be hard to keep a small buffer for those XIDs. Ofc, subxacts
would complicate stuff, and it'd mean the timestamps written to WAL and
to commit_ts would differ. Not great.

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2024-11-12 12:07:48 Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY
Previous Message ISHAN CHHANGANI 2024-11-12 11:28:24 Extract constants from EXECUTE queries