Re: [RFC] Lock-free XLog Reservation from WAL

From: "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Japin Li <japinli(at)hotmail(dot)com>
Cc: wenhui qiu <qiuwenhuifx(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] Lock-free XLog Reservation from WAL
Date: 2025-01-27 09:30:28
Message-ID: 7a9e16e2-1e88-4329-aa7d-c2919f7cd727@intel.com

On 1/26/2025 10:59 PM, Yura Sokolov wrote:
> 24.01.2025 12:07, Japin Li пишет:
>> On Thu, 23 Jan 2025 at 21:44, Japin Li <japinli(at)hotmail(dot)com> wrote:
>>> On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
>>> wrote:
>>>> 23.01.2025 11:46, Japin Li пишет:
>>>>> On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli(at)hotmail(dot)com> wrote:
>>>>>> On Wed, 22 Jan 2025 at 17:02, Yura Sokolov
>>>>>> <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
>>>>>>> I believe, I know why it happens: I was in hurry making v2 by
>>>>>>> cherry-picking internal version. I reverted some changes in
>>>>>>> CalcCuckooPositions manually and forgot to add modulo
>>>>>>> PREV_LINKS_HASH_CAPA.
>>>>>>>
>>>>>>> Here's the fix:
>>>>>>>
>>>>>>>           pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
>>>>>>> -       pos->pos[1] = pos->pos[0] + 1;
>>>>>>> +       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
>>>>>>>           pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
>>>>>>> -       pos->pos[3] = pos->pos[2] + 2;
>>>>>>> +       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
>>>>>>>
>>>>>>> Any way, here's v3:
>>>>>>> - excess file "v0-0001-Increase..." removed. I believe it was source
>>>>>>>     of white-space apply warnings.
>>>>>>> - this mistake fixed
>>>>>>> - more clear slots strategies + "8 positions in two cache-lines"
>>>>>>> strategy.
>>>>>>>
>>>>>>> You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and
>>>>>>> see if it affects measurably.
>>>>>>
>>>>>> Thanks for your quick fixing.  I will retest it tomorrow.
>>>>> Hi, Yura Sokolov
>>>>> Here is my test result of the v3 patch:
>>>>> | case                          | min        | avg        | max        |
>>>>> |-------------------------------+------------+------------+------------|
>>>>> | master (44b61efb79)           | 865,743.55 | 871,237.40 | 874,492.59 |
>>>>> | v3                            | 857,020.58 | 860,180.11 | 864,355.00 |
>>>>> | v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
>>>>> | v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
>>>>> It seems there are some performance decreases :( or something I
>>>>> missed?
>>>>>
>>>>
>>>> Hi, Japin.
>>>> (Excuse me for duplicating message, I found I sent it only to you
>>>> first time).
>>>>
>>> Never mind!
>>>
>>>> v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
>>>> With only 8 in-progress inserters spin-lock is certainly better than
>>>> any more complex solution.
>>>>
>>>> You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
>>>> "master + NUM_XLOGINSERT_LOCKS=64 + v3".
>>>>
>>>> And even this way I don't claim "Lock-free reservation" gives any
>>>> profit.
>>>>
>>>> That is why your benchmarking is very valuable! It could answer, does
>>>> we need such not-small patch, or there is no real problem at all?
>>>>
>>
>> Hi, Yura Sokolov
>>
>> Here is the test result compared with NUM_XLOGINSERT_LOCKS and the v3
>> patch.
>>
>> | case                  | min          | avg          | max          | rate% |
>> |-----------------------+--------------+--------------+--------------+-------|
>> | master (4108440)      | 891,225.77   | 904,868.75   | 913,708.17   |       |
>> | lock 64               | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
>> | lock 64 attempt 1     | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
>> | lock 64 attempt 2     | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
>> | lock 128              | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
>> | lock 128 attempt 1    | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
>> | lock 128 attempt 2    | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
>> | lock 64 v3            | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
>> | lock 64 attempt 1 v3  | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
>> | lock 64 attempt 2 v3  | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
>> | lock 128 v3           | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
>> | lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
>> | lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |

The data looks really interesting, and I recognize the need for further
investigation. I'm not very familiar with BenchmarkSQL, but we ran
similar tests with HammerDB/TPCC, solely increasing
NUM_XLOGINSERT_LOCKS from 8 to 128. We observed a significant
performance drop of ~50%, and the cycle ratio of spinlock acquisition
(s_lock) rose to over 60% of the total, which is basically consistent
with the previous findings in [1].
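
For context, the spinlock behind that s_lock time is insertpos_lck,
which serializes WAL space reservation, so raising NUM_XLOGINSERT_LOCKS
alone mainly shifts contention onto it. The sketch below is my
simplified paraphrase of ReserveXLogInsertLocation() in xlog.c (not a
verbatim copy), just to show why the prev-link bookkeeping forces this
serialization:

/*
 * Simplified paraphrase of ReserveXLogInsertLocation() in xlog.c;
 * xlog-switch handling and other details are omitted.
 */
static void
ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
                          XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
{
    XLogCtlInsert *Insert = &XLogCtl->Insert;
    uint64      startbytepos;
    uint64      endbytepos;
    uint64      prevbytepos;

    size = MAXALIGN(size);

    /*
     * Every inserter, no matter which of the NUM_XLOGINSERT_LOCKS
     * insertion locks it holds, funnels through this one spinlock.
     */
    SpinLockAcquire(&Insert->insertpos_lck);

    startbytepos = Insert->CurrBytePos;
    endbytepos = startbytepos + size;
    prevbytepos = Insert->PrevBytePos;  /* becomes this record's xl_prev */
    Insert->CurrBytePos = endbytepos;
    Insert->PrevBytePos = startbytepos;

    SpinLockRelease(&Insert->insertpos_lck);

    *StartPos = XLogBytePosToRecPtr(startbytepos);
    *EndPos = XLogBytePosToEndRecPtr(endbytepos);
    *PrevPtr = XLogBytePosToRecPtr(prevbytepos);
}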

Could you please share the details of your test environment, including
the device, configuration, and test approach, so we can collaborate on
understanding the differences?
>
> Sorry for pause, it was my birthday, so I was on short vacation.
>
> So, in total:
> - increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
> - additional lock attempts seems to help a bit in this benchmark,
>   but it helps more in other (rather synthetic) benchmark [1]
> - my version of lock-free reservation looks to help a bit when
>   applied alone, but look strange in conjunction with additional
>   lock attempts.
>
> I don't see small improvement from my version of Lock-Free reservation
> (1.1% = 1023/1012) pays for its complexity at the moment.

Due to limited hardware resources, I have only had the opportunity to
measure the performance impact of your v1 lock-free hash table patch
together with NUM_XLOGINSERT_LOCKS=64 and the two-lock-attempts patch.
With that combination I observed an improvement of *76.4%* (RSD: 4.1%)
on the SPR machine with 480 vCPUs. I understand that your test machines
may not have as many cores, which might be why this optimization shows
little impact for you. However, I don't think this is an unreal problem:
the issue was raised by a customer who is trying to deploy Postgres on
machines with hundreds of cores, and I believe resolving this
performance issue would have a real impact.

>
> Probably, when other places will be optimized/improved, it will pay
> more.
>
> Or probably Zhiguo Zhou's version will perform better.
>

The primary difference between our implementations lies in how the
prev-link is handled: via the hash table, or directly within the XLog
buffer. During my analysis I didn't identify significant hotspots in the
hash table functions, which leads me to believe that both
implementations should achieve comparable performance improvements.

Following your advice, I revised my implementation to update the
prev-link atomically and resolved the known TAP test failures. However,
one failure remains, in the recovery/t/027_stream_regress.pl test.
Addressing it might require redesigning the underlying XLog writing
convention, which I don't believe is necessary, especially since your
implementation already achieves the desired performance improvement
without suffering from these test failures. I think we should focus on
your implementation in the next phase.
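
For clarity, here is a minimal sketch of the direction I was
experimenting with (my own illustration, not code from either patch;
LFReserveState, LFReserve and hdr_xl_prev are hypothetical names):
reserve WAL space with a single atomic fetch-add instead of the
spinlock, then publish the prev-link with one atomic 64-bit store
directly into the reserved record header in the XLog buffer.

/*
 * Hypothetical illustration only; not code from either patch.
 */
typedef struct LFReserveState
{
    pg_atomic_uint64 CurrBytePos;   /* next free byte position */
} LFReserveState;

static uint64
LFReserve(LFReserveState *state, int size, uint64 *startbytepos)
{
    size = MAXALIGN(size);

    /* Lock-free reservation of [start, start + size) for this record. */
    *startbytepos = pg_atomic_fetch_add_u64(&state->CurrBytePos, size);

    return *startbytepos + size;    /* end byte position */
}

static void
LFPublishPrevLink(pg_atomic_uint64 *hdr_xl_prev, XLogRecPtr prev_end)
{
    /* One aligned 64-bit store publishes xl_prev, no spinlock needed. */
    pg_atomic_write_u64(hdr_xl_prev, (uint64) prev_end);
}

The hard part, of course, is how the next inserter learns prev_end
without taking a lock; that is exactly where your lock-free hash table
and my in-buffer publication differ.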

> I think, we could measure theoretical benefit by completely ignoring
> fill of xl_prev. I've attached patch "Dumb-lock-free..." so you could
> measure. It passes almost all "recovery" tests, though fails two
> strictly dependent on xl_prev.

I currently don't have access to the high-core-count machine, but I plan
to measure the performance impact of your latest patch and the
"Dumb-lock-free..." patch once I regain access.
>
> [1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
>
> ------
>
> regards
> Yura

Hi Yura and Japin,

Thanks so much for your recent patch work and discussions, which have
inspired me a lot! I agree with you that we need to:
- Align the test approach and environment
- Address the motivation and necessity of this optimization
- Further identify the optimization opportunities after applying Yura's
patch

WDYT?

[1] https://www.postgresql.org/message-id/6ykez6chr5wfiveuv2iby236mb7ab6fqwpxghppdi5ugb4kdyt%40lkrn4maox2wj

Regards,
Zhiguo
