From: Japin Li <japinli(at)hotmail(dot)com>
To: "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, wenhui qiu <qiuwenhuifx(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] Lock-free XLog Reservation from WAL
Date: 2025-02-05 08:32:30
Message-ID: ME0P300MB0445AF9F54F2B738EE09AE61B6F72@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
Lists: pgsql-hackers
On Mon, 27 Jan 2025 at 17:30, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com> wrote:
> On 1/26/2025 10:59 PM, Yura Sokolov wrote:
>> 24.01.2025 12:07, Japin Li пишет:
>>> On Thu, 23 Jan 2025 at 21:44, Japin Li <japinli(at)hotmail(dot)com> wrote:
>>>> On Thu, 23 Jan 2025 at 15:03, Yura Sokolov
>>>> <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
>>>>> 23.01.2025 11:46, Japin Li пишет:
>>>>>> On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli(at)hotmail(dot)com> wrote:
>>>>>>> On Wed, 22 Jan 2025 at 17:02, Yura Sokolov
>>>>>>> <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
>>>>>>>> I believe, I know why it happens: I was in hurry making v2 by
>>>>>>>> cherry-picking internal version. I reverted some changes in
>>>>>>>> CalcCuckooPositions manually and forgot to add modulo
>>>>>>>> PREV_LINKS_HASH_CAPA.
>>>>>>>>
>>>>>>>> Here's the fix:
>>>>>>>>
>>>>>>>> pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
>>>>>>>> - pos->pos[1] = pos->pos[0] + 1;
>>>>>>>> + pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
>>>>>>>> pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
>>>>>>>> - pos->pos[3] = pos->pos[2] + 2;
>>>>>>>> + pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
>>>>>>>>
>>>>>>>> Any way, here's v3:
>>>>>>>> - excess file "v0-0001-Increase..." removed. I believe it was source
>>>>>>>> of white-space apply warnings.
>>>>>>>> - this mistake fixed
>>>>>>>> - more clear slots strategies + "8 positions in two
>>>>>>>> cache-lines" strategy.
>>>>>>>>
>>>>>>>> You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3
>>>>>>>> and see
>>>>>>>> if it affects measurably.
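For reference, the corrected slot calculation amounts to the minimal, self-contained sketch below. The stand-in struct, the capacity value, and the function name are illustrative only; the authoritative code is CalcCuckooPositions() in the v3 patch.

/*
 * Sketch of the corrected candidate-slot calculation.  pos[0]/pos[1] are
 * derived from the full hash, pos[2]/pos[3] from its upper 16 bits; every
 * slot must be reduced modulo PREV_LINKS_HASH_CAPA so the +1/+2 offsets
 * wrap instead of running past the table.
 */
#include <stdint.h>

#define PREV_LINKS_HASH_CAPA 256        /* illustrative capacity only */

typedef struct PrevLinksHashPosSketch
{
    uint32_t    pos[4];                 /* four cuckoo candidate slots */
} PrevLinksHashPosSketch;

static inline void
CalcCuckooPositionsSketch(PrevLinksHashPosSketch *pos, uint32_t hash)
{
    pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
    pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;    /* wrap, don't overflow */
    pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
    pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;    /* wrap, don't overflow */
}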
>>>>>>>
>>>>>>> Thanks for your quick fixing. I will retest it tomorrow.
>>>>>> Hi, Yura Sokolov
>>>>>> Here is my test result of the v3 patch:
>>>>>> | case                          | min        | avg        | max        |
>>>>>> |-------------------------------+------------+------------+------------|
>>>>>> | master (44b61efb79)           | 865,743.55 | 871,237.40 | 874,492.59 |
>>>>>> | v3                            | 857,020.58 | 860,180.11 | 864,355.00 |
>>>>>> | v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
>>>>>> | v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
>>>>>> It seems there are some performance decreases :( or something I
>>>>>> missed?
>>>>>>
>>>>>
>>>>> Hi, Japin.
>>>>> (Excuse me for the duplicate message; I found I had sent it only to
>>>>> you the first time.)
>>>>>
>>>> Never mind!
>>>>
>>>>> v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
>>>>> With only 8 in-progress inserters, a spin-lock is certainly better
>>>>> than any more complex solution.
>>>>>
>>>>> You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
>>>>> "master + NUM_XLOGINSERT_LOCKS=64 + v3".
>>>>>
>>>>> And even this way I don't claim "Lock-free reservation" gives any
>>>>> profit.
>>>>>
>>>>> That is why your benchmarking is very valuable! It could answer whether
>>>>> we need such a not-small patch, or whether there is no real problem at all.
>>>>>
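For context on the builds compared above and in the table below: NUM_XLOGINSERT_LOCKS is a compile-time constant (a #define in src/backend/access/transam/xlog.c in recent sources), so the "lock 64" configurations presumably boil down to a one-line change along these lines, with "lock 128" using 128 instead (exact placement may differ across versions):

-#define NUM_XLOGINSERT_LOCKS	8
+#define NUM_XLOGINSERT_LOCKS	64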
>>>
>>> Hi, Yura Sokolov
>>>
>>> Here is the test result comparing different NUM_XLOGINSERT_LOCKS values
>>> and the v3 patch.
>>>
>>> | case                  | min          | avg          | max          | rate% |
>>> |-----------------------+--------------+--------------+--------------+-------|
>>> | master (4108440)      |   891,225.77 |   904,868.75 |   913,708.17 |       |
>>> | lock 64               | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
>>> | lock 64 attempt 1     | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
>>> | lock 64 attempt 2     | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
>>> | lock 128              | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
>>> | lock 128 attempt 1    | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
>>> | lock 128 attempt 2    | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
>>> | lock 64 v3            | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
>>> | lock 64 attempt 1 v3  | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
>>> | lock 64 attempt 2 v3  | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
>>> | lock 128 v3           | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
>>> | lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
>>> | lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |
>
> The data looks really interesting and I recognize the need for further
> investigation. I'm not very familiar with BenchmarkSQL but we've done
> similar tests with HammerDB/TPCC by solely increasing
> NUM_XLOGINSERT_LOCKS from 8 to 128, and we observed a significant
> performance drop of ~50% and the cycle ratio of spinlock acquisition
> (s_lock) rose to over 60% of the total, which is basically consistent
> with the previous findings in [1].
>
> Could you please share the details of your test environment, including
> the device, configuration, and test approach, so we can collaborate on
> understanding the differences?
Sorry for the late reply. I'm on vacation.

I use a Hygon C86 7490 64-core machine; it has 8 NUMA nodes and 1.5T of
memory. I run the database on NUMA nodes 0-3 and BenchmarkSQL on nodes 4-7.

Here are my database settings:
listen_addresses = '*'
max_connections = '1050'
shared_buffers = '100GB'
work_mem = '64MB'
maintenance_work_mem = '512MB'
max_wal_size = '50GB'
min_wal_size = '10GB'
random_page_cost = '1.1'
wal_buffers = '1GB'
wal_level = 'minimal'
max_wal_senders = '0'
wal_sync_method = 'open_datasync'
wal_compression = 'lz4'
track_activities = 'off'
checkpoint_timeout = '1d'
checkpoint_completion_target = '0.95'
effective_cache_size = '300GB'
effective_io_concurrency = '32'
update_process_title = 'off'
password_encryption = 'md5'
huge_pages = 'on'
>> Sorry for pause, it was my birthday, so I was on short vacation.
>> So, in total:
>> - increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
>> - additional lock attempts seem to help a bit in this benchmark,
>> but they help more in another (rather synthetic) benchmark [1]
>> - my version of lock-free reservation looks to help a bit when
>> applied alone, but looks strange in conjunction with additional
>> lock attempts.
>> I don't see that the small improvement from my version of lock-free
>> reservation (1.1% = 1023/1012) pays for its complexity at the moment.
>
> Due to limited hardware resources, I only had the opportunity to
> measure the performance impact of your v1 patch of the lock-free hash
> table with 64 NUM_XLOGINSERT_LOCKS and the two lock attempt patch. I
> observed an improvement of *76.4%* (RSD: 4.1%) when combining them
> together on the SPR with 480 vCPUs. I understand that your test
> devices may not have as many cores, which might be why this
> optimization brings an unnoticeable impact. However, I don't think
> this is an unreal problem. In fact, this issue was raised by our
> customer who is trying to deploy Postgres on devices with hundreds of
> cores, and I believe the resolution of this performance issue would
> result in real impacts.
>
>> Probably, when other places are optimized/improved, it will pay
>> more.
>> Or probably Zhiguo Zhou's version will perform better.
>>
>
> Our primary difference lies in the approach to handling the prev-link,
> either via the hash table or directly within the XLog buffer. During
> my analysis, I didn't identify significant hotspots in the hash table
> functions, leading me to believe that both implementations should
> achieve comparable performance improvements.
>
> Following your advice, I revised my implementation to update the
> prev-link atomically and resolved the known TAP tests. However, I
> encountered the last failure in the recovery/t/027_stream_regress.pl
> test. Addressing this issue might require a redesign of the underlying
> writing convention of XLog, which I believe is not necessary,
> especially since your implementation already achieves the desired
> performance improvements without suffering from the test failures. I
> think we may need to focus on your implementation in the next phase.
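To make the "update the prev-link atomically" idea above concrete, here is a rough, hypothetical sketch. The slot struct and helper names are invented for illustration, and only the pg_atomic_* primitives are PostgreSQL's real API; neither patch necessarily implements it this way.

/*
 * Hypothetical sketch: publish a record's prev-link with one atomic 64-bit
 * store so a concurrent reader can never observe a torn XLogRecPtr.  The
 * slot is assumed to have been set up with pg_atomic_init_u64(..., 0).
 */
#include "postgres.h"
#include "access/xlogdefs.h"
#include "port/atomics.h"

typedef struct PrevLinkSlot
{
    pg_atomic_uint64 prev_lsn;      /* 0 (InvalidXLogRecPtr) = not yet published */
} PrevLinkSlot;

/* Writer: after reserving its space, publish the prev-link in one store. */
static inline void
PublishPrevLink(PrevLinkSlot *slot, XLogRecPtr prev)
{
    pg_atomic_write_u64(&slot->prev_lsn, (uint64) prev);
}

/* Reader: busy-wait until the link has been published. */
static inline XLogRecPtr
WaitForPrevLink(PrevLinkSlot *slot)
{
    uint64      val;

    while ((val = pg_atomic_read_u64(&slot->prev_lsn)) == 0)
        ;                           /* real code would insert a spin-delay here */

    return (XLogRecPtr) val;
}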
>
>> I think we could measure the theoretical benefit by completely ignoring
>> the fill of xl_prev. I've attached the patch "Dumb-lock-free..." so you
>> could measure it. It passes almost all "recovery" tests, though it fails
>> the two that strictly depend on xl_prev.
>
> I currently don't have access to the high-core-count device, but I
> plan to measure the performance impact of your latest patch and the
> "Dumb-lock-free..." patch once I regain access.
>> [1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
>> ------
>> regards
>> Yura
>
> Hi Yura and Japin,
>
> Thanks so much for your recent patch work and discussions, which
> inspired me a lot! I agree with you that we need to:
> - Align the test approach and environment
> - Address the motivation and necessity of this optimization
> - Further identify the optimization opportunities after applying
> Yura's patch
>
> WDYT?
>
> [1]
> https://www.postgresql.org/message-id/6ykez6chr5wfiveuv2iby236mb7ab6fqwpxghppdi5ugb4kdyt%40lkrn4maox2wj
>
> Regards,
> Zhiguo
--
Regards,
Japin Li