From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: y(dot)sokolov(at)postgrespro(dot)ru
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, michail(dot)nikolaev(at)gmail(dot)com, x4mmm(at)yandex-team(dot)ru, andres(at)anarazel(dot)de, simon(dot)riggs(at)enterprisedb(dot)com
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-04-07 07:55:16
Message-ID: 20220407.165516.1539873599767239351.horikyota.ntt@gmail.com
Lists: pgsql-hackers
Hi, Yura.
At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote in
> Ok, I got access to a stronger server, did the benchmark, found weird
> things, and so here is a new version :-)
Thanks for the new version and benchmarking.
> First, I found that if the table size is strictly limited to NBuffers and FIXED,
> then under high concurrency get_hash_entry may not find a free entry
> even though one must be there. It seems that while a process scans the free
> lists, other concurrent processes "move entries around", i.e. one concurrent
> process fetches an entry from one freelist while another process puts a new
> entry on a different freelist, and the unfortunate process misses it since it
> tests each freelist only once.
StrategyGetBuffer believes that entries don't move across freelists
and it was true before this patch.
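For reference, this is roughly what stock get_hash_entry does once its own
freelist is empty and no new element can be allocated: a single borrowing
pass over the other freelists (a simplified paraphrase of dynahash.c, with
entry-count accounting elided), which is exactly the pass that can miss an
element in flight:

    /*
     * Heavily simplified paraphrase of the single-pass "borrow" in stock
     * get_hash_entry(); not the literal dynahash.c code.
     */
    for (int i = 1; i < NUM_FREELISTS; i++)
    {
        int     borrow_idx = (freelist_idx + i) % NUM_FREELISTS;

        SpinLockAcquire(&hctl->freeList[borrow_idx].mutex);
        newElement = hctl->freeList[borrow_idx].freeList;
        if (newElement != NULL)
        {
            hctl->freeList[borrow_idx].freeList = newElement->link;
            SpinLockRelease(&hctl->freeList[borrow_idx].mutex);
            break;
        }
        SpinLockRelease(&hctl->freeList[borrow_idx].mutex);

        /*
         * Each list is inspected exactly once, so an element that another
         * backend pushes onto it after this point is never seen in this pass.
         */
    }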
> Second, I confirm there is a problem with freelist spreading.
> If I keep the entry's freelist_idx, then one freelist becomes crowded.
> If I use the new entry's freelist_idx, then one freelist is constantly
> emptied.
Perhaps this is what I saw before. I'm not sure about the details of
how that happens, though.
> Third, I found that increased concurrency could harm. When a popular block
> is evicted for some reason, a thundering herd effect occurs:
> many backends want to read the same block, they evict many other
> buffers, but only one is inserted. The others go to the freelist. The evicted
> buffers by themselves reduce the cache hit ratio and provoke more
> work. The old version resists this effect by not removing the old buffer
> before the new entry is successfully inserted.
Nice finding.
> To fix these issues I made the following:
>
> # Concurrency
>
> First, I limit concurrency by introducing another lwlock tranche -
> BufferEvict. It is 8 times larger than the BufferMapping tranche (1024 vs
> 128).
> If a backend doesn't find the buffer in the buffer table and wants to
> introduce it, it first calls
> LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
> If the lock is acquired, it proceeds to the eviction-and-replacement process.
> Otherwise, it waits for the lock to be released and repeats the search.
>
> This greatly improves performance for > 400 clients in pgbench.
So the performance difference between the existing code and v11 is that the
latter has a collision cross section eight times smaller than the
former?
+ * Prevent "thundering herd" problem and limit concurrency.
This is something like pressing the accelerator and brake pedals at the
same time. If it improves performance, wouldn't just increasing the number of
buffer partitions work?
It's also not great that the follower backends run a busy loop on the
lock until the top-runner backend inserts the new buffer into the
buftable and then releases the newPartitionLock.
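To keep the discussion concrete, here is how I read the described flow (a
rough sketch only; newPartitionLock/newTag/newHash follow the existing
BufferAlloc code, newEvictPartitionLock follows your description, and this
is not the patch itself):

    /*
     * Sketch of the gate described above, not the actual patch: only one
     * backend per evict partition proceeds to eviction, the others wait on
     * the lock and then retry the hash lookup.
     */
    for (;;)
    {
        LWLockAcquire(newPartitionLock, LW_SHARED);
        buf_id = BufTableLookup(&newTag, newHash);
        LWLockRelease(newPartitionLock);

        if (buf_id >= 0)
            return GetBufferDescriptor(buf_id); /* someone else read it in */

        /* Returns true only if the lock was free and is now held by us. */
        if (LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE))
            break;
        /* Otherwise the winner released the lock; loop and look up again. */
    }
    /* ...select a victim, insert the new entry, release newEvictPartitionLock... */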
> I tried another variant as well:
> - first insert an entry with a dummy buffer index into the buffer table.
> - if such an entry is already there, then wait for it to be filled.
> - otherwise find a victim buffer and replace the dummy index with the new one.
> Waiting was done with a shared lock on EvictPartitionLock as well.
> This variant performed about the same.
This one looks better to me. Since a partition can be shared by two or
more new buffers, a condition variable seems to work better here...
> Logically I like that variant more, but there is one gotcha:
> FlushBuffer could fail with elog(ERROR). Therefore there is
> a need to reliably remove the entry with the dummy index.
Perhaps UnlockBuffers can do that.
> And after all, I still need to hold EvictPartitionLock to notify
> waiters.
> I've tried to use ConditionVariable, but its performance was much
> worse.
How many CVs did you use?
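For illustration, a minimal sketch of the dummy-index variant with one
condition variable per eviction partition; evict_cvs[], DUMMY_BUF_ID and the
wait event are invented for this sketch, and the FlushBuffer ERROR cleanup is
omitted:

    /* Sketch only: evict_cvs[] and DUMMY_BUF_ID are not from the patch. */
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    buf_id = BufTableLookup(&newTag, newHash);
    if (buf_id < 0)
    {
        /* We are the first; claim the slot with a dummy index. */
        BufTableInsert(&newTag, newHash, DUMMY_BUF_ID);
        LWLockRelease(newPartitionLock);
        /* ...find a victim, FlushBuffer(), then replace DUMMY_BUF_ID... */
        ConditionVariableBroadcast(&evict_cvs[partition]);
    }
    else if (buf_id == DUMMY_BUF_ID)
    {
        LWLockRelease(newPartitionLock);
        /* Somebody else is filling the slot; sleep instead of spinning. */
        ConditionVariablePrepareToSleep(&evict_cvs[partition]);
        for (;;)
        {
            LWLockAcquire(newPartitionLock, LW_SHARED);
            buf_id = BufTableLookup(&newTag, newHash);
            LWLockRelease(newPartitionLock);
            if (buf_id != DUMMY_BUF_ID)
                break;
            /* wait event chosen only for illustration */
            ConditionVariableSleep(&evict_cvs[partition], WAIT_EVENT_BUFFER_IO);
        }
        ConditionVariableCancelSleep();
    }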
> # Dynahash capacity and freelists.
>
> I reverted the buffer table initialization:
> - removed the FIXES_SIZE restriction introduced in the previous version
Mmm. I don't see v10 in this list and v9 doesn't contain FIXES_SIZE..
> - returned `NBuffers + NUM_BUFFER_PARTITIONS`.
> I really think there should be more spare items, since almost always
> entry_alloc is called at least once (on 128MB shared_buffers). But
> let's keep it as is for now.
Maybe s/entry_alloc/element_alloc/ ? :p
I see it with shared_buffers=128kB (not MB) and pgbench -i on master.
The required number of elements is already allocated to the freelists at
hash creation. So the reason for the call is imbalanced use among
freelists. Even in that case the other freelists hold elements, so we
don't need to allocate extra elements.
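(For reference, the freelist an entry is charged to is derived from its hash
code, roughly as below, paraphrasing dynahash.c; that is why usage can become
imbalanced when the workload concentrates on a subset of hash codes.)

    /*
     * Paraphrased from dynahash.c: for a partitioned hash table the freelist
     * is chosen from the entry's hash code, so allocation and freeing can
     * cluster on a few of the NUM_FREELISTS (32) lists.
     */
    #define FREELIST_IDX(hctl, hashcode) \
        (IS_PARTITIONED(hctl) ? (hashcode) % NUM_FREELISTS : 0)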
> `get_hash_entry` was changed to probe NUM_FREELISTS/4 (==8) freelists
> before falling back to `entry_alloc`, and probing was changed from
> linear to quadratic. This greatly reduces the number of calls to
> `entry_alloc`, so more shared memory is left intact. And I didn't notice
> a large performance hit from it. Probably there is some, but I think it is
> an adequate trade-off.
I don't think that causes a significant performance hit, but I don't
understand how it improves the freelist hit ratio other than by accident.
Could you give some reasoning for it?
By the way, the change to get_hash_entry looks somewhat wrong.
If I understand it correctly, it visits num_freelists/4 freelists at
once, then tries element_alloc. If element_alloc() fails (which must
happen), it only tries freeList[freelist_idx] and gives up, even
though there must be an element in the other 3/4 of the freelists.
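That is, as I read the description, the probing looks roughly like this (a
sketch, not the patch; try_pop_freelist() is an invented helper):

    /* Sketch of the described probing; not the patch code. */
    uint32  idx = freelist_idx;

    for (int probe = 1; probe <= NUM_FREELISTS / 4; probe++)
    {
        newElement = try_pop_freelist(hctl, idx);   /* hypothetical helper */
        if (newElement != NULL)
            return newElement;
        idx = (idx + probe) % NUM_FREELISTS;        /* quadratic step */
    }

    /*
     * Fall back to allocating new elements.  As noted above, if this fails
     * only freeList[freelist_idx] has effectively been retried, and the
     * remaining 3/4 of the freelists are never visited.
     */
    if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
        return NULL;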
> `free_reused_entry` now returns the entry to a random position. It flattens
> the spread of free entries. Although it is not enough without the other changes
> (thundering herd mitigation and probing more lists in get_hash_entry).
If "thudering herd" means "many backends rush trying to read-in the
same page at once", isn't it avoided by the change in BufferAlloc?
I feel the random returning method might work. I want to get rid of
the randomness here but I don't come up with a better way.
Anyway the code path is used only by buftable so it doesn't harm
generally.
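For clarity, the random-return idea as I understand it would look roughly
like this (free_reused_entry is the name from your description; the body,
the RNG choice and the per-freelist entry accounting are only assumptions of
this sketch):

    /* Sketch only: push a reused element onto a random freelist. */
    static void
    free_reused_entry(HTAB *hashp, HASHELEMENT *entry)
    {
        HASHHDR *hctl = hashp->hctl;
        int      idx = random() % NUM_FREELISTS;    /* random target list */

        SpinLockAcquire(&hctl->freeList[idx].mutex);
        entry->link = hctl->freeList[idx].freeList;
        hctl->freeList[idx].freeList = entry;
        SpinLockRelease(&hctl->freeList[idx].mutex);
    }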
> # Benchmarks
# Thanks for benchmarking!!
> Benchmarked on a two-socket Xeon(R) Gold 5220 CPU @2.20GHz,
> 18 cores per socket + hyper-threading - up to 72 virtual cores total.
> turbo-boost disabled
> Linux 5.10.103-1 Debian.
>
> pgbench scale 100 simple_select + simple select with 3 keys (sql file
> attached).
>
> shared buffers 128MB & 1GB
> huge_pages=on
>
> 1 socket
> conns | master | patch-v11 | master 1G | patch-v11 1G
> --------+------------+------------+------------+------------
> 1 | 27882 | 27738 | 32735 | 32439
> 2 | 54082 | 54336 | 64387 | 63846
> 3 | 80724 | 81079 | 96387 | 94439
> 5 | 134404 | 133429 | 160085 | 157399
> 7 | 185977 | 184502 | 219916 | 217142
v11+128MB degrades above here..
> 17 | 335345 | 338214 | 393112 | 388796
> 27 | 393686 | 394948 | 447945 | 444915
> 53 | 572234 | 577092 | 678884 | 676493
> 83 | 558875 | 561689 | 669212 | 655697
> 107 | 553054 | 551896 | 654550 | 646010
> 139 | 541263 | 538354 | 641937 | 633840
> 163 | 532932 | 531829 | 635127 | 627600
> 191 | 524647 | 524442 | 626228 | 617347
> 211 | 521624 | 522197 | 629740 | 613143
v11+1GB degrades above here..
> 239 | 509448 | 554894 | 652353 | 652972
> 271 | 468190 | 557467 | 647403 | 661348
> 307 | 454139 | 558694 | 642229 | 657649
> 353 | 446853 | 554301 | 635991 | 654571
> 397 | 441909 | 549822 | 625194 | 647973
>
> 1 socket 3 keys
>
> conns | master | patch-v11 | master 1G | patch-v11 1G
> --------+------------+------------+------------+------------
> 1 | 16677 | 16477 | 22219 | 22030
> 2 | 32056 | 31874 | 43298 | 43153
> 3 | 48091 | 47766 | 64877 | 64600
> 5 | 78999 | 78609 | 105433 | 106101
> 7 | 108122 | 107529 | 148713 | 145343
v11+128MB degrades above here..
> 17 | 205656 | 209010 | 272676 | 271449
> 27 | 252015 | 254000 | 323983 | 323499
v11+1GB degrades above here..
> 53 | 317928 | 334493 | 446740 | 449641
> 83 | 299234 | 327738 | 437035 | 443113
> 107 | 290089 | 322025 | 430535 | 431530
> 139 | 277294 | 314384 | 422076 | 423606
> 163 | 269029 | 310114 | 416229 | 417412
> 191 | 257315 | 306530 | 408487 | 416170
> 211 | 249743 | 304278 | 404766 | 416393
> 239 | 243333 | 310974 | 397139 | 428167
> 271 | 236356 | 309215 | 389972 | 427498
> 307 | 229094 | 307519 | 382444 | 425891
> 353 | 224385 | 305366 | 375020 | 423284
> 397 | 218549 | 302577 | 364373 | 420846
>
> 2 sockets
>
> conns | master | patch-v11 | master 1G | patch-v11 1G
> --------+------------+------------+------------+------------
> 1 | 27287 | 27631 | 32943 | 32493
> 2 | 52397 | 54011 | 64572 | 63596
> 3 | 76157 | 80473 | 93363 | 93528
> 5 | 127075 | 134310 | 153176 | 149984
> 7 | 177100 | 176939 | 216356 | 211599
> 17 | 379047 | 383179 | 464249 | 470351
> 27 | 545219 | 546706 | 664779 | 662488
> 53 | 728142 | 728123 | 857454 | 869407
> 83 | 918276 | 957722 | 1215252 | 1203443
v11+1GB degrades above here..
> 107 | 884112 | 971797 | 1206930 | 1234606
> 139 | 822564 | 970920 | 1167518 | 1233230
> 163 | 788287 | 968248 | 1130021 | 1229250
> 191 | 772406 | 959344 | 1097842 | 1218541
> 211 | 756085 | 955563 | 1077747 | 1209489
> 239 | 732926 | 948855 | 1050096 | 1200878
> 271 | 692999 | 941722 | 1017489 | 1194012
> 307 | 668241 | 920478 | 994420 | 1179507
> 353 | 642478 | 908645 | 968648 | 1174265
> 397 | 617673 | 893568 | 950736 | 1173411
>
> 2 sockets 3 keys
>
> conns | master | patch-v11 | master 1G | patch-v11 1G
> --------+------------+------------+------------+------------
> 1 | 16722 | 16393 | 20340 | 21813
> 2 | 32057 | 32009 | 39993 | 42959
> 3 | 46202 | 47678 | 59216 | 64374
> 5 | 78882 | 72002 | 98054 | 103731
> 7 | 103398 | 99538 | 135098 | 135828
v11+128MB degrades above here..
> 17 | 205863 | 217781 | 293958 | 299690
> 27 | 283526 | 290539 | 414968 | 411219
> 53 | 336717 | 356130 | 460596 | 474563
> 83 | 307310 | 342125 | 419941 | 469989
> 107 | 294059 | 333494 | 405706 | 469593
> 139 | 278453 | 328031 | 390984 | 470553
> 163 | 270833 | 326457 | 384747 | 470977
> 191 | 259591 | 322590 | 376582 | 470335
> 211 | 263584 | 321263 | 375969 | 469443
> 239 | 257135 | 316959 | 370108 | 470904
> 271 | 251107 | 315393 | 365794 | 469517
> 307 | 246605 | 311585 | 360742 | 467566
> 353 | 236899 | 308581 | 353464 | 466936
> 397 | 249036 | 305042 | 344673 | 466842
>
> I skipped v10 since I used it internally for the variant
> "insert entry with dummy index then search victim".
A gain of up to about 15%(?) is great.
I'm not sure it is okay that it seems to slow down by about 1%..
Ah, I see.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center