Re: Remove last traces of HPPA support

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Remove last traces of HPPA support
Date: 2024-07-31 19:45:15
Message-ID: 20240731194515.ukmml5mjtg5kn7sk@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2024-07-31 22:32:19 +1200, Thomas Munro wrote:
> > That old comment means that both SpinLockAcquire() and SpinLockRelease()
> > acted as full memory barriers, and looking at the implementations, that
> > was indeed so. With the new implementation, SpinLockAcquire() will have
> > "acquire semantics" and SpinLockRelease will have "release semantics".
> > That's very sensible, and I don't believe it will break anything, but
> > it's a change in semantics nevertheless.
>
> Yeah. It's interesting that our pg_atomic_clear_flag(f) is like
> standard atomic_flag_clear_explicit(f, memory_order_release), not like
> atomic_flag_clear(f) which is short for atomic_flag_clear_explicit(f,
> memory_order_seq_cst). Example spinlock code I've seen written in
> modern C or C++ therefore uses the _explicit variants, so it can get
> acquire/release, which is what people usually want from a lock-like
> thing. What's a good way to test the performance in PostgreSQL?

I've used
c=8;pgbench -n -Mprepared -c$c -j$c -P1 -T10 -f <(echo "SELECT pg_logical_emit_message(false, \:client_id::text, '1'), generate_series(1, 1000) OFFSET 1000;")
in the past. Because of NUM_XLOGINSERT_LOCKS = 8 this ends up with 8 backends
doing tiny xlog insertions and heavily contending on insertpos_lck.

The generate_series() is necessary as otherwise the context-switch and
executor-startup overhead dominates.

> In a naive loop that just test-and-sets and clears a flag a billion times in
> a loop and does nothing else, I see 20-40% performance increase depending on
> architecture when comparing _seq_cst with _acquire/_release.

I'd expect the difference to be even bigger in concurrent workloads on x86-64
- the added memory barrier during lock release really hurts. I have a test
program to play around with this, and in isolation throughput drops to roughly
0.4x with a full-barrier release on my older 2-socket workstation [1]. Of
course it's not trivial to hit "pure enough" cases in the real world.

On said workstation [1], with the above pgbench, I get ~1.95M inserts/sec
(1959 TPS * 1000) on HEAD and 1.80M inserts/sec after adding
#define S_UNLOCK(lock) __atomic_store_n(lock, 0, __ATOMIC_SEQ_CST)

If I change NUM_XLOGINSERT_LOCKS = 40 and use 40 clients, I get
1.03M inserts/sec with the current code and 0.86M inserts/sec with
__ATOMIC_SEQ_CST.

Greetings,

Andres Freund

[1] 2x Xeon Gold 5215
