Re: BUG #18732: Segfault in pgbench on max_connections starvation

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: mikhail(at)neon(dot)tech, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18732: Segfault in pgbench on max_connections starvation
Date: 2024-12-03 14:52:32
Message-ID: 54bbc27e-73d8-4e56-9fcd-99f2de52ca97@iki.fi
Lists: pgsql-bugs

On 03/12/2024 14:23, PG Bug reporting form wrote:
> When --client connections in pgbench exceed max_connections in postgres,
> pgbench 16 sometimes exits with segfault when a (presumably) ssl
> certificate validation error occurs.
>
> ...
>
> Steps to reproduce:
> 1. Launch a postgres server with max_connections=900
> 2. Launch pgbench a couple of times with -c 2000
>
> I was also able to reproduce this error by running multiple pgbench
> instances with same launch parameters. This error doesn't reproduce on
> pgbench 17.2 or 15.10.
> I can provide the coredump upon request.

I was able to reproduce this on both REL_16_STABLE and REL_17_STABLE.
Didn't try v15, but I presume this issue is present in all branches (see
analysis below).

Backtrace from thread 1:

#0 0x00007f19dfa55516 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#1 0x00007f19dfa55bce in OPENSSL_LH_retrieve () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#2 0x00007f19dfb456d5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#3 0x00007f19dfa2e943 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#4 0x00007f19dfa2edc1 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#5 0x00007f19dfa17eee in EVP_MD_fetch () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#6 0x00007f19dfa1855b in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#7 0x00007f19dfa4c22a in HMAC_Init_ex () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#8 0x00007f19e00a9296 in pg_hmac_init (ctx=ctx@entry=0x7f19cc51bb90,
key=key@entry=0x7f19cc50d560 "foo", len=len@entry=3) at
../src/common/hmac_openssl.c:180
#9 0x00007f19e00a62b0 in scram_SaltedPassword (password=0x7f19cc50d560
"foo", hash_type=<optimized out>, key_length=32, salt=<optimized out>,
saltlen=<optimized out>, iterations=4096,
result=0x7f19cc51bb08
"w\351אI\256\035\330\003y\021ւ\205\327ƿ\217Q\332\362}\a\0364\243^\324\321a\034H0\250P\314\031\177",
errstr=0x7f19dd4bb928) at ../src/common/scram-common.c:87
#10 0x00007f19e0089bcd in calculate_client_proof (state=0x7f19cc51bae0,
client_final_message_without_proof=0x7f19cc50b040
"c=cD10bHMtc2VydmVyLWVuZC1wb2ludCwsvkIO06ZPSH1cmElOgC2DbPafilVET0yej6RhzH30Rzw=,r=Wkk2fofG+RP23HT1tBMqx0ijin6taf2xdjPuJBYqBqw2853/",

result=<optimized out>, errstr=<optimized out>) at
../src/interfaces/libpq/fe-auth-scram.c:788
#11 build_client_final_message (state=0x7f19cc51bae0) at
../src/interfaces/libpq/fe-auth-scram.c:565
#12 scram_exchange (opaq=0x7f19cc51bae0, input=<optimized out>,
inputlen=<optimized out>, output=0x7f19dd4bba28, outputlen=<optimized
out>, done=<optimized out>, success=<optimized out>)
at ../src/interfaces/libpq/fe-auth-scram.c:255
#13 0x00007f19e008a642 in pg_SASL_continue (conn=0x7f19cc4ff1f0,
payloadlen=84, final=<optimized out>) at
../src/interfaces/libpq/fe-auth.c:654
#14 pg_fe_sendauth (areq=11, payloadlen=84,
conn=conn@entry=0x7f19cc4ff1f0) at ../src/interfaces/libpq/fe-auth.c:1139
#15 0x00007f19e008f756 in PQconnectPoll (conn=conn@entry=0x7f19cc4ff1f0)
at ../src/interfaces/libpq/fe-connect.c:3802
#16 0x00007f19e008bae8 in connectDBComplete
(conn=conn@entry=0x7f19cc4ff1f0) at
../src/interfaces/libpq/fe-connect.c:2511
#17 0x00007f19e008b2bf in PQconnectdbParams
(keywords=keywords@entry=0x7f19dd4bc1f0,
values=values@entry=0x7f19dd4bc1b0, expand_dbname=expand_dbname@entry=1)
at ../src/interfaces/libpq/fe-connect.c:685
#18 0x000056350c35efa5 in doConnect () at ../src/bin/pgbench/pgbench.c:1560
#19 0x000056350c35f2c5 in threadRun (arg=0x56350d1184a0) at
../src/bin/pgbench/pgbench.c:7396
#20 0x00007f19dfe1b112 in start_thread (arg=<optimized out>) at
./nptl/pthread_create.c:447
#21 0x00007f19dfe998f8 in __GI___clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 2:

#0 0x00007f19dfe28a04 in _int_free_merge_chunk
(av=av@entry=0x7f19dff70ac0 <main_arena>, p=0x56350d126280, size=144) at
./malloc/malloc.c:4675
#1 0x00007f19dfe28d31 in _int_free (av=0x7f19dff70ac0 <main_arena>,
p=<optimized out>, have_lock=<optimized out>, have_lock@entry=0) at
./malloc/malloc.c:4646
#2 0x00007f19dfe2b4ff in __GI___libc_free (mem=<optimized out>) at
./malloc/malloc.c:3398
#3 0x00007f19dfa5580e in OPENSSL_LH_free () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#4 0x00007f19dfb4489f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#5 0x00007f19dfa6e0e7 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#6 0x00007f19dfb44c35 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#7 0x00007f19dfa565a5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#8 0x00007f19dfa56aa0 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#9 0x00007f19dfa5ac32 in OPENSSL_cleanup () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#10 0x00007f19dfdcb1e1 in __run_exit_handlers (status=status@entry=1,
listp=0x7f19dff70680 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
at ./stdlib/exit.c:108
#11 0x00007f19dfdcb29a in __GI_exit (status=status@entry=1) at
./stdlib/exit.c:138
#12 0x000056350c362ae6 in threadRun (arg=<optimized out>) at
../src/bin/pgbench/pgbench.c:7399
#13 0x00007f19dfe1b112 in start_thread (arg=<optimized out>) at
./nptl/pthread_create.c:447
#14 0x00007f19dfe998f8 in __GI___clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Sometimes you also get this error instead of a crash, which is
presumably another symptom of the same race condition:

pgbench (16.6, server 18devel)
starting vacuum...end.
pgbench: error: connection to server at "localhost" (::1), port 5432
failed: FATAL: sorry, too many clients already
pgbench: error: could not create connection for client 1145
pgbench: error: connection to server at "localhost" (::1), port 5432
failed: could not verify server signature: OpenSSL failure

Once I also got this:

pgbench (17.2, server 18devel)
starting vacuum...end.
pgbench: error: connection to server at "localhost" (::1), port 5432
failed: FATAL: sorry, too many clients already
pgbench: error: could not create connection for client 1045
k5_mutex_lock: Received error 22 (Invalid argument)
*** %n in writable segment detected ***

It looks like a race condition between OpenSSL's exit handler, running in
the thread that called exit(), and the HMAC_Init_ex() call in another
thread. I think we could use the
OPENSSL_INIT_NO_ATEXIT option to prevent the atexit handler from
running. The OpenSSL man page on OPENSSL_init_crypto says:

> OPENSSL_INIT_NO_ATEXIT
>
> By default OpenSSL will attempt to clean itself up when the process
> exits via an "atexit" handler. Using this option suppresses that
> behaviour. This means that the application will have to clean up
> OpenSSL explicitly using OPENSSL_cleanup().

I don't understand why that cleanup would be needed. When the program
exits, all resources are gone anyway.
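
To make the idea concrete, here is a rough, self-contained model of an "opt out of atexit cleanup" init flag. It is not OpenSSL or libpq code: every toylib_* name and the TOYLIB_INIT_NO_ATEXIT flag are hypothetical stand-ins, and the real call would presumably be something along the lines of OPENSSL_init_crypto(OPENSSL_INIT_NO_ATEXIT, NULL), made before anything else initializes the library. The sketch only shows why suppressing the handler removes the window in which exit() frees state that another thread may still be reading.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flag, modeled on OPENSSL_INIT_NO_ATEXIT. */
#define TOYLIB_INIT_NO_ATEXIT 0x1u

/* Stands in for the library's global lookup tables. */
int *toylib_table = NULL;
/* Records whether toylib_init() registered the exit handler. */
int toylib_atexit_registered = 0;

/*
 * Frees the global state. If this runs from exit() while another
 * thread is still inside toylib_lookup(), that thread reads freed
 * memory, which is the kind of race suspected in this report.
 */
void toylib_cleanup(void)
{
    free(toylib_table);
    toylib_table = NULL;
}

/* Initializes the library once. Returns 1 on success, 0 on failure. */
int toylib_init(unsigned int opts)
{
    if (toylib_table != NULL)
        return 1;               /* already initialized */

    toylib_table = calloc(16, sizeof(int));
    if (toylib_table == NULL)
        return 0;

    /* Only register the exit handler if the caller did not opt out. */
    if ((opts & TOYLIB_INIT_NO_ATEXIT) == 0)
    {
        if (atexit(toylib_cleanup) != 0)
        {
            free(toylib_table);
            toylib_table = NULL;
            return 0;
        }
        toylib_atexit_registered = 1;
    }
    return 1;
}

/* Stands in for a call such as HMAC_Init_ex() that reads global state. */
int toylib_lookup(int slot)
{
    assert(toylib_table != NULL);
    assert(slot >= 0 && slot < 16);
    return toylib_table[slot];
}
```

With toylib_init(TOYLIB_INIT_NO_ATEXIT), no handler is registered, so a thread calling exit() never frees the table underneath a concurrent toylib_lookup(); the memory is simply reclaimed by the operating system when the process ends.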

--
Heikki Linnakangas
Neon (https://neon.tech)
