RISC-V animals sporadically produce weird memory-related failures

From:	Alexander Lakhin <exclusion(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, pgbf(at)twiska(dot)com
Subject:	RISC-V animals sporadically produce weird memory-related failures
Date:	2024-08-22 09:00:00
Message-ID:	025ea176-3a12-e091-82cb-e5c1e4fe191b@gmail.com
Lists:	pgsql-hackers

Hello hackers,

While investigating a recent copperhead failure [1] with the following
diagnostics:
2024-08-20 20:56:47.318 CEST [2179731:95] LOG: server process (PID 2184722) was terminated by signal 11: Segmentation fault
2024-08-20 20:56:47.318 CEST [2179731:96] DETAIL: Failed process was running: COPY hash_f8_heap FROM
'/home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/data/hash.data';

Core was generated by `postgres: pgbf regression [local] COPY '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000002ac8e62674 in heap_multi_insert (relation=0x3f9525c890, slots=0x2ae68a5b30, ntuples=<optimized out>,
cid=<optimized out>, options=<optimized out>, bistate=0x2ae6891c18) at heapam.c:2296
2296 tuple->t_tableOid = slots[i]->tts_tableOid;
#0 0x0000002ac8e62674 in heap_multi_insert (relation=0x3f9525c890, slots=0x2ae68a5b30, ntuples=<optimized out>,
cid=<optimized out>, options=<optimized out>, bistate=0x2ae6891c18) at heapam.c:2296
#1 0x0000002ac8f41656 in table_multi_insert (bistate=<optimized out>, options=<optimized out>, cid=<optimized out>,
nslots=1000, slots=0x2ae68a5b30, rel=<optimized out>) at ../../../src/include/access/tableam.h:1460
#2 CopyMultiInsertBufferFlush (miinfo=miinfo@entry=0x3ff87bceb0, buffer=0x2ae68a5b30,
processed=processed@entry=0x3ff87bce90) at copyfrom.c:415
#3 0x0000002ac8f41f6c in CopyMultiInsertInfoFlush (processed=0x3ff87bce90, curr_rri=0x2ae67eacf8, miinfo=0x3ff87bceb0)
at copyfrom.c:532
#4 CopyFrom (cstate=cstate@entry=0x2ae6897fc0) at copyfrom.c:1242
...
$1 = {si_signo = 11, ... _sigfault = {si_addr = 0x2ae600cbcc}, ...

I discovered a similar-looking failure, [2]:
2023-02-11 18:33:09.222 CET [2591215:73] LOG: server process (PID 2596066) was terminated by signal 11: Segmentation fault
2023-02-11 18:33:09.222 CET [2591215:74] DETAIL: Failed process was running: COPY bt_i4_heap FROM
'/home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/data/desc.data';

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000002adc9bc61a in heap_multi_insert (relation=0x3fa3bd53a8, slots=0x2b098a13c0, ntuples=<optimized out>,
cid=<optimized out>, options=<optimized out>, bistate=0x2b097eda10) at heapam.c:2095
2095 tuple->t_tableOid = slots[i]->tts_tableOid;

But then I also found different failures on copperhead, all of which look
like memory-related anomalies:
[3]
Program terminated with signal SIGSEGV, Segmentation fault.
#0 fixempties (f=0x0, nfa=0x2b02a59410) at regc_nfa.c:2246
2246 for (a = inarcsorig[s2->no]; a != NULL; a = a->inchain)

[4]
pgsql.build/src/bin/pg_rewind/tmp_check/log/regress_log_004_pg_xlog_symlink
malloc(): memory corruption (fast)

[5]
2022-11-22 20:22:48.907 CET [1364156:4] LOG: server process (PID 1364221) was terminated by signal 11: Segmentation fault
2022-11-22 20:22:48.907 CET [1364156:5] DETAIL: Failed process was running: BASE_BACKUP LABEL 'pg_basebackup base
backup' PROGRESS NOWAIT TABLESPACE_MAP MANIFEST 'yes'

[6]
psql exited with signal 11 (core dumped): '' while running 'psql -XAtq -d port=60743 host=/tmp/zHq9Kzn2b5
dbname='postgres' -f - -v ON_ERROR_STOP=1' at
/home/pgbf/buildroot/REL_14_STABLE/pgsql.build/contrib/bloom/../../src/test/perl/PostgresNode.pm line 1855.

[8]
Program terminated with signal SIGSEGV, Segmentation fault.
#0 GetMemoryChunkContext (pointer=0x2b21bca1f8) at ../../../../src/include/utils/memutils.h:128
128 context = *(MemoryContext *) (((char *) pointer) - sizeof(void *));
...
$1 = {si_signo = 11, ... _sigfault = {si_addr = 0x2b21bca1f0}, ...

[9]
Program terminated with signal SIGSEGV, Segmentation fault.
#0 fixempties (f=0x0, nfa=0x2ac0bf4c60) at regc_nfa.c:2246
2246 for (a = inarcsorig[s2->no]; a != NULL; a = a->inchain)

Moreover, the other RISC-V animal, boomslang, produced weird failures too:
[10]
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000002ae6b50abe in ExecInterpExpr (state=0x2b20ca0040, econtext=0x2b20c9fba8, isnull=<optimized out>) at
execExprInterp.c:678
678 resultslot->tts_values[resultnum] = state->resvalue;

[11]
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000002addf22728 in ExecInterpExpr (state=0x2ae0af8848, econtext=0x2ae0b16028, isnull=<optimized out>) at
execExprInterp.c:666
666 resultslot->tts_values[resultnum] = scanslot->tts_values[attnum];

[12]
INSERT INTO ftable SELECT * FROM generate_series(1, 70000) i;

Core was generated by `postgres: buildfarm contrib_regression_postgres_fdw [local] INS'.
Program terminated with signal SIGABRT, Aborted.

As far as I can see, these animals run on Debian 10 with the kernel
version 5.15.5-2~bpo11+1 (2022-01-10), but RISC-V was declared an
official Debian architecture on 2023-07-23 [14]. So maybe the OS
version installed is not stable enough for testing...
(I've tried running the regression tests on a RISC-V machine emulated with
qemu, running Debian trixie, kernel version 6.8.12-1 (2024-05-31), and got
no failures.)

Dear copperhead and boomslang owner, could you consider upgrading the OS on
these animals to rule out the effects of OS anomalies that might already be
fixed? If that's not an option, could you perform stress testing of
these machines, say, with stress-ng?

Best regards,
Alexander

Responses

Re: RISC-V animals sporadically produce weird memory-related failures at 2024-11-17 17:28:13 from Tom Turelinckx
