RISC-V animals sporadically produce weird memory-related failures

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, pgbf(at)twiska(dot)com
Subject: RISC-V animals sporadically produce weird memory-related failures
Date: 2024-08-22 09:00:00
Message-ID: 025ea176-3a12-e091-82cb-e5c1e4fe191b@gmail.com
Lists: pgsql-hackers

Hello hackers,

While investigating a recent copperhead failure [1] with the following
diagnostics:
2024-08-20 20:56:47.318 CEST [2179731:95] LOG:  server process (PID 2184722) was terminated by signal 11: Segmentation fault
2024-08-20 20:56:47.318 CEST [2179731:96] DETAIL:  Failed process was running: COPY hash_f8_heap FROM
'/home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/data/hash.data';

Core was generated by `postgres: pgbf regression [local] COPY                                        '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002ac8e62674 in heap_multi_insert (relation=0x3f9525c890, slots=0x2ae68a5b30, ntuples=<optimized out>,
cid=<optimized out>, options=<optimized out>, bistate=0x2ae6891c18) at heapam.c:2296
2296            tuple->t_tableOid = slots[i]->tts_tableOid;
#0  0x0000002ac8e62674 in heap_multi_insert (relation=0x3f9525c890, slots=0x2ae68a5b30, ntuples=<optimized out>,
cid=<optimized out>, options=<optimized out>, bistate=0x2ae6891c18) at heapam.c:2296
#1  0x0000002ac8f41656 in table_multi_insert (bistate=<optimized out>, options=<optimized out>, cid=<optimized out>,
nslots=1000, slots=0x2ae68a5b30, rel=<optimized out>) at ../../../src/include/access/tableam.h:1460
#2  CopyMultiInsertBufferFlush (miinfo=miinfo@entry=0x3ff87bceb0, buffer=0x2ae68a5b30,
processed=processed@entry=0x3ff87bce90) at copyfrom.c:415
#3  0x0000002ac8f41f6c in CopyMultiInsertInfoFlush (processed=0x3ff87bce90, curr_rri=0x2ae67eacf8, miinfo=0x3ff87bceb0)
at copyfrom.c:532
#4  CopyFrom (cstate=cstate@entry=0x2ae6897fc0) at copyfrom.c:1242
...
$1 = {si_signo = 11,  ... _sigfault = {si_addr = 0x2ae600cbcc}, ...

I discovered a similar-looking failure [2]:
2023-02-11 18:33:09.222 CET [2591215:73] LOG:  server process (PID 2596066) was terminated by signal 11: Segmentation fault
2023-02-11 18:33:09.222 CET [2591215:74] DETAIL:  Failed process was running: COPY bt_i4_heap FROM
'/home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/data/desc.data';

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002adc9bc61a in heap_multi_insert (relation=0x3fa3bd53a8, slots=0x2b098a13c0, ntuples=<optimized out>,
cid=<optimized out>, options=<optimized out>, bistate=0x2b097eda10) at heapam.c:2095
2095            tuple->t_tableOid = slots[i]->tts_tableOid;

But then I also found different failures on copperhead, all of which look
like memory-related anomalies:
[3]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  fixempties (f=0x0, nfa=0x2b02a59410) at regc_nfa.c:2246
2246                for (a = inarcsorig[s2->no]; a != NULL; a = a->inchain)

[4]
pgsql.build/src/bin/pg_rewind/tmp_check/log/regress_log_004_pg_xlog_symlink
malloc(): memory corruption (fast)

[5]
2022-11-22 20:22:48.907 CET [1364156:4] LOG:  server process (PID 1364221) was terminated by signal 11: Segmentation fault
2022-11-22 20:22:48.907 CET [1364156:5] DETAIL:  Failed process was running: BASE_BACKUP LABEL 'pg_basebackup base
backup' PROGRESS NOWAIT  TABLESPACE_MAP  MANIFEST 'yes'

[6]
psql exited with signal 11 (core dumped): '' while running 'psql -XAtq -d port=60743 host=/tmp/zHq9Kzn2b5
dbname='postgres' -f - -v ON_ERROR_STOP=1' at
/home/pgbf/buildroot/REL_14_STABLE/pgsql.build/contrib/bloom/../../src/test/perl/PostgresNode.pm line 1855.

[7]
- locktype | classid | objid | objsubid |     mode      | granted
+ locktype | classid | objid | objsubid |     mode      | gr_nted
(the most mysterious case)
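
As a quick sanity check on the [7] diff, here is a minimal Python sketch that compares the two header words byte by byte and counts how many bits differ. The helper name differing_bytes is made up for illustration, and the result says nothing about where the corruption happened.

```python
def differing_bytes(expected: bytes, actual: bytes):
    """Return (offset, expected_byte, actual_byte, differing_bit_count) tuples."""
    result = []
    for pos, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            # XOR leaves a 1 in every bit position where the two bytes disagree.
            result.append((pos, e, a, bin(e ^ a).count("1")))
    return result


# Header words taken from the regression diff in [7].
for pos, e, a, bits in differing_bytes(b"granted", b"gr_nted"):
    print(f"offset {pos}: {e:#04x} ({chr(e)!r}) -> {a:#04x} ({chr(a)!r}), "
          f"{bits} bit(s) differ")
```

The only differing byte is 'a' (0x61) versus '_' (0x5f), and five bits differ between them, so this particular change cannot be described as a single flipped bit.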

[8]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  GetMemoryChunkContext (pointer=0x2b21bca1f8) at ../../../../src/include/utils/memutils.h:128
128        context = *(MemoryContext *) (((char *) pointer) - sizeof(void *));
...
$1 = {si_signo = 11, ... _sigfault = {si_addr = 0x2b21bca1f0}, ...
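
The addresses in [8] can be checked against the quoted source line with simple arithmetic. The sketch below assumes sizeof(void *) is 8, which is my assumption for a 64-bit RISC-V build and is not stated in the logs; the helper name chunk_header_address is hypothetical.

```python
SIZEOF_VOID_P = 8  # assumption: 64-bit build, pointer size not shown in the logs


def chunk_header_address(pointer: int, ptr_size: int = SIZEOF_VOID_P) -> int:
    """Address read by *(MemoryContext *) (((char *) pointer) - sizeof(void *))."""
    return pointer - ptr_size


pointer = 0x2b21bca1f8  # argument of GetMemoryChunkContext in the backtrace
si_addr = 0x2b21bca1f0  # faulting address reported in _sigfault

print(hex(chunk_header_address(pointer)), chunk_header_address(pointer) == si_addr)
```

Under that assumption the computed address equals si_addr, so the fault occurred while reading the context pointer stored just before the chunk, exactly at the line shown in frame #0.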

[9]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  fixempties (f=0x0, nfa=0x2ac0bf4c60) at regc_nfa.c:2246
2246                for (a = inarcsorig[s2->no]; a != NULL; a = a->inchain)

Moreover, the other RISC-V animal, boomslang, has produced weird failures too:
[10]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002ae6b50abe in ExecInterpExpr (state=0x2b20ca0040, econtext=0x2b20c9fba8, isnull=<optimized out>) at
execExprInterp.c:678
678                resultslot->tts_values[resultnum] = state->resvalue;

[11]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002addf22728 in ExecInterpExpr (state=0x2ae0af8848, econtext=0x2ae0b16028, isnull=<optimized out>) at
execExprInterp.c:666
666                resultslot->tts_values[resultnum] = scanslot->tts_values[attnum];

[12]
INSERT INTO ftable SELECT * FROM generate_series(1, 70000) i;

Core was generated by `postgres: buildfarm contrib_regression_postgres_fdw [local] INS'.
Program terminated with signal SIGABRT, Aborted.

As far as I can see, these animals run on Debian 10 with kernel
version 5.15.5-2~bpo11+1 (2022-01-10), but RISC-V was declared an
official Debian architecture only on 2023-07-23 [14]. So the installed
OS version may not be stable enough for testing...
(I've tried running the regression tests on a RISC-V machine emulated with
qemu, running Debian trixie, kernel version 6.8.12-1 (2024-05-31), and got
no failures.)

Dear owner of copperhead and boomslang, could you consider upgrading the OS
on these animals to rule out effects of OS anomalies that might already be
fixed? If that's not an option, could you stress-test these machines, say,
with stress-ng?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-08-20%2017%3A59%3A12
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-02-11%2016%3A41%3A58
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-02-09%2001%3A25%3A06
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-03-21%2022%3A58%3A43
[5] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2022-11-22%2019%3A00%3A19
[6] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2022-11-24%2018%3A45%3A45
[7] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-03-19%2017%3A21%3A17
[8] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-03-11%2016%3A54%3A52
[9] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2022-11-11%2021%3A39%3A04
[10] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2023-03-12%2008%3A32%3A48
[11] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2022-09-22%2007%3A38%3A42
[12] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2022-10-18%2006%3A51%3A13
[13] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2022-09-27%2006%3A57%3A38
[14] https://lists.debian.org/debian-riscv/2023/07/msg00053.html

Best regards,
Alexander
