Re: Server crash on RHEL 9/s390x platform against PG16

From: Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Server crash on RHEL 9/s390x platform against PG16
Date: 2023-10-09 02:51:18
Message-ID: CAF1DzPUV9zhJNXr_npGrZCi3d+__Ob4F1bZx0g4k80zK5_3muA@mail.gmail.com
Lists: pgsql-hackers

This looks like a JIT issue: with JIT disabled, the query from my earlier
mail (quoted below) runs successfully.

postgres=# set jit to off;
SET
postgres=# SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON rm32044_t1.pkey
= rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON rm32044_t3.pkey =
rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden;
 pkey | val  | pkey |  label  | hidden | pkey | val | pkey
------+------+------+---------+--------+------+-----+------
    1 | row1 |    1 | hidden  | t      |    1 |   1 |
    1 | row1 |    1 | hidden  | t      |    2 |   1 |
    2 | row2 |    2 | visible | f      |    1 |   1 |
    2 | row2 |    2 | visible | f      |    2 |   1 |
(4 rows)
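
A possible next step, sketched under assumptions: the settings that were in
effect during the failing run are not shown in this thread, so forcing
jit_above_cost to 0 is only a guess at how to reproduce it, and I am assuming
the build exposes the stock developer setting jit_tuple_deforming. Leaving JIT
on while turning off only the JIT-compiled tuple deforming might show whether
the deforming code or the rest of the generated expression code is involved:

```sql
-- Sketch only: these values are assumptions, not the settings from the failing run.
SET jit = on;
SELECT pg_jit_available();       -- checks that the JIT provider can be loaded in this session
SET jit_above_cost = 0;          -- assumed: force JIT even for this very cheap plan
SET jit_tuple_deforming = off;   -- keep expression JIT, skip JIT-compiled tuple deforming

-- Same query as in the test case quoted below.
SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON rm32044_t1.pkey = rm32044_t2.pkey,
       rm32044_t3 LEFT JOIN rm32044_t4 ON rm32044_t3.pkey = rm32044_t4.pkey
ORDER BY rm32044_t1.pkey, label, hidden;
```

If the query then returns the four rows shown above, the deforming path would
be the more likely suspect; if the backend still crashes, the problem is
probably elsewhere in the JIT-generated code.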

Any idea on this?

On Mon, Sep 18, 2023 at 11:20 AM Suraj Kharage <
suraj(dot)kharage(at)enterprisedb(dot)com> wrote:

> A few more details on this:
>
> (gdb) p val
> $1 = 0
> (gdb) p i
> $2 = 3
> (gdb) f 3
> #3 0x0000000001a1ef70 in ExecCopySlotMinimalTuple (slot=0x202e4f8) at
> ../../../../src/include/executor/tuptable.h:472
> 472 return slot->tts_ops->copy_minimal_tuple(slot);
> (gdb) p *slot
> $3 = {type = T_TupleTableSlot, tts_flags = 16, tts_nvalid = 8, tts_ops =
> 0x1b6dcc8 <TTSOpsVirtual>, tts_tupleDescriptor = 0x202e0e8, tts_values =
> 0x202e540, tts_isnull = 0x202e580, tts_mcxt = 0x1f54550, tts_tid =
> {ip_blkid = {bi_hi = 65535,
> bi_lo = 65535}, ip_posid = 0}, tts_tableOid = 0}
> (gdb) p *slot->tts_tupleDescriptor
> $2 = {natts = 8, tdtypeid = 2249, tdtypmod = -1, tdrefcount = -1, constr =
> 0x0, attrs = 0x202cd28}
>
> (gdb) p slot.tts_values[3]
> $4 = 0
> (gdb) p slot.tts_values[2]
> $5 = 1
> (gdb) p slot.tts_values[1]
> $6 = 34027556
>
>
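> As per the result slot, the attribute at index 3 (column label) has the
> value 0. Since label is a text column, heap_compute_data_size treats that
> value as a pointer at heaptuple.c:227, which matches the segfault.
> I'm testing this in a Docker container and am facing some issues with gdb,
> so I could not debug it further.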
>
> Here is the EXPLAIN plan:
>
> postgres=# explain (verbose, costs off) SELECT * FROM rm32044_t1 LEFT JOIN
> rm32044_t2 ON rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN
> rm32044_t4 ON rm32044_t3.pkey = rm32044_t4.pkey order by
> rm32044_t1.pkey,label,hidden;
>
>                                   QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>  Incremental Sort
>    Output: rm32044_t1.pkey, rm32044_t1.val, rm32044_t2.pkey, rm32044_t2.label, rm32044_t2.hidden, rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey
>    Sort Key: rm32044_t1.pkey, rm32044_t2.label, rm32044_t2.hidden
>    Presorted Key: rm32044_t1.pkey
>    ->  Merge Left Join
>          Output: rm32044_t1.pkey, rm32044_t1.val, rm32044_t2.pkey, rm32044_t2.label, rm32044_t2.hidden, rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey
>          Merge Cond: (rm32044_t1.pkey = rm32044_t2.pkey)
>          ->  Sort
>                Output: rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey, rm32044_t1.pkey, rm32044_t1.val
>                Sort Key: rm32044_t1.pkey
>                ->  Nested Loop
>                      Output: rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey, rm32044_t1.pkey, rm32044_t1.val
>                      ->  Merge Left Join
>                            Output: rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey
>                            Merge Cond: (rm32044_t3.pkey = rm32044_t4.pkey)
>                            ->  Sort
>                                  Output: rm32044_t3.pkey, rm32044_t3.val
>                                  Sort Key: rm32044_t3.pkey
>                                  ->  Seq Scan on public.rm32044_t3
>                                        Output: rm32044_t3.pkey, rm32044_t3.val
>                            ->  Sort
>                                  Output: rm32044_t4.pkey
>                                  Sort Key: rm32044_t4.pkey
>                                  ->  Seq Scan on public.rm32044_t4
>                                        Output: rm32044_t4.pkey
>                      ->  Materialize
>                            Output: rm32044_t1.pkey, rm32044_t1.val
>                            ->  Seq Scan on public.rm32044_t1
>                                  Output: rm32044_t1.pkey, rm32044_t1.val
>          ->  Sort
>                Output: rm32044_t2.pkey, rm32044_t2.label, rm32044_t2.hidden
>                Sort Key: rm32044_t2.pkey
>                ->  Seq Scan on public.rm32044_t2
>                      Output: rm32044_t2.pkey, rm32044_t2.label, rm32044_t2.hidden
> (34 rows)
>
>
> It seems like while building the innerslot for merge join, the value for
> attnum 1 is not getting fetched correctly.
>
> On Tue, Sep 12, 2023 at 3:27 PM Suraj Kharage <
> suraj(dot)kharage(at)enterprisedb(dot)com> wrote:
>
>> Hi,
>>
>> I found a server crash on the RHEL 9/s390x platform with the test case below:
>>
>> *Machine details:*
>> [edb(at)9428da9d2137 postgres]$ cat /etc/redhat-release
>> AlmaLinux release 9.2 (Turquoise Kodkod)
>> [edb(at)9428da9d2137 postgres]$ lscpu
>> Architecture:    s390x
>> CPU op-mode(s):  32-bit, 64-bit
>> Address sizes:   39 bits physical, 48 bits virtual
>> Byte Order:      Big Endian
>>
>> *Configure command:*
>> ./configure --prefix=/home/edb/postgres/ --with-lz4 --with-zstd
>> --with-llvm --with-perl --with-python --with-tcl --with-openssl
>> --enable-nls --with-libxml --with-libxslt --with-systemd --with-libcurl
>> --without-icu --enable-debug --enable-cassert --with-pgport=5414
>>
>>
>> *Test case:*
>> CREATE TABLE rm32044_t1
>> (
>> pkey integer,
>> val text
>> );
>> CREATE TABLE rm32044_t2
>> (
>> pkey integer,
>> label text,
>> hidden boolean
>> );
>> CREATE TABLE rm32044_t3
>> (
>> pkey integer,
>> val integer
>> );
>> CREATE TABLE rm32044_t4
>> (
>> pkey integer
>> );
>> insert into rm32044_t1 values ( 1 , 'row1');
>> insert into rm32044_t1 values ( 2 , 'row2');
>> insert into rm32044_t2 values ( 1 , 'hidden', true);
>> insert into rm32044_t2 values ( 2 , 'visible', false);
>> insert into rm32044_t3 values (1 , 1);
>> insert into rm32044_t3 values (2 , 1);
>>
>> postgres=# SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON
>> rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON
>> rm32044_t3.pkey = rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden;
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>> The connection to the server was lost. Attempting reset: Failed.
>>
>> *backtrace:*
>> [edb(at)9428da9d2137 postgres]$ gdb bin/postgres
>> data/qemu_postgres_20230911-140628_65620.core
>> Core was generated by `postgres: edb postgres [local] SELECT '.
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0 0x00000000010a8366 in heap_compute_data_size
>> (tupleDesc=tupleDesc(at)entry=0x1ba3d10, values=values(at)entry=0x1ba4168,
>> isnull=isnull(at)entry=0x1ba41a8) at heaptuple.c:227
>> 227 VARATT_CAN_MAKE_SHORT(DatumGetPointer(val)))
>> [Current thread is 1 (LWP 65597)]
>> Missing separate debuginfos, use: dnf debuginfo-install
>> glibc-2.34-60.el9.s390x libcap-2.48-8.el9.s390x
>> libedit-3.1-37.20210216cvs.el9.s390x libffi-3.4.2-7.el9.s390x
>> libgcc-11.3.1-4.3.el9.alma.s390x libgcrypt-1.10.0-10.el9_2.s390x
>> libgpg-error-1.42-5.el9.s390x libstdc++-11.3.1-4.3.el9.alma.s390x
>> libxml2-2.9.13-3.el9_2.1.s390x libzstd-1.5.1-2.el9.s390x
>> llvm-libs-15.0.7-1.el9.s390x lz4-libs-1.9.3-5.el9.s390x
>> ncurses-libs-6.2-8.20210508.el9.s390x openssl-libs-3.0.7-17.el9_2.s390x
>> systemd-libs-252-14.el9_2.3.s390x xz-libs-5.2.5-8.el9_0.s390x
>> (gdb) bt
>> #0 0x00000000010a8366 in heap_compute_data_size
>> (tupleDesc=tupleDesc(at)entry=0x1ba3d10, values=values(at)entry=0x1ba4168,
>> isnull=isnull(at)entry=0x1ba41a8) at heaptuple.c:227
>> #1 0x00000000010a9bb0 in heap_form_minimal_tuple
>> (tupleDescriptor=0x1ba3d10, values=0x1ba4168, isnull=0x1ba41a8) at
>> heaptuple.c:1484
>> #2 0x00000000016553fa in ExecCopySlotMinimalTuple (slot=<optimized out>)
>> at ../../../../src/include/executor/tuptable.h:472
>> #3 tuplesort_puttupleslot (state=state(at)entry=0x1be4d18, slot=slot(at)entry=0x1ba4120)
>> at tuplesortvariants.c:610
>> #4 0x00000000012dc0e0 in ExecIncrementalSort (pstate=0x1acb4d8) at
>> nodeIncrementalSort.c:716
>> #5 0x00000000012b32c6 in ExecProcNode (node=0x1acb4d8) at
>> ../../../src/include/executor/executor.h:273
>> #6 ExecutePlan (execute_once=<optimized out>, dest=0x1ade698,
>> direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>,
>> operation=CMD_SELECT, use_parallel_mode=<optimized out>,
>> planstate=0x1acb4d8, estate=0x1acb258) at execMain.c:1670
>> #7 standard_ExecutorRun (queryDesc=0x19ad338, direction=<optimized out>,
>> count=0, execute_once=<optimized out>) at execMain.c:365
>> #8 0x00000000014a6ae2 in PortalRunSelect (portal=portal(at)entry=0x1a63558,
>> forward=forward(at)entry=true, count=0, count(at)entry=9223372036854775807,
>> dest=dest(at)entry=0x1ade698) at pquery.c:924
>> #9 0x00000000014a84e0 in PortalRun (portal=portal(at)entry=0x1a63558,
>> count=count(at)entry=9223372036854775807, isTopLevel=isTopLevel(at)entry=true,
>> run_once=run_once(at)entry=true, dest=dest(at)entry=0x1ade698,
>> altdest=0x1ade698, qc=0x40007ff7b0) at pquery.c:768
>> #10 0x00000000014a3c1c in exec_simple_query (
>> query_string=0x19ea0e8 "SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2
>> ON rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON
>> rm32044_t3.pkey = rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden;")
>> at postgres.c:1274
>> #11 0x00000000014a57aa in PostgresMain (dbname=<optimized out>,
>> username=<optimized out>) at postgres.c:4637
>> #12 0x00000000013fdaf6 in BackendRun (port=0x1a132c0, port=0x1a132c0) at
>> postmaster.c:4464
>> #13 BackendStartup (port=0x1a132c0) at postmaster.c:4192
>> #14 ServerLoop () at postmaster.c:1782
>> #15 0x00000000013fec34 in PostmasterMain (argc=argc(at)entry=3,
>> argv=argv(at)entry=0x19a59a0) at postmaster.c:1466
>> #16 0x0000000001096faa in main (argc=<optimized out>, argv=0x19a59a0) at
>> main.c:198
>>
>> (gdb) p val
>> $1 = 0
>>
>> Does anybody have any idea about this?
>>
>> --
>> --
>>
>> Thanks & Regards,
>> Suraj kharage,
>>
>>
>>
>> edbpostgres.com
>>
>
>
> --
> --
>
> Thanks & Regards,
> Suraj kharage,
>
>
>
> edbpostgres.com
>

--
--

Thanks & Regards,
Suraj kharage,

edbpostgres.com
