Re: pg13.2: invalid memory alloc request size NNNN

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg13.2: invalid memory alloc request size NNNN
Date: 2021-02-12 18:10:52
Message-ID: 20210212181052.GH1793@telsasoft.com
Lists: pgsql-hackers

On Fri, Feb 12, 2021 at 06:44:54PM +0100, Tomas Vondra wrote:
> > (gdb) p len
> > $1 = -4
> >
> > This VM had an issue earlier today and I killed the VM, causing PG to execute
> > recovery. I'm tentatively blaming that on zfs, so this could conceivably be a
> > data error (although recovery supposedly would have resolved it). I just
> > checked and data_checksums=off.
>
> This seems very much like a corrupted varlena header - the length (-4) is
> clearly bogus, and it's what triggers the problem: allocating len + 1 bytes
> wraps around to 18446744073709551613 (which is 0xFFFFFFFFFFFFFFFD).
>
> This has to be a value stored in a table, not some intermediate value
> created during execution. So I don't think the exact query matters. Can you
> try doing something like pg_dump, which has to detoast everything?
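
For the record, the arithmetic checks out: -4 plus one byte (presumably for a
terminating NUL; I haven't chased down the exact palloc call) reinterpreted as
unsigned 64-bit is exactly the reported request size, trivial to confirm in SQL:

SELECT to_hex((-4 + 1)::bigint);   -- fffffffffffffffd
SELECT 2::numeric^64 + (-4 + 1);   -- 18446744073709551613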

Right, COPY fails and VACUUM FULL crashes.

message | invalid memory alloc request size 18446744073709551613
query   | COPY child.tt TO '/dev/null';
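
To narrow down which tuple is bad, something along these lines might work,
since the alloc failure is raised as a plain ERROR (at least in the COPY case)
and should therefore be trappable. This is just a sketch: it walks child.tt row
by row and forces detoasting by casting each whole row to text:

DO $$
DECLARE
    bad tid;
BEGIN
    FOR bad IN SELECT ctid FROM child.tt LOOP
        BEGIN
            -- row-to-text conversion runs every column through its output
            -- function, which detoasts everything
            PERFORM length(t::text) FROM child.tt t WHERE ctid = bad;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'corrupted tuple at ctid %: %', bad, SQLERRM;
        END;
    END LOOP;
END $$;

Each iteration gets its own subtransaction, so it's slow on a big table, but it
yields a ctid that can then be inspected with pageinspect or deleted.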

> The question is whether this is due to the VM getting killed in some strange
> way (what VM system is this, how is the storage mounted?) or whether the
> recovery is borked and failed to do the right thing.

This is qemu/kvm, with block storage:
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/data/postgres'/>

And then more block devices for ZFS vdevs:
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/data/zfs2'/>
...

Those are LVM volumes (I know that ZFS on top of LVM is discouraged).

$ zpool list -v
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zfs     762G   577G   185G        -         -    71%    75%  1.00x  ONLINE  -
  vdj   127G  92.7G  34.3G        -         -    64%  73.0%      -  ONLINE
  vdd   127G  95.6G  31.4G        -         -    74%  75.2%      -  ONLINE
  vdf   127G  96.0G  31.0G        -         -    75%  75.6%      -  ONLINE
  vdg   127G  95.8G  31.2G        -         -    74%  75.5%      -  ONLINE
  vdh   127G  95.5G  31.5G        -         -    74%  75.2%      -  ONLINE
  vdi   128G   102G  25.7G        -         -    71%  79.9%      -  ONLINE

This system was recently upgraded to ZFS 2.0.0, and then to 2.0.1:

Jan 21 09:33:26 Installed: zfs-dkms-2.0.1-1.el7.noarch
Dec 23 08:41:21 Installed: zfs-dkms-2.0.0-1.el7.noarch

The VM has gotten "wedged" and I've had to kill it a few times in the last 24h
(needless to say, this is not normal). That part looks like a kernel issue, not
a postgres problem. It's unclear whether it's related to my attempts to trigger
the postgres ERROR. This is the latest centos7 kernel: 3.10.0-1160.15.2.el7.x86_64

--
Justin
