From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg13.2: invalid memory alloc request size NNNN
Date: 2021-02-12 18:10:52
Message-ID: 20210212181052.GH1793@telsasoft.com
Lists: pgsql-hackers
On Fri, Feb 12, 2021 at 06:44:54PM +0100, Tomas Vondra wrote:
> > (gdb) p len
> > $1 = -4
> >
> > This VM had some issue early today and I killed the VM, causing PG to execute
> > recovery. I'm tentatively blaming that on zfs, so this could conceivably be a
> > data error (although recovery supposedly would have resolved it). I just
> > checked and data_checksums=off.
>
> This seems very much like a corrupted varlena header - length (-4) is
> clearly bogus, and it's what triggers the problem, because that's what wraps
> around to 18446744073709551613 (which is 0xFFFFFFFFFFFFFFFD).
>
> This has to be a value stored in a table, not some intermediate value
> created during execution. So I don't think the exact query matters. Can you
> try doing something like pg_dump, which has to detoast everything?
Right, COPY fails and VACUUM FULL crashes.
message | invalid memory alloc request size 18446744073709551613
query   | COPY child.tt TO '/dev/null';
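For reference, a minimal C sketch of the wraparound arithmetic (the "+ 1"
mirrors an allocation that adds a terminator byte; that detail is a guess for
illustration, not the actual detoast code path):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    int64_t  len = -4;                        /* the bogus length from gdb: p len = -4 */
    uint64_t request = (uint64_t) (len + 1);  /* wraps around zero to 2^64 - 3 */

    printf("%llu\n", (unsigned long long) request);
    /* prints 18446744073709551613, i.e. 0xFFFFFFFFFFFFFFFD */
    return 0;
}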
> The question is whether this is due to the VM getting killed in some strange
> way (what VM system is this, how is the storage mounted?) or whether the
> recovery is borked and failed to do the right thing.
This is qemu/kvm, with block storage:
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/data/postgres'/>
And then more block devices for ZFS vdevs:
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/data/zfs2'/>
...
Those are LVM volumes (I know that ZFS/LVM is discouraged).
$ zpool list -v
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG    CAP  DEDUP  HEALTH  ALTROOT
zfs     762G   577G   185G        -         -   71%    75%  1.00x  ONLINE  -
  vdj   127G  92.7G  34.3G        -         -   64%  73.0%      -  ONLINE
  vdd   127G  95.6G  31.4G        -         -   74%  75.2%      -  ONLINE
  vdf   127G  96.0G  31.0G        -         -   75%  75.6%      -  ONLINE
  vdg   127G  95.8G  31.2G        -         -   74%  75.5%      -  ONLINE
  vdh   127G  95.5G  31.5G        -         -   74%  75.2%      -  ONLINE
  vdi   128G   102G  25.7G        -         -   71%  79.9%      -  ONLINE
This system was recently upgraded to ZFS 2.0.0, and then to 2.0.1:
Jan 21 09:33:26 Installed: zfs-dkms-2.0.1-1.el7.noarch
Dec 23 08:41:21 Installed: zfs-dkms-2.0.0-1.el7.noarch
The VM has gotten "wedged" and I've had to kill it a few times in the last 24h
(needless to say, this is not normal). That part seems like a kernel issue and
not a postgres problem. It's unclear whether that's due to me trying to tickle
the postgres ERROR. It's the latest centos7 kernel: 3.10.0-1160.15.2.el7.x86_64
--
Justin