Re: FileFallocate misbehaving on XFS

From: Michael Harris <harmic(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-10 06:28:21
Message-ID: CADofcAWphm3uMtXZVCwko15E47HVhksR5YZ2pWhUpEjNz6Hbmw@mail.gmail.com
Lists: pgsql-hackers

Hi Andres

Following up on the earlier question about OS upgrade paths - all the
cases reported so far are either on RL8 (kernel 4.18.0) or on systems
that were upgraded to RL9 (kernel 5.14.0) with the affected filesystems
preserved.
In fact the RL9 systems were initially built as CentOS 7, and when
that went EOL they were upgraded to RL9. The process was as I
described - the /var/opt filesystem which contained the database was
preserved, and the root and other OS filesystems were scratched.
The majority of systems where we have this problem are on RL8.

On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Are you using any filesystem quotas?

No.

> It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
> xfs_spaceman -c 'freesp -s' /mountpoint
> xfs_spaceman -c 'health' /mountpoint
> and df.

I gathered this info from one of the systems that is currently on RL9.
This system is relatively small compared to some of the others that
have exhibited this issue, but it is the only one I can access right
now.

# uname -a
Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4, agsize=262471424 blks
         =                        sectsz=512   attr=2, projid32bit=1
         =                        crc=1        finobt=0, sparse=0, rmapbt=0
         =                        reflink=0    bigtime=0 inobtcount=0 nrext64=0
data     =                        bsize=4096   blocks=1049885696, imaxpct=5
         =                        sunit=0      swidth=0 blks
naming   =version 2               bsize=4096   ascii-ci=0, ftype=1
log      =internal log            bsize=4096   blocks=512639, version=2
         =                        sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                    extsz=4096   blocks=0, rtextents=0
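
For a sense of scale, the geometry above can be converted into sizes with a little arithmetic. This is only a sketch: the numbers are copied from the xfs_info output, and the variable names are illustrative.

```python
# Values copied from the xfs_info output above.
BLOCK_SIZE = 4096          # data bsize, in bytes
DATA_BLOCKS = 1049885696   # data blocks
AG_COUNT = 4               # agcount
AG_SIZE = 262471424        # agsize, in blocks

TIB = 1024 ** 4

fs_bytes = DATA_BLOCKS * BLOCK_SIZE
ag_bytes = AG_SIZE * BLOCK_SIZE

print(f"filesystem size:       {fs_bytes / TIB:.2f} TiB")
print(f"allocation group size: {ag_bytes / TIB:.2f} TiB")
# Sanity check: the allocation groups should account for every data block.
print(f"agcount * agsize == data blocks: {AG_COUNT * AG_SIZE == DATA_BLOCKS}")
```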

# for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
   from      to extents   blocks    pct
      1       1   37502    37502   0.15
      2       3   62647   148377   0.59
      4       7   87793   465950   1.85
      8      15  135529  1527172   6.08
     16      31  184811  3937459  15.67
     32      63  165979  7330339  29.16
     64     127  101674  8705691  34.64
    128     255   15123  2674030  10.64
    256     511     973   307655   1.22
total free extents 792031
total free blocks 25134175
average free extent size 31.7338
   from      to extents   blocks    pct
      1       1   43895    43895   0.22
      2       3   59312   141693   0.70
      4       7   83406   443964   2.20
      8      15  120804  1362108   6.75
     16      31  133140  2824317  14.00
     32      63  118619  5188474  25.71
     64     127   77960  6751764  33.46
    128     255   16383  2876626  14.26
    256     511    1763   546506   2.71
total free extents 655282
total free blocks 20179347
average free extent size 30.7949
   from      to extents   blocks    pct
      1       1   72034    72034   0.26
      2       3   98158   232135   0.83
      4       7  126228   666187   2.38
      8      15  169602  1893007   6.77
     16      31  180286  3818527  13.65
     32      63  164529  7276833  26.01
     64     127  109687  9505160  33.97
    128     255   22113  3921162  14.02
    256     511    1901   592052   2.12
total free extents 944538
total free blocks 27977097
average free extent size 29.6199
   from      to extents   blocks    pct
      1       1   51462    51462   0.21
      2       3   98993   233204   0.93
      4       7  131578   697655   2.79
      8      15  178151  1993062   7.97
     16      31  175718  3680535  14.72
     32      63  145310  6372468  25.48
     64     127   89518  7749021  30.99
    128     255   18926  3415768  13.66
    256     511    2640   813586   3.25
total free extents 892296
total free blocks 25006761
average free extent size 28.0252
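
To put the four histograms side by side, the summary lines can be reduced to a free-space fraction and a mean free extent size per allocation group. The following is a rough sketch, not a diagnostic tool: parse_freesp_summary is a hypothetical helper, the block size and agsize are taken from the xfs_info output, and the summary text is copied from the freesp output above. Note also that the largest bucket listed in every AG is 256 to 511 blocks, which at 4096-byte blocks is just under 2 MiB.

```python
import re

BLOCK_SIZE = 4096      # bsize from xfs_info, in bytes
AG_SIZE = 262471424    # agsize from xfs_info, in blocks


def parse_freesp_summary(text):
    """Return (free_extents, free_blocks) from one 'freesp -s' output."""
    extents = int(re.search(r"total free extents\s+(\d+)", text).group(1))
    blocks = int(re.search(r"total free blocks\s+(\d+)", text).group(1))
    return extents, blocks


# Summary lines copied from the per-AG output above (AG 0 to AG 3).
outputs = [
    "total free extents 792031\ntotal free blocks 25134175",
    "total free extents 655282\ntotal free blocks 20179347",
    "total free extents 944538\ntotal free blocks 27977097",
    "total free extents 892296\ntotal free blocks 25006761",
]

total_extents = 0
total_blocks = 0
for agno, text in enumerate(outputs):
    extents, blocks = parse_freesp_summary(text)
    total_extents += extents
    total_blocks += blocks
    mean_kib = blocks / extents * BLOCK_SIZE / 1024
    print(f"AG {agno}: {blocks / AG_SIZE:.1%} free, "
          f"mean free extent {mean_kib:.0f} KiB")

free_gib = total_blocks * BLOCK_SIZE / 1024 ** 3
print(f"overall: {total_blocks / (AG_SIZE * len(outputs)):.1%} free "
      f"({free_gib:.0f} GiB in {total_extents} extents)")
```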

# xfs_spaceman -c 'health' /var/opt
Health status has not been collected for this filesystem.

> What kind of storage is this on?

As mentioned, there are quite a few systems at different sites, so a
number of different storage solutions are in use, some with directly
attached disks, others with some SAN solutions.
The instance I got the printout above from is a VM, but at the other
site they are all bare metal.

> Was the filesystem ever grown from a smaller size?

I can't say for sure that none of them were, but given the number of
different systems that have this issue I am confident that would not
be a common factor.

> Have you checked the filesystem's internal consistency? I.e. something like
> xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
> unmounted though. But corrupted filesystem datastructures certainly could
> cause spurious ENOSPC.

I executed this on the same system as the above prints came from. It
did not report any issues.

> Are you using pg_upgrade -j?

Yes, we use -j `nproc`

> I assume the file that actually errors out changes over time? It's always
> fallocate() that fails?

Yes, correct, on both counts.
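
For anyone wanting to poke at the failing call outside PostgreSQL, here is a minimal sketch of the same kind of request. It assumes, as the subject line suggests, that FileFallocate() ends up in posix_fallocate(); the helper name extend_with_fallocate and the 16 x 8192 byte length are illustrative only, and the temporary file stands in for a relation segment on the affected filesystem.

```python
import errno
import os
import tempfile


def extend_with_fallocate(fd, offset, length):
    """Reserve `length` bytes at `offset`; return False if ENOSPC is reported."""
    try:
        os.posix_fallocate(fd, offset, length)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False
        raise  # any other error is unexpected here


# Hypothetical request: 16 blocks of 8192 bytes at the start of a new file.
with tempfile.NamedTemporaryFile() as tmp:
    ok = extend_with_fallocate(tmp.fileno(), 0, 16 * 8192)
    size = os.fstat(tmp.fileno()).st_size
    print(f"fallocate succeeded: {ok}, file size now {size} bytes")
```

To test a particular mount point, tempfile.NamedTemporaryFile accepts a dir argument, so the file can be created under the filesystem in question.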

> Can you tell us anything about the workload / data? Lots of tiny tables, lots
> of big tables, write heavy, ...?

It is a write-heavy application which stores mostly time series data.

The time series data is partitioned by time; the application writes
constantly into the 'current' partition, and data is expired by
removing the oldest partition. Most of the data is written once and
not updated.

There are quite a lot of these partitioned tables (in the 1000s or
10000s), depending on how the application is configured. Individual
partitions range in size from a few MB to 10s of GB.

Cheers
Mike.
