Re: FileFallocate misbehaving on XFS

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Harris <harmic(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-10 16:09:41
Message-ID: ndbw5krcrblckalpdcmucu56mofxe5wiifqu2nbfadyz6yv6t6@mslssgqtztn4
Lists: pgsql-hackers

Hi,

On 2024-12-10 17:28:21 +1100, Michael Harris wrote:
> On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
> > xfs_spaceman -c 'freesp -s' /mountpoint
> > xfs_spaceman -c 'health' /mountpoint
> > and df.
>
> I gathered this info from one of the systems that is currently on RL9.
> This system is relatively small compared to some of the others that
> have exhibited this issue, but it is the only one I can access right
> now.

I think it's implied, but I just want to be sure: This was one of the affected
systems?

> # uname -a
> Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
> 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
>
> # xfs_info /dev/mapper/ippvg-ipplv
> meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4, agsize=262471424 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=0, sparse=0, rmapbt=0
>          =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
> data     =                       bsize=4096   blocks=1049885696, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=512639, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

It might be interesting that finobt=0, sparse=0 and nrext64=0. Those all
affect space allocation to some degree, and more recently created filesystems
will have them set to different values, which could explain why you, but not
that many others, hit this issue.

Any chance to get df output? I'm mainly curious about the number of used
inodes.

Could you show the mount options that end up being used?
grep /var/opt /proc/mounts

I rather doubt it is, but it'd sure be interesting if inode32 were used.
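
If it helps, here is a minimal sketch of that check. The has_inode32 name is
made up for illustration; it reads /proc/mounts-formatted lines on stdin and
only looks at the options column for the given mount point.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): succeeds if the mount point
# given as $1 lists inode32 among its mount options.
# Expects /proc/mounts-formatted lines on stdin:
#   device mountpoint fstype options dump pass
has_inode32() {
    awk -v mp="$1" '
        $2 == mp && $4 ~ /(^|,)inode32(,|$)/ { found = 1 }
        END { exit !found }'
}

# Example use on a live system (assumes the /var/opt mount point from above):
#   has_inode32 /var/opt < /proc/mounts && echo "inode32 is in use"
```

Note that an absent inode32 in /proc/mounts is what I'd expect to see; the
point is only to rule it out.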

I assume you have never set XFS options for the PG directory or files within
it? Could you show
xfs_io -r -c lsattr -c stat -c statfs /path/to/directory/with/enospc
?

> # for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
> from to extents blocks pct
> 1 1 37502 37502 0.15
> 2 3 62647 148377 0.59
> 4 7 87793 465950 1.85
> 8 15 135529 1527172 6.08
> 16 31 184811 3937459 15.67
> 32 63 165979 7330339 29.16
> 64 127 101674 8705691 34.64
> 128 255 15123 2674030 10.64
> 256 511 973 307655 1.22
> total free extents 792031
> total free blocks 25134175
> average free extent size 31.7338
> from to extents blocks pct
> 1 1 43895 43895 0.22
> 2 3 59312 141693 0.70
> 4 7 83406 443964 2.20
> 8 15 120804 1362108 6.75
> 16 31 133140 2824317 14.00
> 32 63 118619 5188474 25.71
> 64 127 77960 6751764 33.46
> 128 255 16383 2876626 14.26
> 256 511 1763 546506 2.71
> total free extents 655282
> total free blocks 20179347
> average free extent size 30.7949
> from to extents blocks pct
> 1 1 72034 72034 0.26
> 2 3 98158 232135 0.83
> 4 7 126228 666187 2.38
> 8 15 169602 1893007 6.77
> 16 31 180286 3818527 13.65
> 32 63 164529 7276833 26.01
> 64 127 109687 9505160 33.97
> 128 255 22113 3921162 14.02
> 256 511 1901 592052 2.12
> total free extents 944538
> total free blocks 27977097
> average free extent size 29.6199
> from to extents blocks pct
> 1 1 51462 51462 0.21
> 2 3 98993 233204 0.93
> 4 7 131578 697655 2.79
> 8 15 178151 1993062 7.97
> 16 31 175718 3680535 14.72
> 32 63 145310 6372468 25.48
> 64 127 89518 7749021 30.99
> 128 255 18926 3415768 13.66
> 256 511 2640 813586 3.25
> total free extents 892296
> total free blocks 25006761
> average free extent size 28.0252

So there's *some* imbalance in AG usage, but not a lot. Of course that's
as of this moment, and as you say below, you expire old partitions on a
regular basis...
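
To put rough numbers on that, here is a minimal sketch that turns the "total
free blocks" lines of the freesp output into a per-AG free percentage. It
assumes the AGs appear in order starting at 0 and that every AG has the agsize
reported by xfs_info (262471424 blocks here). The ag_free_pct name is made up
for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads concatenated `freesp -s -a N` output on stdin
# and prints the free-space percentage of each AG.
# $1 is the AG size in filesystem blocks (agsize from xfs_info).
# Assumption: AGs appear in order starting at 0 and all have the same size.
ag_free_pct() {
    awk -v agsize="$1" '
        /total free blocks/ {
            # The last field of the summary line is the free block count.
            printf "AG %d: %d free blocks, %.2f%% free\n", ag, $NF, 100 * $NF / agsize
            ag++
        }'
}

# Example use on a live system (mount point and agsize taken from above):
#   for agno in 0 1 2 3; do
#       xfs_spaceman -c "freesp -s -a $agno" /var/opt
#   done | ag_free_pct 262471424
```

For the numbers quoted above that works out to roughly 9.58%, 7.69%, 10.66%
and 9.53% free for AGs 0 to 3.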

My understanding of XFS's space allocation is that by default it continues to
use the same AG for allocations within one directory, until that AG is full.
For a write heavy postgres workload that's of course not optimal, as all
activity will focus on one AG.

I'd try monitoring the per-ag free space over time and see if the ENOSPC
issue is correlated with one AG getting full. 'freesp' is probably too
expensive for that, but it looks like
xfs_db -r -c agresv /dev/nvme6n1
should work?

Actually that output might be interesting to see, even when you don't hit the
issue.
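
A minimal sketch of what such monitoring could look like, just appending
timestamped raw output to a log so it can be lined up with the time of an
ENOSPC later. The snapshot name, the log path and the 60 second interval are
all assumptions, and the device path needs to be replaced with yours. I have
not tried to parse the agresv output here.

```shell
#!/bin/sh
# Hypothetical helper: runs the given command and appends its output to a
# log file, prefixing every line with a UTC timestamp.
# $1 is the log file, the remaining arguments are the command to run.
snapshot() {
    log="$1"
    shift
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    "$@" 2>&1 | sed "s/^/$ts /" >> "$log"
}

# Example use (log path, interval and device are placeholders):
#   while :; do
#       snapshot /tmp/agresv.log xfs_db -r -c agresv /dev/nvme6n1
#       sleep 60
#   done
```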

> > Can you tell us anything about the workload / data? Lots of tiny tables, lots
> > of big tables, write heavy, ...?
>
> It is a write heavy application which stores mostly time series data.
>
> The time series data is partitioned by time; the application writes
> constantly into the 'current' partition, and data is expired by
> removing the oldest partition. Most of the data is written once and
> not updated.
>
> There are quite a lot of these partitioned tables (in the 1000's or
> 10000's) depending on how the application is configured. Individual
> partitions range in size from a few MB to 10s of GB.

So there are 1000s of tables that are being appended to concurrently, but only
into one partition each. That does make it plausible that there's a
significant amount of fragmentation. Possibly transient due to the expiration.

How many partitions are there for each of the tables? Mainly wondering because
of the number of inodes being used.

Are all of the active tables within one database? That could be relevant due
to per-directory behaviour of free space allocation.

Greetings,

Andres Freund
