From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Michael Harris <harmic(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-10 11:36:33
Message-ID: CAKZiRmzPnQeofiPNyNBtYvBYOrSB0UX3qGjav5wtGvgZFNRkoQ@mail.gmail.com
Lists: pgsql-hackers
On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic(at)gmail(dot)com> wrote:
Hi Michael,
1. Well, it doesn't look like XFS AG fragmentation to me (we had a customer
with a huge number of AGs with little free space in them reporting such
errors after upgrading to 16, but not on earlier versions; somehow
posix_fallocate() had to be the culprit).
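(If you want to rule the AG fragmentation theory out anyway, a quick
read-only check of the free-extent histogram should do; the device name
below is taken from your xfs_info output, the mount point is just a
placeholder:)
# on the mounted fs (needs xfs_spaceman from xfsprogs):
xfs_spaceman -c 'freesp -s' /path/to/tablespace/mountpoint
# or directly against the device, read-only:
xfs_db -r -c 'freesp -s' /dev/mapper/ippvg-ipplv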
2.
> # xfs_info /dev/mapper/ippvg-ipplv
> meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=0, sparse=0, rmapbt=0
> = reflink=0 bigtime=0 inobtcount=0 nrext64=0
Yay, reflink=0, that's a pretty old fs?!
> ERROR: could not extend file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with FileFallocate(): No space left on device
This indicates it was allocating space in the 1 GB segment ".1" of that
table, on a tablespace that was created more than a year ago. Could you
maybe also get us the output of the commands below? (or from any other
directory exhibiting such errors)
stat pg_tblspc/16401/PG_16_202307071/17643/
ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l
time ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l # to assess the
timing of the getdents() call, as that may indirectly tell us something
about that directory
3. Maybe somehow there is a bigger interaction between posix_fallocate()
and XFS's delayed dynamic speculative preallocation when many processes are
all writing into different partitions? Maybe try the "allocsize=1m" mount
option for that fs and see if that helps. I'm going to speculate about XFS's
speculative :) preallocations, but if we have an fd cache and are *not*
closing fds, how would XFS know to abort its own speculation about a
streaming write? (multiply that by potentially the number of open fds and
you get an avalanche of "preallocations").
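(To be explicit about what I mean, a sketch only, with the mount point as a
placeholder; I'm not sure offhand whether XFS honours allocsize on a
remount, so a clean umount/mount, with PG stopped, may be needed:)
# hypothetical /etc/fstab entry:
/dev/mapper/ippvg-ipplv  /fs  xfs  defaults,allocsize=1m  0 0
# then try to remount (or umount/mount if the option is not picked up):
mount -o remount,allocsize=1m /fs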
4. You could also try compiling with Alvaro's patch from [2],
"0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up
with more clarity about the offsets involved. If not, you could use 'strace
-e fallocate -p <pid>' to capture the exact syscall.
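(A slightly more verbose strace variant may help correlate with the PG log;
the extra flags add timestamps and resolve fds to file paths:)
# -f follows child processes, -ttt prints microsecond timestamps,
# -y resolves fd numbers to paths:
strace -f -ttt -y -e trace=fallocate -p <pid>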
5. Another idea could be catching the kernel-side stack trace of fallocate()
when it hits ENOSPC. E.g. with an XFS fs and the attached bpftrace eBPF
tracer I could get to the source of the problem in my artificial
reproducer, e.g.
# bpftrace ./track_enospc2.bt # wait for "START", then start reproducing in
the second session, but try to minimize the time window, as eBPF might make
things really slow
$ dd if=/dev/zero of=/fs/test1 bs=1M count=200
$ fallocate /fs/test -l 30000000
fallocate: fallocate failed: No space left on device
$ df -h /fs
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 236M 217M 20M 92% /fs
# after CTRL+C in bpftrace, you will get:
@errors[-28, kretprobe:xfs_file_fallocate,
xfs_alloc_file_space+665
xfs_alloc_file_space+665
xfs_file_fallocate+869
vfs_fallocate+319
__x64_sys_fallocate+68
do_syscall_64+130
entry_SYSCALL_64_after_hwframe+118
]: 1
-28 = ENOSPC; xfs_alloc_file_space() was the routine at the root cause, and
the stack shows the full logic behind it. The ABI might be different on your
side due to kernel variations. The tracer could be enhanced, and it might
print too much (so you need to look for that -28 in the output). If you get
any sensible output from it, you could also involve OS support (because if
posix_fallocate() fails while there is still space, then it's pretty odd
anyway).
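(For reference, a minimal sketch of this kind of tracer; it is not
necessarily identical to the attached track_enospc2.bt, and the xfs_* probe
names may vary between kernel versions:)
#!/usr/bin/env bpftrace
BEGIN { printf("START\n"); }

// remember the kernel stack whenever XFS starts allocating file space
kprobe:xfs_alloc_file_space
{
    @stack[tid] = kstack;
}

// if the surrounding fallocate then returns -ENOSPC (-28), count that stack
kretprobe:xfs_file_fallocate
/ (int64)retval == -28 /
{
    @errors[(int64)retval, probe, @stack[tid]] = count();
}

// clean up the per-thread scratch entry in any case
kretprobe:xfs_file_fallocate
{
    delete(@stack[tid]);
}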
-J.
[1] - https://www.postgresql.org/message-id/50A117B6.5030300@optionshouse.com
[2] - https://www.postgresql.org/message-id/202409110955.6njbwzm4ocus%40alvherre.pgsql
Attachment: track_enospc2.bt (application/octet-stream, 561 bytes)