From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Harris <harmic(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-16 13:45:37
Message-ID: CAKZiRmz4uXYDUHeraNZOaEFeZqidRPEROZUnnRGbFnMx7f2u0Q@mail.gmail.com
Lists: pgsql-hackers
On Thu, Dec 12, 2024 at 12:50 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi,
>
> FWIW, I tried fairly hard to reproduce this.
>
Same here, but without PG and also without much success. I also tried to push
the AGs (with just one or two AGs created via mkfs) into containing only small
extents (by creating hundreds of thousands of 8kB files), then deleting every
Nth file, and then trying a couple of bigger fallocates/writes to see if that
would blow up on the original CentOS 7.9 / 3.10.x kernel, but no, it did not
blow up. It only failed when df -h was at exactly 100% in multiple scenarios
like that (and yes, sometimes it also freed up a little space out of the
blue). So my take is that it is something related to state (having the fd
open) and concurrency. A rough sketch of that test is below.
An interesting thing I have observed is that the per-directory AG affinity
for big directories (think $PGDATA) is lost once the AG is full, and extents
are then allocated from different AGs (one can use xfs_bmap -vv to compare
the AG placement of the directory inode vs. the files inside it).
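Something along these lines shows it (the paths and OIDs are only examples;
the AG column in the verbose output is the interesting part):

    # AG placement of the directory inode itself...
    xfs_bmap -vv /mnt/test/pgdata/base/16384
    # ...versus the relation files allocated inside it
    xfs_bmap -vv /mnt/test/pgdata/base/16384/24576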
> An extended cycle of 80 backends copying into relations and occasionally
> truncating them (to simulate the partitions being dropped and new ones
> created). For this I ran a 4TB filesystem very close to fully filled
> (peaking
> at 99.998 % full).
>
I could only think of the question: how many files were involved there?
Maybe it is some kind of race between other (or the same) backends
frequently churning their fd caches with open()/close() [defeating XFS's
speculative preallocation] -> XFS ending up fragmented, and only then
posix_fallocate() having issues with larger allocations (>> 8kB)? My take is
that if we send N I/O write vectors, that seems to be handled fine, but when
we throw one big fallocate, it is not -- so maybe posix_fallocate() was in
the process of finding space while some other activity, like close(),
happened to that inode -- but then that does not seem to match the
pg_upgrade scenario.
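The two patterns can be compared directly with xfs_io (file names and sizes
made up): pwrite goes through the regular buffered write path with delayed
allocation, so the allocator may scatter small extents, while falloc issues
a single fallocate(2), which is essentially the call glibc's
posix_fallocate() boils down to on filesystems that support it:

    # N smaller write vectors: buffered writes with delayed allocation
    xfs_io -f -c "pwrite -b 8k 0 64m" /mnt/test/written
    # one big up-front preallocation via fallocate(2)
    xfs_io -f -c "falloc 0 64m" /mnt/test/preallocated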
Well, IMHO we are stuck until Michael provides some more data (the outcome
of the patch, bpf, and maybe other hints and tests).
-J.