From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Harris <harmic(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-16 13:45:37
Message-ID: CAKZiRmz4uXYDUHeraNZOaEFeZqidRPEROZUnnRGbFnMx7f2u0Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 12, 2024 at 12:50 AM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> FWIW, I tried fairly hard to reproduce this.
>

Same here, but without PG and also without much success. I also tried to push
the AGs (with just one or two AGs created via mkfs) to contain only small
extents (by creating hundreds of thousands of 8 kB files), then deleting every
Nth file, and then issuing a couple of bigger fallocate()/write() calls to see
whether that would blow up on the original CentOS 7.9 / 3.10.x kernel, but no,
it did not. It only failed when df -h showed exactly 100% in multiple scenarios
like that (and yes, a little space sometimes appeared out of the blue too). So
my take is that it is something related to state (having the fd open) and
concurrency.

An interesting thing I've observed is that the per-directory AG affinity for
big directories (think $PGDATA) is lost when the AG is full; extents are then
allocated from different AGs (one can use xfs_bmap -vv to see the AG affinity
of allocated extents for the directory vs. the files in it).
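(A programmatic stand-in for that xfs_bmap -vv check is the generic FIEMAP
ioctl, sketched below with a placeholder path; dividing fe_physical by the AG
size reported by xfs_info tells you which AG an extent landed in.)

/*
 * Sketch: dump the physical extent offsets of a file via FIEMAP, roughly
 * the information xfs_bmap -vv reports.  The path is illustrative.
 */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/mnt/xfstest/big";
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Room for up to 256 extents; enough for a quick look. */
    size_t      sz = sizeof(struct fiemap) + 256 * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);

    if (fm == NULL)
        return 1;

    fm->fm_start = 0;
    fm->fm_length = ~0ULL;
    fm->fm_flags = FIEMAP_FLAG_SYNC;
    fm->fm_extent_count = 256;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
    {
        perror("FS_IOC_FIEMAP");
        return 1;
    }

    for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
        printf("extent %u: logical %llu physical %llu len %llu\n",
               i,
               (unsigned long long) fm->fm_extents[i].fe_logical,
               (unsigned long long) fm->fm_extents[i].fe_physical,
               (unsigned long long) fm->fm_extents[i].fe_length);

    free(fm);
    close(fd);
    return 0;
}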

> An extended cycle of 80 backends copying into relations and occasionally
> truncating them (to simulate the partitions being dropped and new ones
> created). For this I ran a 4TB filesystem very close to fully filled
> (peaking
> at 99.998 % full).
>

The only question I can think of is: how many files were involved there?
Maybe it is some kind of race between other (or the same) backends frequently
churning their fd caches with open()/close() [defeating speculative
preallocation], so that XFS ends up fragmented and only then does
posix_fallocate() run into trouble for larger allocations (>> 8 kB)? My take
is that if we send N write I/O vectors this seems to be handled fine, but one
big fallocate() is not; so maybe posix_fallocate() was in the process of
finding space while some other activity, like close(), happened to that inode.
But then that does not seem to match the pg_upgrade scenario.
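To make the comparison concrete, the two extension paths I mean look roughly
like the sketch below (illustrative only, not PostgreSQL's actual extension
code; the path and sizes are placeholders): one big posix_fallocate(), and on
failure the same range rewritten as a series of zero-filled pwritev() vectors
to see whether the vectored write path succeeds where the single large
preallocation did not.

/*
 * Sketch of the hypothesized comparison (illustrative only): try one big
 * posix_fallocate(), and if it fails, retry the same range as N write
 * vectors of zeroes.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ      8192
#define NBLOCKS     2048            /* 16 MB extension, made up */

static char zerobuf[BLCKSZ];        /* zero-initialized by the C runtime */

int
main(void)
{
    const char *path = "/mnt/xfstest/relation";    /* placeholder path */
    int         fd = open(path, O_CREAT | O_RDWR, 0600);
    off_t       start;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    start = lseek(fd, 0, SEEK_END);

    /* Path A: one big preallocation. */
    int rc = posix_fallocate(fd, start, (off_t) NBLOCKS * BLCKSZ);
    if (rc == 0)
    {
        fprintf(stderr, "big fallocate: ok\n");
    }
    else
    {
        fprintf(stderr, "big fallocate: %s, retrying as zero writes\n",
                strerror(rc));

        /* Path B: the same range as N write vectors of zeroes. */
        struct iovec iov[16];

        for (int i = 0; i < 16; i++)
        {
            iov[i].iov_base = zerobuf;
            iov[i].iov_len = BLCKSZ;
        }
        for (int blk = 0; blk < NBLOCKS; blk += 16)
        {
            ssize_t written = pwritev(fd, iov, 16,
                                      start + (off_t) blk * BLCKSZ);

            if (written < 0)
            {
                perror("pwritev");
                break;
            }
        }
    }
    close(fd);
    return 0;
}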

Well, IMHO we are stuck until Michael provides some more data (the patch
outcome, bpf output, and maybe other hints and tests).

-J.
