Re: FileFallocate misbehaving on XFS

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Michael Harris <harmic(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-11 12:05:21
Message-ID: CAKZiRmzO=ZetYm2xvXJVmmiSeyZJcwC9oHYEwSjsV7ifT4cn=g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 11, 2024 at 4:00 AM Michael Harris <harmic(at)gmail(dot)com> wrote:

> Hi Jakub
>
> On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
> <jakub(dot)wartak(at)enterprisedb(dot)com> wrote:

[..]

>
> > 3. Maybe somehow there is a bigger interaction between posix_fallocate()
> > and delayed XFS's dynamic speculative preallocation from many processes all
> > writing into different partitions ? Maybe try "allocsize=1m" mount option
> > for that /fs and see if that helps. I'm going to speculate about XFS
> > speculative :) pre allocations, but if we have fdcache and are *not*
> > closing fds, how XFS might know to abort its own speculation about
> > streaming write ? (multiply that up to potentially the number of opened fds
> > to get an avalanche of "preallocations").
>
> I will try to organize that. They are production systems so it might
> take some time.
>

Cool.

> > 4. You can also try compiling with patch from Alvaro from [2]
> > "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up
> > having more clarity in offsets involved. If not then you could use 'strace
> > -e fallocate -p <pid>' to get the exact syscall.
>
> I'll take a look at Alvaro's patch. strace sounds good, but how to
> arrange to start it on the correct PG backends? There will be a
> large-ish number of PG backends going at a time, only some of which
> are performing imports, and they will be coming and going every so
> often as the ETL application scales up and down with the load.
>

Yes, that does sound like mission impossible. Is there any chance you can get
the reproduction down to one, or a small number of, postgres backends doing
the writes?
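
If that is hard to arrange, a stand-alone stand-in for the write pattern may
be easier to trace. Below is a hypothetical sketch only (not your ETL
workload -- the directory, process count, chunk size and iteration count are
all made up) that forks a few processes which each keep extending their own
file with posix_fallocate() in 8 kB-block chunks, so strace / bpftrace can be
attached to a known, small set of PIDs:

/*
 * repro_fallocate.c -- hypothetical sketch, not the real workload.
 * NPROCS children each repeatedly extend their own file on the affected
 * XFS filesystem via posix_fallocate(); adjust the knobs below to taste.
 */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define DIRPATH      "/path/to/xfs/testdir"	/* assumption: adjust */
#define NPROCS       8
#define CHUNK_BLOCKS 16						/* blocks per extension */
#define BLCKSZ       8192
#define ITERATIONS   8192

static void
worker(int id)
{
	char		path[256];
	off_t		size = 0;
	int			fd;

	snprintf(path, sizeof(path), "%s/file.%d", DIRPATH, id);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
	{
		perror("open");
		_exit(1);
	}

	for (int i = 0; i < ITERATIONS; i++)
	{
		/* posix_fallocate() returns the error number, it does not set errno */
		int			rc = posix_fallocate(fd, size, (off_t) CHUNK_BLOCKS * BLCKSZ);

		if (rc != 0)
		{
			fprintf(stderr, "pid %d: posix_fallocate at offset %lld: %s\n",
					(int) getpid(), (long long) size, strerror(rc));
			_exit(1);
		}
		size += (off_t) CHUNK_BLOCKS * BLCKSZ;
	}
	_exit(0);
}

int
main(void)
{
	for (int i = 0; i < NPROCS; i++)
		if (fork() == 0)
			worker(i);
	while (wait(NULL) > 0)
		;
	return 0;
}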

>
> > 5. Another idea could be catching the kernel side stacktrace of
> > fallocate() when it is hitting ENOSPC. E.g. with XFS fs and attached
> > bpftrace eBPF tracer I could get the source of the problem in my artificial
> > reproducer, e.g
>
> OK, I will look into that also.
>
>
Hopefully that reveals some more. Unfortunately, UNIX error reporting lumps a
whole pile of different failures into the single ENOSPC category, which is not
helpful at all (inode, extent and block allocation problems are all squeezed
into one error).

Anyway, in case it helps others, here are my notes so far on this thread,
including that useful file from the subthread; hopefully I have not
misinterpreted anything:

- works on < PG16, but fails on >= PG16, which uses posix_fallocate() rather
than multiple separate (but adjacent) iovectors passed to pg_pwritev(); the
fallocate path is taken only when mdzeroextend() is called with numblocks > 8
(see the first sketch after these notes)
- 179k or 414k files in a single directory (0.3-0.5 s just to list them)
- OS/FS upgraded from an earlier release
- one AG with extremely small free extents compared to the other AGs (I bet
the 2-3 bucket at 22.73% below corresponds to small 8192-byte PG files in
$PGDATA, but there are no large free extents in that AG):
   from      to extents  blocks    pct
      1       1    4949    4949   0.65
      2       3   86113  173452  22.73
      4       7   19399   94558  12.39
      8      15   23233  248602  32.58
     16      31   12425  241421  31.64
total free extents 146119
total free blocks 762982
average free extent size 5.22165 (!)
- note that the maximum free extent size above (31 blocks) is very low
compared to the other AGs, which have 1024-8192; therefore it looks like there
are no contiguous runs available for request sizes above 31*4096 = 126976
bytes within that AG (??)
- we have the logic `extend_by_pages += extend_by_pages * waitcount;`, capped
at a maximum of 64 PG blocks (which is larger than the above); see the second
sketch after these notes
- but the failures were also observed with pg_upgrade --link -j / pg_restore
-j (also concurrent posix_fallocate() calls on many independent files sharing
the same AG, but that is 1 backend : 1 file, so no contention driving
waitcount in RelationAddBlocks())
- so maybe it's lots of backends doing independent concurrent
posix_fallocate() calls that somehow end up coalesced? Or, hypothetically, say
16-32 fallocate() calls hit the same AG at once; maybe it's some form of
concurrency race inside XFS where one of the fallocate calls fails to find
space in that one AG, although according to [1] it should fall back to other
AGs.
- and there's also XFS's dynamic speculative preallocation, which might add
extra space pressure during our normal writes.
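
To make the first note above a bit more concrete, the >= PG16 extension
behaviour boils down to roughly the following. This is a paraphrased sketch
against a plain file descriptor, not the actual md.c/fd.c code -- the function
name zero_extend() is illustrative, and only the 8-block threshold and the
posix_fallocate()-vs-zero-writes split come from the notes above:

#define _GNU_SOURCE				/* for pwritev() on glibc */
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Returns 0 on success, an errno value on failure. */
static int
zero_extend(int fd, off_t offset, int numblocks)
{
	if (numblocks > 8)
	{
		/*
		 * One posix_fallocate() for the whole range -- this is the call
		 * that comes back with ENOSPC on the affected XFS filesystems.
		 */
		return posix_fallocate(fd, offset, (off_t) numblocks * BLCKSZ);
	}
	else
	{
		/*
		 * Otherwise write real zeros using a vector of adjacent
		 * zero-filled buffers (pg_pwritev() in the backend); short
		 * writes are ignored here for brevity.
		 */
		static const char zerobuf[BLCKSZ];
		struct iovec iov[8];

		for (int i = 0; i < numblocks; i++)
		{
			iov[i].iov_base = (void *) zerobuf;
			iov[i].iov_len = BLCKSZ;
		}
		if (pwritev(fd, iov, numblocks, offset) < 0)
			return errno;
		return 0;
	}
}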
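
And the waiter-based scaling from the extend_by_pages note, in isolation --
again only a sketch with an illustrative function name; the formula and the
64-block cap are the ones quoted above, everything else is made up:

/*
 * Sketch of how the requested extension size grows with the number of
 * waiting backends, capped at 64 blocks (64 * 8192 = 512 kB per call).
 */
static int
clamp_extend_by_pages(int extend_by_pages, int waitcount)
{
	extend_by_pages += extend_by_pages * waitcount;
	if (extend_by_pages > 64)
		extend_by_pages = 64;
	return extend_by_pages;
}

E.g. a base request of 8 pages with 3 waiters becomes 8 + 8*3 = 32 pages
(256 kB); anything beyond 64 pages is clamped to 64 (512 kB), which is still
well above the ~124 kB of the largest free extent in the degraded AG.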

Another workaround idea/test: create a tablespace on the same XFS filesystem
(but in a different directory if possible) and see if it still fails.

-J.

[1] - https://blogs.oracle.com/linux/post/extent-allocation-in-xfs
