From: Michael Harris <harmic(at)gmail(dot)com>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-11 02:59:52
Message-ID: CADofcAXLq1mC3N5CdCAoZdD_HO1NVdYNsUAQVU_G4seP1XqbuQ@mail.gmail.com
Lists: pgsql-hackers
Hi Jakub
On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
<jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
> Yay, reflink=0, that's a pretty old fs?!
This particular filesystem was created on CentOS 7, and retained when
the system was upgraded to RL9. So yes, probably pretty old!
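
(For anyone wanting to check their own filesystem: xfs_info reports the
mkfs-time feature flags, including reflink, for a mounted XFS; the path
below is just a placeholder for the tablespace mount point.)

# xfs_info /path/to/tablespace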
> Could you get us maybe those below commands too? (or from any other directory exhibiting such errors)
>
> stat pg_tblspc/16401/PG_16_202307071/17643/
> ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l
> time ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l # to assess the timing of the getdents() call, as that may indirectly tell us something about that directory
# stat pg_tblspc/16402/PG_16_202307071/49163/
File: pg_tblspc/16402/PG_16_202307071/49163/
Size: 5177344 Blocks: 14880 IO Block: 4096 directory
Device: fd02h/64770d Inode: 4299946593 Links: 2
Access: (0700/drwx------) Uid: ( 26/postgres) Gid: ( 26/postgres)
Access: 2024-12-11 09:39:42.467802419 +0900
Modify: 2024-12-11 09:51:19.813948673 +0900
Change: 2024-12-11 09:51:19.813948673 +0900
Birth: 2024-11-25 17:37:11.812374672 +0900
# time ls -1 pg_tblspc/16402/PG_16_202307071/49163/ | wc -l
179000
real 0m0.474s
user 0m0.439s
sys 0m0.038s
> 3. Maybe somehow there is a bigger interaction between posix_fallocate() and XFS's delayed dynamic speculative preallocation from many processes all writing into different partitions? Maybe try the "allocsize=1m" mount option for that fs and see if that helps. I'm going to speculate about XFS's speculative :) preallocations, but if we have an fd cache and are *not* closing fds, how might XFS know to abort its own speculation about a streaming write? (Multiply that by the number of open fds to get an avalanche of "preallocations".)
I will try to organize that. They are production systems, so it might
take some time.
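
(My untested understanding is that it would look something like the
below; device and mount point are placeholders, and I assume an
unmount/mount cycle rather than relying on remount picking it up.)

# umount /path/to/tablespace
# mount -o allocsize=1m /dev/mapper/vg-tblspc /path/to/tablespace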
> 4. You can also try compiling with patch from Alvaro from [2] "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up having more clarity in offsets involved. If not then you could use 'strace -e fallocate -p <pid>' to get the exact syscall.
I'll take a look at Alvaro's patch. strace sounds good, but how would I
arrange to start it on the correct PG backends? There will be a
largish number of PG backends running at a time, only some of which
are performing imports, and they will be coming and going as the ETL
application scales up and down with the load.
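
One idea (an untested sketch, not something I have run yet): attach
strace to the postmaster with -f so it follows each backend as it is
forked; the postmaster PID is the first line of postmaster.pid. The
log path and $PGDATA below are placeholders, and with -f the traced
children are identified by pid prefixes in the log.

# strace -f -e trace=fallocate -o /tmp/fallocate.log -p "$(head -n 1 "$PGDATA/postmaster.pid")"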
> 5. Another idea could be catching the kernel-side stack trace of fallocate() when it hits ENOSPC. E.g. with an XFS fs and the attached bpftrace eBPF tracer I could get at the source of the problem in my artificial reproducer, e.g.:
OK, I will look into that also.
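
Presumably something along the lines of this sketch? (My own guess at
the shape of such a tracer, since I don't have your attachment in
front of me; the probe point xfs_file_fallocate is an assumption and
may vary between kernel versions.)

# bpftrace -e '
kretprobe:xfs_file_fallocate
/retval == -28/    // -28 == -ENOSPC
{
    printf("fallocate ENOSPC: pid %d comm %s\n%s\n", pid, comm, kstack);
}'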
Cheers
Mike