Re: FileFallocate misbehaving on XFS

From: Michael Harris <harmic(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-19 06:47:13
Message-ID: CADofcAXMX3OuPfbOU98v+nqGRxVWyUB+KrLs3LhPojgxTAntog@mail.gmail.com
Lists: pgsql-hackers

Hello,

I finally managed to get the patched version installed in a production
database where the error is occurring very regularly.

Here is a sample of the output:

2024-12-19 01:08:50 CET [2533222]: LOG: mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files:
1073741376, f_ffree: 1069933796
2024-12-19 01:08:50 CET [2533222]: ERROR: could not extend file
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" by 13 blocks,
from 110869 to 110882, using FileFallocate(): No space left on device
2024-12-19 01:08:51 CET [2533246]: LOG: mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 205004945, f_bavail: 205004945 f_files:
1073741376, f_ffree: 1069933796
2024-12-19 01:08:51 CET [2533246]: ERROR: could not extend file
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" by 14 blocks,
from 110965 to 110979, using FileFallocate(): No space left on device
2024-12-19 01:08:59 CET [2531320]: LOG: mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 204980672, f_bavail: 204980672 f_files:
1073741376, f_ffree: 1069933795
2024-12-19 01:08:59 CET [2531320]: ERROR: could not extend file
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" by 14 blocks,
from 111745 to 111759, using FileFallocate(): No space left on device
2024-12-19 01:09:01 CET [2531331]: LOG: mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 204970783, f_bavail: 204970783 f_files:
1073741376, f_ffree: 1069933795
2024-12-19 01:09:01 CET [2531331]: ERROR: could not extend file
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" by 12 blocks,
from 112045 to 112057, using FileFallocate(): No space left on device
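The LOG lines above come from a patched build that reports statfs counters whenever FileFallocate() fails with ENOSPC. Below is a minimal sketch of that diagnostic pattern in plain POSIX C. It is not the actual patch: fallocate_with_diag is a hypothetical name, and fstatvfs(3) stands in for whatever call the patch uses. Note that posix_fallocate(3) returns the error number instead of setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/statvfs.h>
#include <sys/types.h>

/*
 * Hypothetical helper: reserve `len` bytes at `offset` in `fd`.
 * posix_fallocate() returns an error number directly (it does not set
 * errno), so the result is checked explicitly. On ENOSPC, report the
 * filesystem's free-space counters so that a spurious failure can be
 * told apart from a genuinely full filesystem.
 */
static int
fallocate_with_diag(int fd, off_t offset, off_t len, const char *path)
{
    int rc;

    assert(fd >= 0 && len > 0);
    rc = posix_fallocate(fd, offset, len);

    if (rc == ENOSPC)
    {
        struct statvfs st;

        if (fstatvfs(fd, &st) == 0)
            fprintf(stderr,
                    "fallocate failing with ENOSPC: free space for filesystem "
                    "containing \"%s\" f_blocks: %llu, f_bfree: %llu, "
                    "f_bavail: %llu f_files: %llu, f_ffree: %llu\n",
                    path,
                    (unsigned long long) st.f_blocks,
                    (unsigned long long) st.f_bfree,
                    (unsigned long long) st.f_bavail,
                    (unsigned long long) st.f_files,
                    (unsigned long long) st.f_ffree);
    }
    return rc;    /* 0 on success, otherwise an errno value */
}
```

On a freshly created file, a 13-block (106496-byte) request should simply succeed. The statfs-style report is printed only on ENOSPC.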

I have attached a file containing all the errors I collected. The
error is happening regularly: over 400 times in a roughly 6-hour
period. The extension size varies from about 9 to 15 blocks, and each
time the statfs result shows plenty of available space and inodes (in
the samples above, roughly 7.6% of blocks and over 99% of inodes are
free). The errors do seem to come in bursts.
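The log reports statfs counts in filesystem blocks but does not print the block size. The sketch below converts the first sample into bytes and percentages. It assumes 4096-byte filesystem blocks (the XFS default; the actual f_frsize was not logged) and PostgreSQL's default 8192-byte BLCKSZ. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bytes covered by `blocks` filesystem blocks of `frsize` bytes each. */
static uint64_t
fs_bytes(uint64_t blocks, uint64_t frsize)
{
    return blocks * frsize;
}

/* Percentage of `part` relative to `whole`. */
static double
percent(uint64_t part, uint64_t whole)
{
    assert(whole > 0);
    return 100.0 * (double) part / (double) whole;
}

/* Bytes requested when extending a relation by `nblocks` pages of `blcksz`. */
static uint64_t
extend_bytes(uint32_t nblocks, uint32_t blcksz)
{
    return (uint64_t) nblocks * blcksz;
}

static void
report_first_sample(void)
{
    /* Figures from the first LOG line; 4096-byte fs blocks are ASSUMED. */
    const uint64_t f_blocks = 2683831808ULL, f_bfree = 205006167ULL;
    const uint64_t f_files = 1073741376ULL, f_ffree = 1069933796ULL;
    const uint64_t frsize = 4096;

    printf("free: %.1f GiB of %.1f GiB (%.2f%%)\n",
           fs_bytes(f_bfree, frsize) / 1073741824.0,
           fs_bytes(f_blocks, frsize) / 1073741824.0,
           percent(f_bfree, f_blocks));
    printf("free inodes: %.2f%%\n", percent(f_ffree, f_files));
    printf("requested: %llu bytes\n",
           (unsigned long long) extend_bytes(13, 8192));
}
```

Under those assumptions, the first sample works out to about 782 GiB free out of about 10,238 GiB (7.64%), with 99.65% of inodes free. That compares with a request of only 106496 bytes for the 13-block extension. The percentages do not depend on the assumed block size.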

This is a different system from the ones I previously provided logs
for. It also runs RHEL8, with a configuration similar to that of the
other system.

I have not yet installed the bpftrace tracing that Jakub suggested
earlier: as I said, this is a production machine, and I am wary of
triggering a kernel panic or worse (even though the risk of that seems
low?). While a kernel stack trace would no doubt help the XFS
developers, from a Postgres point of view, would it be likely to help
us decide what to do about this?

Cheers
Mike

On Tue, 17 Dec 2024 at 10:23, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Dec 16, 2024 at 12:52 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I don't see what we gain by requiring guesswork (what does allocating vs
> > zeroing mean, zeroing also allocates disk space after all) to interpret the
> > main error message. My experience is that it's often harder to get the DETAIL
> > than the actual error message (grepping becomes harder due to separate line,
> > terse verbosity is commonly used).
>
> I feel like the normal way that we do this is basically:
>
> could not {name of system call} file \"%s\": %m
>
> e.g.
>
> could not read file \"%s\": %m
>
> I don't know why we should do anything else in this type of case.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
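Applied to this case, the convention Robert describes would put the failing system call in the main message instead of in a DETAIL line. A minimal sketch under stated assumptions: format_syscall_error is a hypothetical helper, plain C has no %m, so strerror(err) takes its place, and "fallocate" is only an illustrative choice for the call name.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format "could not <syscall> file \"<path>\": <reason>".
 * PostgreSQL's ereport() expands %m itself; this plain-C sketch substitutes
 * strerror(err) explicitly. Returns the snprintf() result.
 */
static int
format_syscall_error(char *buf, size_t bufsz, const char *syscall,
                     const char *path, int err)
{
    assert(buf != NULL && bufsz > 0);
    return snprintf(buf, bufsz, "could not %s file \"%s\": %s",
                    syscall, path, strerror(err));
}
```

With ENOSPC on glibc, this yields a message of the form `could not fallocate file "pg_tblspc/...": No space left on device`, which stays greppable on a single line.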

Attachment: rhel8_fallocate_extended.log (application/octet-stream, 191.5 KB)
