Re: FileFallocate misbehaving on XFS

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Harris <harmic(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-10 00:31:18
Message-ID: 6m3j6rsbngcma45ckox3msfgbn2jjspkqau5bma2pq4l5nolni@2umtkdghgavf
Lists: pgsql-hackers

Hi,

On 2024-12-10 10:00:43 +1100, Michael Harris wrote:
> On Mon, 9 Dec 2024 at 21:06, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > Sounds more like an XFS bug/behavior, so it's not clear to me what we
> > could do about it. I mean, if the filesystem reports bogus out-of-space,
> > is there even something we can do?
>
> I don't disagree that it's most likely an XFS issue. However, XFS is
> pretty widely used - it's the default FS for RHEL & the default in
> SUSE for non-root partitions - so maybe some action should be taken.
>
> Some things we could consider:
>
> - Providing a way to configure PG not to use posix_fallocate at runtime
>
> - Detecting the use of XFS (probably nasty and complex to do in a
> platform independent way) and disable posix_fallocate
>
> - In the case of posix_fallocate failing with ENOSPC, fall back to
> FileZero (worst case that will fail as well, in which case we will
> know that we really are out of space)
>
> - Documenting that XFS might not be a good choice, at least for some
> kernel versions

Pretty unexcited about all of these - XFS is fairly widely used for PG, but
this problem doesn't seem very common. It seems to me that we're missing
something that causes this to happen only in a small subset of cases.

I think the source of this needs to be debugged further before we try to apply
workarounds in postgres.

Are you using any filesystem quotas?

It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
xfs_spaceman -c 'freesp -s' /mountpoint
xfs_spaceman -c 'health' /mountpoint
and df.

What kind of storage is this on?

Was the filesystem ever grown from a smaller size?

Have you checked the filesystem's internal consistency? I.e. something like
xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
unmounted though. But corrupted filesystem datastructures certainly could
cause spurious ENOSPC.

> > What is not clear to me is why would this affect pg_upgrade at all. We
> > have the data files split into 1GB segments, and the copy/clone/... goes
> > one by one. So there shouldn't be more than 1GB "extra" space needed.
> > Surely you have more free space on the system?
>
> Yes, that also confused me. It actually fails during the schema
> restore phase - where pg_upgrade calls pg_restore to restore a
> schema-only dump that it takes earlier in the process. At this stage
> it is only trying to restore the schema, not any actual table data.
> Note that we use the --link option to pg_upgrade, so it should not be
> using much space even when the table data is being upgraded.

Are you using pg_upgrade -j?

I'm asking because, looking at Linux's git tree, I found this interesting
recent commit: https://git.kernel.org/linus/94a0333b9212 - but IIUC it'd
cause file creation, not fallocate, to fail.

> The filesystems have >1TB free space when this has occurred.
>
> It does continue to give this error after the upgrade, at apparently
> random intervals, when data is being loaded into the DB using COPY
> commands, so it might be best not to focus too much on the fact that
> we first encounter it during the upgrade.

I assume the file that actually errors out changes over time? It's always
fallocate() that fails?

Can you tell us anything about the workload / data? Lots of tiny tables, lots
of big tables, write heavy, ...?

Greetings,

Andres Freund
