Re: FileFallocate misbehaving on XFS

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Harris <harmic(at)gmail(dot)com>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-12 21:38:02
Message-ID: vgh2m75nh6s53diujmooue2y6eon3jdriildwqnnoa4okmizmx@3pcg7pa5qubx
Lists: pgsql-hackers

Hi,

On 2024-12-12 14:14:20 +1100, Michael Harris wrote:
> On Thu, 12 Dec 2024 at 10:50, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Just to make sure - you're absolutely certain that you actually have space at
> > the time of the errors?
>
> As sure as I can be. The RHEL8 system that I took prints from
> yesterday has > 1.5TB free. I can't see it varying by that much.

That does seem unlikely, but it'd probably still be worth monitoring by how
much it varies.
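
A minimal sketch of what such monitoring could look like, assuming a small
standalone sampler run alongside the server (the names free_range and
sample_free_space are made up for illustration, they are not part of
PostgreSQL):

```c
#include <sys/statvfs.h>

/* Hypothetical accumulator for the observed range of free space. */
struct free_range
{
    unsigned long long min_bytes;
    unsigned long long max_bytes;
    unsigned int samples;
};

/*
 * Record the space currently available to unprivileged users on the
 * filesystem containing path. Returns 0 on success, -1 if statvfs() fails
 * (in which case the accumulator is left untouched).
 */
static int
sample_free_space(const char *path, struct free_range *r)
{
    struct statvfs vfs;
    unsigned long long avail;

    if (statvfs(path, &vfs) != 0)
        return -1;

    /* f_bavail is counted in units of f_frsize */
    avail = (unsigned long long) vfs.f_bavail * vfs.f_frsize;

    if (r->samples == 0 || avail < r->min_bytes)
        r->min_bytes = avail;
    if (r->samples == 0 || avail > r->max_bytes)
        r->max_bytes = avail;
    r->samples++;
    return 0;
}
```

Calling this every second or so against the data directory and logging
max_bytes - min_bytes would show how far free space swings between samples.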

> It does look as though the system needs to be quite full to provoke
> this problem. The systems I have looked at so far have >90% full
> filesystems.
>
> Another interesting snippet: the application has a number of ETL
> workers going at once. The actual number varies depending on a number
> of factors but might be somewhere from 10 - 150. Each worker will have
> a single postgres backend that they are feeding data to.

Are they all inserting into distinct tables/partitions or into shared tables?

> At the time of the error, it is not the case that all ETL workers
> strike it at once - it looks like a lot of the time only a single
> worker is affected, or at most a handful of workers. I can't see for
> sure what the other workers were doing at the time, but I would expect
> they were all importing data as well.

When you say that they're not "all striking it at once", do you mean that some
of them aren't interacting with the database at the time, or that they're not
erroring out?

> > If I were to provide you with a patch that showed the amount of free disk
> > space at the time of an error, the size of the relation etc, could you
> > reproduce the issue with it applied? Or is that unrealistic?
>
> I have not been able to reproduce it on demand, and so far it has only
> happened in production systems.
>
> As long as the patch doesn't degrade normal performance it should be
> possible to deploy it to one of the systems that is regularly
> reporting the error, although it might take a while to get approval to
> do that.

Cool. The patch only has an effect in the branches reporting out-of-space
errors, so there's no overhead during normal operation. And the additional
detail doesn't have much overhead in the error case either.
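
To illustrate the shape of that (a rough sketch only, not the code in the
attached patches; describe_enospc and extend_with_detail are hypothetical
names), the extra system calls sit entirely inside the ENOSPC branch:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/statvfs.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format filesystem free space and current file size
 * for fd into buf. Returns 0 on success, -1 if fstatvfs() or fstat() fails.
 */
static int
describe_enospc(int fd, char *buf, size_t buflen)
{
    struct statvfs vfs;
    struct stat st;

    if (fstatvfs(fd, &vfs) != 0 || fstat(fd, &st) != 0)
        return -1;

    snprintf(buf, buflen,
             "free blocks: %llu of %llu (block size %lu), file size: %lld bytes",
             (unsigned long long) vfs.f_bavail,
             (unsigned long long) vfs.f_blocks,
             (unsigned long) vfs.f_frsize,
             (long long) st.st_size);
    return 0;
}

/*
 * Extend fd by len bytes starting at offset. posix_fallocate() returns an
 * error number directly rather than setting errno. The success path does
 * no additional work; detail is gathered only when ENOSPC is returned.
 */
static int
extend_with_detail(int fd, off_t offset, off_t len)
{
    int rc = posix_fallocate(fd, offset, len);

    if (rc == ENOSPC)
    {
        char detail[256];

        if (describe_enospc(fd, detail, sizeof(detail)) == 0)
            fprintf(stderr, "could not extend file: %s (%s)\n",
                    strerror(rc), detail);
        else
            fprintf(stderr, "could not extend file: %s\n", strerror(rc));
    }
    return rc;
}
```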

I attached separate patches for 16, 17 and master, as there are some minor
conflicts between the versions.

Greetings,

Andres Freund

Attachment                                                             Content-Type  Size
16-0001-md-Report-more-detail-when-encountering-ENOSPC-durin.patch     text/x-diff   5.7 KB
17-0001-md-Report-more-detail-when-encountering-ENOSPC-durin.patch     text/x-diff   5.7 KB
HEAD-0001-md-Report-more-detail-when-encountering-ENOSPC-durin.patch   text/x-diff   5.7 KB