From: Andres Freund <andres(at)anarazel(dot)de>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Michael Harris <harmic(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FileFallocate misbehaving on XFS
Date: 2024-12-11 23:50:25
Message-ID: nq4ayqhjmipxahpjtj6jqog3hlk5mfztpvvax62rrmpjjlblrt@42gcpw2cldhv
Lists: pgsql-hackers
Hi,
FWIW, I tried fairly hard to reproduce this.
I ran an extended cycle of 80 backends copying into relations and occasionally
truncating them (to simulate partitions being dropped and new ones created).
For this I kept a 4TB filesystem very close to full (peaking at 99.998% full).
I did not see any ENOSPC errors unless the filesystem really was full at that
time. To check that, I made mdzeroextend() do a statfs() when encountering
ENOSPC, printed statfs.f_blocks and made that case PANIC.
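In case it's useful, the diagnostic boils down to something like the following
standalone sketch (using posix_fallocate() directly rather than the actual
mdzeroextend()/FileFallocate() code path, and just reporting the statfs()
numbers when ENOSPC is hit):

/*
 * Sketch only: check whether the filesystem really is full when
 * posix_fallocate() reports ENOSPC.  Not the actual md.c change.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/vfs.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";
	int			fd = open(path, O_CREAT | O_RDWR, 0600);
	int			err;

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* try to reserve 1GB, the way the fallocate extension path would */
	err = posix_fallocate(fd, 0, (off_t) 1024 * 1024 * 1024);
	if (err == ENOSPC)
	{
		struct statfs sf;

		/* report how much space the filesystem claims to have left */
		if (fstatfs(fd, &sf) == 0)
			fprintf(stderr,
					"ENOSPC: f_blocks=%llu f_bfree=%llu f_bavail=%llu\n",
					(unsigned long long) sf.f_blocks,
					(unsigned long long) sf.f_bfree,
					(unsigned long long) sf.f_bavail);
		abort();				/* stand-in for the PANIC in the test */
	}
	else if (err != 0)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

	close(fd);
	return 0;
}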
What I do see is that after - intentionally - hitting an out-of-disk-space
error, the available disk space would occasionally increase by a small amount
after a few seconds, regardless of whether the fallocate or the non-fallocate
path was used.
From what I can tell this small increase in free space has a few reasons:
- Checkpointer might not have gotten around to unlinking files, keeping the
inode alive.
- Occasionally bgwriter or a backend would still have an already-unlinked
relation segment open, so the inode (not the actual file space, because the
segment is truncated to prevent that) could not yet be removed from the
filesystem.
- It looks like xfs does some small amount of work to reclaim space in the
background, which makes sense; otherwise each unlink would have to flush to
disk.
But that's nowhere near enough space to explain what you're seeing. The most
I've seen was 6MB, when ramping up the truncation frequency a lot.
Of course this was on a newer kernel, not on RHEL / RL 8/9.
Just to make sure - you're absolutely certain that you actually have space at
the time of the errors? E.g. a checkpoint that completes soon after an ENOSPC
could free up a lot of space by removing now-unneeded WAL files. That can be
hundreds of gigabytes.
If I were to provide you with a patch that showed the amount of free disk
space at the time of an error, the size of the relation etc, could you
reproduce the issue with it applied? Or is that unrealistic?
On 2024-12-11 13:05:21 +0100, Jakub Wartak wrote:
> - one AG with extremely low extent sizes compared to the other AGs (I bet
> that 2->3 22.73% below means small 8192b pg files in $PGDATA, but there are
> no large extents in that AG)
>    from      to  extents   blocks     pct
>       1       1     4949     4949    0.65
>       2       3    86113   173452   22.73
>       4       7    19399    94558   12.39
>       8      15    23233   248602   32.58
>      16      31    12425   241421   31.64
> total free extents 146119
> total free blocks 762982
> average free extent size 5.22165 (!)
Note that this does not mean that all extents in the AG are that small, just
that the *free* extents are of that size.
I think this might primarily be because this AG has the smallest amount of
free blocks (2.9GB). However, the fact that it *does* have less could be
interesting. It might be the AG associated with the directory for the busiest
database or such.
The next least-space AG is:
   from      to  extents   blocks     pct
      1       1     1021     1021    0.10
      2       3    48748    98255   10.06
      4       7     9840    47038    4.81
      8      15    13648   146779   15.02
     16      31    15818   323022   33.06
     32      63      584    27932    2.86
     64     127      147    14286    1.46
    128     255      253    49047    5.02
    256     511      229    87173    8.92
    512    1023      139   102456   10.49
   1024    2047       51    72506    7.42
   2048    4095        3     7422    0.76
total free extents 90481
total free blocks 976937
It seems plausible it would look similar if more of the free blocks were used.
> - we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped
> at a maximum of 64 pg blocks (and that's higher than the above)
> - but the failures were also observed using pg_upgrade --link -j/pg_restore
> -j (also concurrent posix_fallocate() to many independent files sharing the
> same AG, but that's 1 backend:1 file so no contention for waitcount in
> RelationAddBlocks())
We also extend by more than one page, even without concurrency, if
bulk-insertion is used, and I think we do use that for e.g. pg_attribute,
which is actually the table where pg_restore encountered the issue:
pg_restore: error: could not execute query: ERROR: could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device
1249 is the initial relfilenode for pg_attribute.
There could also be some parallelism leading to bulk extension, due to the
parallel restore. I don't remember which commands pg_restore actually executes
in parallel.
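For reference, the waiter-based scaling the quoted item refers to amounts to
something like this (a simplified sketch, not the actual bufmgr.c code; only
the quoted formula and the 64-page cap are taken from above):

/*
 * Simplified sketch of how the number of pages to extend by grows with the
 * number of other backends waiting on the relation extension lock.
 * Illustrative only; the real logic lives in PostgreSQL's bufmgr.c.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_EXTEND_BY_PAGES 64	/* cap quoted above: 64 pg blocks */

static uint32_t
scale_extend_by(uint32_t extend_by_pages, uint32_t waitcount)
{
	/* assume each waiter needs roughly as many pages as we do */
	extend_by_pages += extend_by_pages * waitcount;

	/* but never extend by more than the cap in one go */
	if (extend_by_pages > MAX_EXTEND_BY_PAGES)
		extend_by_pages = MAX_EXTEND_BY_PAGES;

	return extend_by_pages;
}

int
main(void)
{
	printf("%u\n", scale_extend_by(1, 0));	/* no waiters: 1 */
	printf("%u\n", scale_extend_by(8, 3));	/* 3 waiters: 32 */
	printf("%u\n", scale_extend_by(8, 20)); /* many waiters: capped at 64 */
	return 0;
}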
Greetings,
Andres Freund