Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Kenneth Marshall <ktm(at)rice(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Larry Rosenman <ler(at)lerctr(dot)org>, Pgsql hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Strange issue with NFS mounted PGDATA on ugreen NAS
Date: 2025-01-01 05:21:34
Message-ID: CA+hUKGKo3Q=gjqdvfxn2Pd0RNBd5Xd66hBjiv=5pWXfdWs=jAQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 1, 2025 at 1:20 PM Kenneth Marshall <ktm(at)rice(dot)edu> wrote:
> On Tue, Dec 31, 2024 at 06:58:14PM -0500, Tom Lane wrote:
> > Larry Rosenman <ler(at)lerctr(dot)org> writes:
> > > On 12/31/2024 5:37 pm, Tom Lane wrote:
> > >> Do you know what its underlying file system is?
> >
> > > btrfs

> Maybe there are some btrfs or nfs options that can be used to mitigate
> this effect. Otherwise, a bug report to Debian would be in order, I guess.

Mount option readdirsize on the client side should hide the problem up
to some size you choose, but you can't set it large enough for high
numbers of relations/forks/segments.

Guessing what is happening here: I suspect BTRFS might have
positional offsets 1, 2, 3, ... for directory entries' d_off (the
value visible in struct dirent, used for telldir(), seekdir(), and
NFS's behind-the-curtain paging scheme), and they might slide when you
unlink stuff. Perhaps not immediately, but when the directory fd is
closed on the NFS server (nearly immediately, I guess: given the
stateless nature of NFS, it doesn't matter that the client still has
its directory fd open). That would explain how you finished up with so
many missed files.
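
For anyone who wants to look at the positions their own filesystem hands
out, here is a minimal sketch, assuming a POSIX system with telldir().
The function name print_dir_positions is made up for illustration; the
values printed are opaque and filesystem-defined, so this only shows
what they look like and proves nothing about stability by itself.

```c
#define _XOPEN_SOURCE 700		/* for telldir() under strict -std= modes */
#include <dirent.h>
#include <stdio.h>

/*
 * Hypothetical helper: print each entry name together with the directory
 * position reported by telldir() right after reading that entry.
 * Returns the number of entries seen, or -1 if the directory cannot be
 * opened.
 */
static int
print_dir_positions(const char *path)
{
	DIR		   *dir = opendir(path);
	struct dirent *de;
	int			n = 0;

	if (dir == NULL)
		return -1;

	while ((de = readdir(dir)) != NULL)
	{
		/* Position to resume from after this entry; opaque to the caller. */
		long		pos = telldir(dir);

		printf("%-30s resume position = %ld\n", de->d_name, pos);
		n++;
	}
	closedir(dir);
	return n;
}
```

Running it on the same directory before and after unlinking a few files
(locally on the server's filesystem) would show whether the positions of
the surviving entries move.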

I think XFS's d_off points to the next entry in a btree leaf page
scan, which sounds a lot more stable... until someone else unlinks the
next item underneath you and/or the system decides to compact stuff,
who knows... And other systems have other schemes based on hashes or
raw offsets, with different degrees of stability and anomalies (cf
ELOOP for hash collisions).

NFS is at least supposed to tell the client that its cookie has been
invalidated with a cookie-invalidation-cookie called cookieverf. But
there isn't any specified way to recover. FreeBSD's client looks like
it might try to, but I'm not sure whether Linux's server even
implements it.

Anyway, I'll write a patch to change rmtree() to buffer the names in
memory. In theory there could be hundreds of gigabytes of pathnames,
so perhaps I should do it in batches; I'll look into that.
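
To make the idea concrete, here is a rough sketch of "read everything
first, then unlink", assuming plain POSIX calls. It is not the actual
patch and not PostgreSQL code: unlink_dir_contents_buffered is a
hypothetical name, it handles a single directory level only, and it
buffers all names at once rather than in batches.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: unlink every non-directory entry in "path".
 * All names are collected before the first unlink(), so the directory
 * is never modified while a scan is in progress.  Returns the number of
 * entries removed, or -1 on open or allocation failure.
 */
static int
unlink_dir_contents_buffered(const char *path)
{
	DIR		   *dir = opendir(path);
	struct dirent *de;
	char	  **names = NULL;
	size_t		nnames = 0;
	size_t		cap = 0;
	int			removed = 0;

	if (dir == NULL)
		return -1;

	/* Pass 1: read the whole directory without changing it. */
	while ((de = readdir(dir)) != NULL)
	{
		size_t		len = strlen(de->d_name) + 1;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		if (nnames == cap)
		{
			size_t		newcap = cap ? cap * 2 : 64;
			char	  **tmp = realloc(names, newcap * sizeof(char *));

			if (tmp == NULL)
			{
				removed = -1;
				break;
			}
			names = tmp;
			cap = newcap;
		}
		names[nnames] = malloc(len);
		if (names[nnames] == NULL)
		{
			removed = -1;
			break;
		}
		memcpy(names[nnames], de->d_name, len);
		nnames++;
	}
	closedir(dir);

	/* Pass 2: unlink by name (skipped entirely if pass 1 failed). */
	for (size_t i = 0; i < nnames; i++)
	{
		if (removed >= 0)
		{
			char		full[4096];
			int			n = snprintf(full, sizeof(full), "%s/%s", path, names[i]);

			/* Subdirectories and over-long paths are simply not counted. */
			if (n > 0 && (size_t) n < sizeof(full) && unlink(full) == 0)
				removed++;
		}
		free(names[i]);
	}
	free(names);
	return removed;
}
```

A batched variant would cap the number of buffered names, close the
directory, unlink that batch, and then rescan from the start until
nothing is left; it needs some care so that entries which cannot be
unlinked don't make it loop forever.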
