Fwd: Re: A new look at old NFS readdir() problems?

From: Larry Rosenman <ler(at)lerctr(dot)org>
To: Pgsql hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <tmunro(at)freebsd(dot)org>
Subject: Fwd: Re: A new look at old NFS readdir() problems?
Date: 2025-01-02 20:26:28
Message-ID: 04cf05d053e9320012b32370e228fac4@lerctr.org
Lists: pgsql-hackers

@Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on
my issue.

-------- Original Message --------
Subject: Re: A new look at old NFS readdir() problems?
Date: 01/02/2025 10:08 am
From: Rick Macklem <rick(dot)macklem(at)gmail(dot)com>
To: Thomas Munro <tmunro(at)freebsd(dot)org>
Cc: Rick Macklem <rmacklem(at)freebsd(dot)org>, Larry Rosenman <ler(at)lerctr(dot)org>

On Thu, Jan 2, 2025 at 2:50 AM Thomas Munro <tmunro(at)freebsd(dot)org> wrote:
>
> Hi Rick
> CC: ler
>
> I hope you don't mind me reaching out directly, I just didn't really
> want to spam existing bug reports without sufficient understanding to
> actually help yet... but I figured I should get in touch and see if
> you have any clues or words of warning, since you've worked on so much
> of the NFS code. I'm a minor FBSD contributor and interested in file
> systems, but not knowledgeable about NFS; I run into/debug/report a
> lot of file system bugs on a lot of systems in my day job on
> databases. I'm interested to see if I can help with this problem.
> Existing ancient report and interesting email:
>
> https://lists.freebsd.org/pipermail/freebsd-fs/2014-October/020155.html
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=57696
>
> What we ran into is not the "bad cookie" state, which doesn't really
> seem to be recoverable in general, from what I understand (though the
> FreeBSD code apparently would try, huh). It's a simple case where the
> NFS client requests a whole directory with a large READDIR request,
> and then tries to unlink all the files in a traditional
> while-readdir()-unlink() loop that works on other systems.
In general, NFS is not a POSIX-compliant file system, due to its protocol
design. The above is one example. The only "safe" way is to opendir() or
rewinddir() after every removal.

The above usually works (and always worked for UFS long ago) because
the directory offset cookies for subsequent entries in the directory
after the entry unlinked happened to "still be valid". That is no longer
true for FreeBSD's UFS nor for many other file systems that can be exported.

If the client reads the entire directory in one READDIR, then it is fine,
since it has no need for the directory offset cookies. However, there is
a limit to how much a single READDIR can return (these days, for
NFSv4.1/4.2, it could be raised to just over 1 Mbyte, but FreeBSD limits
it to 8K at the moment).

Another way to work around the problem is to read the entire directory
into the client via READDIRs before starting to do the unlinks.
The opendir()/readdir() code in libc could be hacked to do that,
but I have never tried to push such a patch into FreeBSD.
(It would be limited by how much memory can be malloc()'d, which
is pretty generous compared to even large directories with tens of
thousands of entries.)
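
As an illustration of the read-everything-first workaround described above,
here is a minimal sketch done in application code rather than in libc. The
function name remove_all_snapshot() is hypothetical, and the sketch assumes
the directory holds only plain files that fit in memory as a list of names.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: snapshot every name in `path` first, then unlink.
 * Returns the number of entries removed, or -1 on error.
 */
int
remove_all_snapshot(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *dp;
    char **names = NULL;
    size_t n = 0, cap = 0, i;
    int removed = 0, failed = 0;

    if (dir == NULL)
        return -1;

    /* Phase 1: read the whole directory; nothing is removed yet. */
    while ((dp = readdir(dir)) != NULL) {
        if (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0)
            continue;
        if (n == cap) {
            size_t ncap = cap ? cap * 2 : 64;
            char **tmp = realloc(names, ncap * sizeof(*names));

            if (tmp == NULL) {
                failed = 1;
                break;
            }
            names = tmp;
            cap = ncap;
        }
        if ((names[n] = strdup(dp->d_name)) == NULL) {
            failed = 1;
            break;
        }
        n++;
    }

    /* Phase 2: unlink from the in-memory list; no further readdir() calls. */
    for (i = 0; i < n; i++) {
        if (!failed && unlinkat(dirfd(dir), names[i], 0) == 0)
            removed++;
        free(names[i]);
    }
    free(names);
    closedir(dir);
    return failed ? -1 : removed;
}
```

Because every readdir() call finishes before the first unlink, no directory
offset cookie is reused after the directory has changed. A production version
would also want to tell a readdir() error apart from end-of-directory by
checking errno.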

The above is true for all versions of NFS up to NFSv4.2, which is
the current one, and unless some future version of NFS does READDIR
differently (I won't live long enough to see this ;-), it will always
be the case.

If my comment above was not clear, the following code is the "safe"
way to remove all entries in a directory.

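    /* Sketch only: real code must skip "." and "..", resolve d_name */
    /* relative to "X", and check opendir()/unlink() for errors. */
    do {
        dir = opendir("X");
        dp = readdir(dir);
        if (dp != NULL)
            unlink(dp->d_name);
        closedir(dir);
    } while (dp != NULL);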

In theory, the directory_offset_cookie was supposed to handle this, but it
has never worked correctly, for a couple of reasons.
1 - RFC 1813 (the NFSv3 one) did not describe the cookie verifier correctly.
    It should only change when cookies for extant entries change. The
    description suggested it should change whenever an entry is deleted,
    since that cookie is no longer valid.
2 - #1 only works if directory offset cookies for other entries in the
    directory do not change when an entry is deleted. This used to be the
    case for UFS, but was broken in FreeBSD when a commit many years ago
    optimized ufs_readdir() to compress out invalid entries. Doing this
    changes the directory offset cookies every time an entry is deleted at
    the beginning of a directory block.
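
To make point 2 concrete, here is a toy model of why shifting cookies cause
entries to be skipped. It is not the UFS on-disk layout or the real NFS
client logic: the struct toydir type, the batch size of 3, and the rule
"cookie == array index" are all assumptions made purely for illustration.

```c
#include <stdio.h>
#include <string.h>

#define NENTRIES 8  /* entries initially in the toy directory */
#define BATCH    3  /* entries returned per simulated READDIR */

/* Toy directory: live names are packed at the front; cookie == index. */
struct toydir {
    char names[NENTRIES][8];
    int count;
};

/* Populate the directory with f0 .. f7. */
void
toy_fill(struct toydir *d)
{
    for (d->count = 0; d->count < NENTRIES; d->count++)
        snprintf(d->names[d->count], sizeof(d->names[0]), "f%d", d->count);
}

/* Deleting compacts the array, so every later entry gets a lower cookie. */
void
toy_unlink(struct toydir *d, int idx)
{
    memmove(&d->names[idx], &d->names[idx + 1],
        (size_t)(d->count - idx - 1) * sizeof(d->names[0]));
    d->count--;
}

/*
 * Read a batch, unlink it, then continue either from the cookie handed
 * out with that batch (rewind == 0) or from the start (rewind != 0).
 * Returns the number of entries left behind.
 */
int
toy_remove_all(struct toydir *d, int rewind)
{
    int cookie = 0;

    while (cookie < d->count) {
        int n = d->count - cookie;

        if (n > BATCH)
            n = BATCH;
        /* The batch sits at cookie..cookie+n-1; each unlink shifts
         * the next batch member down to index `cookie`. */
        for (int i = 0; i < n; i++)
            toy_unlink(d, cookie);
        cookie = rewind ? 0 : cookie + n;
    }
    return d->count;
}
```

In this model, calling toy_fill() and then toy_remove_all(&d, 0) leaves f3,
f4 and f5 behind, because they slid below the cookie the reader was holding.
With rewind set, the same directory is emptied completely.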

rick
> On FreeBSD
> it seems to clobber its own directory cache, make extra unnecessary
> READDIR requests, and skip some of the files. Or maybe I have no idea
> what's going on and this is a hopelessly naive question and mission
> :-)
>
> Here's what we learned so far starting from Larry's report:
>
> https://www.postgresql.org/message-id/flat/04f95c3c13d4a9db87b3ac082a9f4877%40lerctr.org
>
> Note that this issue has nothing to do with "bad cookie" errors (I
> doubt the server I'm talking to even implements that -- instead it
> tries to have cookies that are persistent/stable).
>
> Also, while looking into this and initially suspecting cookie
> stability bugs (incorrectly), I checked a bunch of local file systems
> to understand how their cookies work, and I think I found a related
> problem when FreeBSD exports UFS, too. I didn't repro this with NFS
> but it's clearly visible from d_off locally with certain touch, rm
> sequences. First, let me state what I think the cookie should be
> trying to achieve, on a system that doesn't implement "bad cookie" but
> instead wants cookies that are persistent/always valid: if you make a
> series of READDIR requests using the cookie from the final entry of
> the previous response, it should be impossible to miss any entry that
> existed before your first call to readdir(), and impossible to see any
> entry twice. It is left undefined whether entries created after that
> time are visible, since anything else would require unbounded time or
> space via locks or multi-version magic (= isolation problems from
> database-land).
>
> Going back to the early 80s, Sun UFS looks good (based on illumos
> source code) because it doesn't seem to move entries after they are
> created. That must have been the only file system when they invented
> VFS and NFS. Various other systems since have been either complex but
> apparently good (ZFS/ZAP cursors can tolerate up to 2^16 hash
> collisions which I think we can call statistically impossible, XFS
> claims to be completely stable though I didn't understand fully why,
> BTRFS assigns incrementing numbers that will hopefully not wrap, ...),
> or nearly-good-enough-but-ugh (ext4 uses hashes like ZFS but
> apparently fails with ELOOP on hash collisions?). I was expecting
> FreeBSD UFS to be like Sun UFS but I don't think it is! In the UFS
> code since at least 4.3BSD (but apparently not in the Sun version,
> forked before or removed later?), inserting a new entry can compact a
> directory page, which moves the offset of a directory entry lower.
> AFAICS we can't move an entry lower, or we risk skipping it in NFS
> readdir(), and we can't move it higher, or we risk double-reporting it
> in readdir(). Or am I missing something?
>
> Thanks for reading and happy new year,
>
> Thomas Munro

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ler(at)lerctr(dot)org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010
