From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Craig Ringer <craig(dot)ringer(at)2ndquadrant(dot)com> |
Subject: | Re: Some thoughts on NFS |
Date: | 2019-02-19 16:52:11 |
Message-ID: | 20190219165211.gjgqs72ufbtdn3wz@alap3.anarazel.de |
Lists: | pgsql-hackers |
Hi,
On 2019-02-19 20:03:05 +1300, Thomas Munro wrote:
> The first is practical. Running out of disk space (or quota) is not
> all that rare (much more common than EIO from a dying disk, I'd
> guess), and definitely recoverable by an administrator: just create
> more space. It would be really nice to avoid panicking for an
> *expected* condition.
Well, that's true, but OTOH, we don't even handle that properly on local
filesystems for WAL. And while people complain, it's not *that* common.
> 1. Figure out how to get the ALLOCATE command all the way through the
> stack from PostgreSQL to the remote NFS server, and know for sure that
> it really happened. On the Debian buster Linux 4.18 system I checked,
> fallocate() reports EOPNOTSUPP for fallocate(), and posix_fallocate()
> appears to succeed but it doesn't really do anything at all (though I
> understand that some versions sometimes write zeros to simulate
> allocation, which in this case would be equally useless as it doesn't
> reserve anything on an NFS server). We need the server and NFS client
> and libc to be of the right version and cooperate and tell us that
> they have really truly reserved space, but there isn't currently a way
> as far as I can tell. How can we achieve that, without writing our
> own NFS client?
>
> 2. Deal with the resulting performance suckage. Extending 8kb at a
> time with synchronous network round trips won't fly.
I think I'd just go for fsync();pwrite();fsync(); as the extension
mechanism, iff we're detecting a tablespace is on NFS. The first fsync()
to make sure there are no previous errors that we could mistake for
ENOSPC, the pwrite() to extend, the second fsync() to make sure there's
actually space. Then we can detect ENOSPC properly. That possibly does
leave some errors where we could mistake ENOSPC for something more benign
than it is, but the cases seem pretty narrow, due to the previous
fsync() (maybe the other side could be thinly provisioned and get an
ENOSPC there - but in that case we didn't actually lose any data. The
only dangerous scenario I can come up with is that the remote side is on
a thinly provisioned CoW system, and a concurrent write to an earlier
block runs out of space - but seriously, good riddance to you).
Given that the current code already tries to extend in bigger chunks when
there's contention, we just need to combine the writes for those; that
ought to not be that hard now that we don't initialize bulk-extended
pages anymore. That won't solve the issue of extending during
single-threaded writes, but I feel like that's secondary to actually being
correct. And using bulk-extension in more cases doesn't sound too hard
to me.
Greetings,
Andres Freund