| From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> | 
|---|---|
| To: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com> | 
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Subject: | Re: Changing shared_buffers without restart | 
| Date: | 2025-04-21 14:16:31 | 
| Message-ID: | CA+hUKGLQhsZ1dEf5Zo6JuPbs6n-qX=cTGy49feKf1iFA_TBP1g@mail.gmail.com | 
| Lists: | pgsql-hackers | 
On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> Yeah, that would work and would let us avoid MAP_FIXED and mremap, which are
> questionable from a portability point of view. This leaves memfd_create, and I'm
> still not completely clear on its portability -- it seems to be specific to
> Linux, but others provide compatible implementations as well.
Something like this should work, roughly based on DSM code except here
we don't really need the name so we unlink it immediately, at the
slight risk of leaking it if the postmaster is killed between those
lines (maybe someone should go and tell POSIX to support the special
name SHM_ANON or some other way to avoid that; I can't see any
portable workaround).  Not tested/compiled, just a sketch:
#ifdef HAVE_MEMFD_CREATE
  /* Anonymous shared memory region. */
  fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
#else
  /* Standard POSIX insists on a name, which we unlink immediately. */
  do
  {
      char tmp[80];

      /* Portable shm_open() names start with a slash. */
      snprintf(tmp, sizeof(tmp), "/PostgreSQL.%u",
               pg_prng_uint32(&pg_global_prng_state));
      /* shm_open() needs an access mode flag and a permissions argument. */
      fd = shm_open(tmp, O_RDWR | O_CREAT | O_EXCL, 0600);
      if (fd >= 0)
          shm_unlink(tmp);
  } while (fd < 0 && errno == EEXIST);  /* retry only on a name collision */
#endif
> Let me experiment with this idea a bit, I would like to make sure there are no
> other limitations we might face.
One thing I'm still wondering about is whether you really need all
this multi-phase barrier stuff, or even need to stop other backends
from running at all while doing the resize.  I guess that's related to
your remapping scheme, but supposing you find the simple
ftruncate()-only approach to be good, my next question is:  why isn't
it enough to wait for all backends to agree to stop allocating new
buffers in the range to be truncated, and then let them continue to
run as normal?  As far as they would be concerned, the in-progress
downsize has already happened, though it could be reverted later if
the eviction phase fails.  Then the coordinator could start evicting
buffers and truncating the shared memory object, which are
phases/steps, sure, but it's not clear to me why they need other
backends' help.
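
For concreteness, here is a minimal sketch of what the ftruncate()-only
scheme might look like, assuming the region is mapped once at its maximum
size and only the length of the backing object changes afterwards.
ResizableRegion, region_init() and region_resize() are hypothetical names
made up for illustration, not anything from the patch.  Touching pages
beyond the current length would raise SIGBUS, which is why backends would
first have to agree not to use that range.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one resizable shared region. */
typedef struct ResizableRegion
{
    int         fd;         /* backing shared memory object */
    char       *base;       /* start of the mapping, fixed for its lifetime */
    size_t      reserved;   /* address space mapped up front */
    size_t      used;       /* current length of the backing object */
} ResizableRegion;

/*
 * Map 'reserved' bytes once, with only the first 'initial' bytes backed by
 * the object.  Returns 0 on success, -1 with errno set on failure.
 */
static int
region_init(ResizableRegion *r, int fd, size_t reserved, size_t initial)
{
    void       *p;

    assert(r != NULL);
    if (initial > reserved)
    {
        errno = EINVAL;
        return -1;
    }
    if (ftruncate(fd, (off_t) initial) < 0)
        return -1;
    p = mmap(NULL, reserved, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    r->fd = fd;
    r->base = p;
    r->reserved = reserved;
    r->used = initial;
    return 0;
}

/*
 * Grow or shrink the backing object.  The mapping itself is untouched, so
 * no MAP_FIXED or mremap() is needed and existing pointers stay valid.
 */
static int
region_resize(ResizableRegion *r, size_t new_size)
{
    assert(r != NULL);
    if (new_size > r->reserved)
    {
        errno = EINVAL;
        return -1;
    }
    if (ftruncate(r->fd, (off_t) new_size) < 0)
        return -1;
    r->used = new_size;
    return 0;
}
```

In this sketch only the coordinator would call region_resize(); every
other process keeps the same base address and just has to respect the new
limit.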
It sounds like Windows might need a second ProcSignalBarrier poke in
order to call VirtualUnlock() in every backend.  That's based on that
Usenet discussion I lobbed in here the other day; I haven't tried it
myself or fully grokked why it works, and there could well be other
ways, IDK.  Assuming it's the right approach, between the first poke
to make all backends accept the new lower size and the second poke to
unlock the memory, I don't see why they need to wait.  I suppose it
would be the same ProcSignalBarrier, but behave differently based on a
control variable.  I suppose there could also be a third poke, if you
want to consider the operation to be fully complete only once they
have all actually done that unlock step, but it may also be OK not to
worry about that, IDK.
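
To illustrate the "same barrier, different behaviour" idea, here is a
rough sketch of a handler that reads a phase from a shared control
variable.  Everything in it (ResizePhase, ResizeControl, BackendState,
handle_resize_barrier()) is a hypothetical name invented for this example
and is not PostgreSQL's ProcSignalBarrier API; the unlock step is only a
flag here, standing in for whatever a Windows backend would really need
to do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical phases the coordinator steps through. */
typedef enum ResizePhase
{
    RESIZE_IDLE,
    RESIZE_ACCEPT_NEW_SIZE,     /* first poke: adopt the lower limit */
    RESIZE_UNLOCK_MEMORY        /* second poke: release the truncated range */
} ResizePhase;

/* Hypothetical control variables living in shared memory. */
typedef struct ResizeControl
{
    atomic_int  phase;
    atomic_size_t target_nbuffers;
} ResizeControl;

/* Hypothetical per-backend view of the buffer pool. */
typedef struct BackendState
{
    size_t      nbuffers_limit;     /* never allocate at or above this */
    bool        memory_unlocked;    /* unlock step has been performed */
} BackendState;

/*
 * One handler for every poke; what it does depends on the shared phase.
 * It never waits for other backends, it only updates local state.
 */
static bool
handle_resize_barrier(ResizeControl *ctl, BackendState *me)
{
    assert(ctl != NULL && me != NULL);

    switch ((ResizePhase) atomic_load(&ctl->phase))
    {
        case RESIZE_ACCEPT_NEW_SIZE:
            me->nbuffers_limit = atomic_load(&ctl->target_nbuffers);
            me->memory_unlocked = false;
            break;
        case RESIZE_UNLOCK_MEMORY:
            /* A Windows build might call VirtualUnlock() here. */
            me->memory_unlocked = true;
            break;
        case RESIZE_IDLE:
            break;
    }
    return true;                /* barrier absorbed */
}
```

The coordinator would set the phase, emit the barrier, and carry on with
eviction once every backend has absorbed the first poke.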
On the other hand, maybe it just feels less risky if you stop the
whole world, or maybe you envisage parallelising the eviction work, or
there is some correctness concern I haven't grokked yet, but what?
> > *You might also want to use fallocate after ftruncate on Linux to
> > avoid SIGBUS on allocation failure on first touch page fault, which
> > raises portability questions since it's unspecified whether you can do
> > that with shm fds and fails on some systems, but let's call that an
> > independent topic as it's not affected by this choice.
>
> I'm afraid it would be strictly necessary to do fallocate, otherwise we're
> back where we were before reservation accounting for huge pages in Linux (lots
> of people were facing unexpected SIGBUS when dealing with cgroups).
Yeah.  FWIW here is where we decided to gate that on __linux__ while
fixing that for DSM:
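
Separately from that reference, and not the DSM code itself, here is a
rough sketch of the fallocate-after-ftruncate idea: set the size, then on
Linux ask the kernel to reserve the pages immediately so that an
allocation failure shows up as an error return rather than as SIGBUS on
first touch.  resize_and_reserve() is a hypothetical helper name, and
posix_fallocate() is used here as one way to do the reservation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Set the object's size and, on Linux only, reserve its pages up front.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
resize_and_reserve(int fd, off_t size)
{
    assert(size >= 0);

    if (ftruncate(fd, size) < 0)
        return -1;

#ifdef __linux__
    /* posix_fallocate() rejects a zero length, so skip it for empty objects. */
    if (size > 0)
    {
        int         rc;

        /* It returns an error number directly instead of setting errno. */
        do
        {
            rc = posix_fallocate(fd, 0, size);
        } while (rc == EINTR);

        if (rc != 0)
        {
            errno = rc;
            return -1;
        }
    }
#endif

    return 0;
}
```

On other platforms this sketch falls back to plain ftruncate(), since
whether allocation works on shm fds there is unspecified.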