Re: Refactoring the checkpointer's fsync request queue

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Shawn Debnath <sdn(at)amazon(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Refactoring the checkpointer's fsync request queue
Date: 2019-02-13 05:40:05
Message-ID: CAEepm=0_0r4H1LLbG42XbVHHLoDnRh3JWGtj-rnvdWa4TV77Vw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 13, 2019 at 3:58 PM Shawn Debnath <sdn(at)amazon(dot)com> wrote:
> On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote:
> > I wonder if it might be better to introduce two different functions
> > catering to the two different use cases for forcing an immediate sync:
> >
> > - sync a relation
> > smgrimmedsyncrel(SMgrRelation, ForkNumber)
> > - sync a specific segment
> > smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)
> >
> > This will avoid having to specify InvalidSegmentNumber for majority of
> > the callers today.
>
> I have gone ahead and rebased the refactor patch so it could cleanly
> apply on heapam.c, see patch v7.
>
> I am also attaching a patch (v8) that implements smgrimmedsyncrel() and
> smgrimmedsyncseg() as I mentioned in the previous email. It avoids
> callers to pass in InvalidSegmentNumber when they just want to sync the
> whole relation. As a side effect, I was able to get rid of some extra
> checkpointer.h includes.

Hi Shawn,

Thanks! And sorry for not replying sooner -- I got distracted by
FOSDEM (and the associated 20 thousand miles of travel). On that trip
I had a chance to discuss this patch with Andres Freund in person, and
he opined that it might be better for the fsync request queue to work
in terms of pathnames. Instead of the approach in this patch, where a
backend sends an fsync request for { reflfilenode, segno } inside
mdwrite(), and then the checkpointer processes the request by calling
smgrdimmedsyncrel(), he speculated that it'd be better to have
mdwrite() send an fsync request for a pathname, and then the
checkpointer would just open that file by name and fsync() it. That
is, the checkpointer wouldn't call back into smgr.

One of the advantages of that approach is that there are probably
other files that need to be fsync'd for each checkpoint that could
benefit from being offloaded to the checkpointer. Another is that you
break the strange cycle mentioned above.

Here's a possible problem with it. The fsync request messages would
have to be either large (MAXPGPATH) or variable sized and potentially
large. I am a bit worried that such messages would be problematic for
the atomicity requirement of the (future) fd-passing patch that passes
it via a Unix domain socket (which is a bit tricky because
SOCK_SEQPACKET and SOCK_DGRAM aren't portable enough, so we probably
have to use SOCK_STREAM, but there is no formal guarantee like
PIPE_BUF; we know that in practice small messages will be atomic, but
certainty decreases with larger messages. This needs more study...).
You need to include the path even in a message containing an fd,
because the checkpointer will use that as a hashtable key to merge
received requests. Perhaps you'd solve that by using a small tag that
can be converted back to a path (as you noticed my last patch had some
leftover dead code from an experiment along those lines), but then I
think you'd finish up needing an smgr interface to convert it back to
the path (implementation different for md.c, undofile.c, slru.c). So
you don't exactly break the cycle mentioned earlier. Hmm, or perhaps
you could avoid even thinking about atomicity by passing 1 byte
fd-bearing messages via the pipe, and pathnames via shared memory, in
the same order.

Another consideration if we do that is that the existing scheme has a
kind of hierarchy that allows fsync requests to be cancelled in bulk
when you drop relations and databases. That is, the checkpointer
knows about the internal hierarchy of tablespace, db, rel, seg. If we
get rid of that and have just paths, it seems like a bad idea to teach
the checkpointer about the internal structure of the paths (even
though we know they contain the same elements encoded somehow). You'd
have to send an explicit cancel for every key; that is, if you're
dropping a relation, you need to generate a cancel message for every
segment, and if you're dropping a database, you need to generate a
cancel message for every segment of every relation. Once again, if
you used some kind of tag that is passed back to smgr, you could
probably come up with a way to do it.

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jamison, Kirk 2019-02-13 05:48:28 RE: Cache relation sizes?
Previous Message Michael Paquier 2019-02-13 05:13:09 Better error messages when lacking connection slots for autovacuum workers and bgworkers