Re: Two fsync related performance issues?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Paul Guo <pguo(at)pivotal(dot)io>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Two fsync related performance issues?
Date: 2020-05-19 20:30:33
Message-ID: CA+hUKGLi20-oSkTFS_uc3fuWxrJ=-X=zt7rYaeCiJKGsauG=xg@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 20, 2020 at 12:51 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, May 11, 2020 at 8:43 PM Paul Guo <pguo(at)pivotal(dot)io> wrote:
> > I have this concern since I saw an issue in a real product environment that the startup process needs 10+ seconds to start wal replay after relaunch due to elog(PANIC) (it was seen on postgres based product Greenplum but it is a common issue in postgres also). I highly suspect the delay was mostly due to this. Also it is noticed that on public clouds fsync is much slower than that on local storage so the slowness should be more severe on cloud. If we at least disable fsync on the table directories we could skip a lot of file fsync - this may save a lot of seconds during crash recovery.
>
> I've seen this problem be way worse than that. Running fsync() on all
> the files and performing the unlogged table cleanup steps can together
> take minutes or, I think, even tens of minutes. What I think sucks
> most in this area is that we don't even emit any log messages if the
> process takes a long time, so the user has no idea why things are
> apparently hanging. I think we really ought to try to figure out some
> way to give the user a periodic progress indication when this kind of
> thing is underway, so that they at least have some idea what's
> happening.
>
> As Tom says, I don't think there's any realistic way that we can
> disable it altogether, but maybe there's some way we could make it
> quicker, like some kind of parallelism, or by overlapping it with
> other things. It seems to me that we have to complete the fsync pass
> before we can safely checkpoint, but I don't know that it needs to be
> done any sooner than that... not sure though.

I suppose you could do with the whole directory tree what
register_dirty_segment() does, for the pathnames that you recognise as
regular md.c segment names. Then it'd be done as part of the next
checkpoint, though you might want to bring the pre_sync_fname() stuff
back into it somehow to get more I/O parallelism on Linux (elsewhere
it does nothing). Of course that syscall could block, and the
checkpointer queue can fill up, in which case you have to do the fsync
synchronously anyway, so you'd have to look into whether deferring the
work that way actually helps enough.
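
A rough sketch of the shape I mean, as standalone C rather than the real
md.c/sync.c interfaces (the queue, its size, the fallback path and the
segment-name test are all made up for illustration): walk the tree, kick
off writeback early with sync_file_range() where available, and hand the
actual fsync() to a queue the checkpointer would drain, falling back to a
synchronous fsync() when the queue is full.

#define _GNU_SOURCE             /* for nftw() and sync_file_range() on glibc */
#include <fcntl.h>
#include <ftw.h>
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define FSYNC_QUEUE_SIZE 1024   /* made-up bound, stands in for the real queue */

static char fsync_queue[FSYNC_QUEUE_SIZE][1024];
static int  fsync_queue_len = 0;

/* Does this look like an md.c segment file name, e.g. "16384" or "16384.1"? */
static bool
looks_like_segment_name(const char *name)
{
    bool        seen_dot = false;

    if (*name < '0' || *name > '9')
        return false;
    for (; *name != '\0'; name++)
    {
        if (*name == '.' && !seen_dot)
            seen_dot = true;
        else if (*name < '0' || *name > '9')
            return false;
    }
    return true;
}

static int
visit(const char *path, const struct stat *sb, int type, struct FTW *ftwbuf)
{
    (void) sb;

    if (type != FTW_F || !looks_like_segment_name(path + ftwbuf->base))
        return 0;

    if (fsync_queue_len < FSYNC_QUEUE_SIZE)
    {
        int         fd = open(path, O_RDONLY);

        if (fd >= 0)
        {
#ifdef __linux__
            /* Kick off writeback now for I/O parallelism; fsync comes later. */
            sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#endif
            close(fd);
        }
        /* Defer the fsync: the checkpointer would drain this queue later. */
        strncpy(fsync_queue[fsync_queue_len++], path, sizeof(fsync_queue[0]) - 1);
    }
    else
    {
        /* Queue full: do it synchronously, just as the real request queue
         * path has to when it overflows. */
        int         fd = open(path, O_RDONLY);

        if (fd >= 0)
        {
            fsync(fd);
            close(fd);
        }
    }
    return 0;
}

/* Walk the data directory, queueing fsyncs instead of issuing them inline. */
static void
queue_data_directory_fsyncs(const char *datadir)
{
    nftw(datadir, visit, 64, FTW_PHYS);
}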

The whole concept of SyncDataDirectory() bothers me a bit though,
because although it's apparently trying to be safe by being
conservative, it assumes a model of write-back error handling that we
now know to be false on Linux. And then it thrashes the inode cache
to make it more likely that error state is forgotten, just for good
measure.
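
For anyone who hasn't followed the fsyncgate saga, the hazard is roughly
the following (just an illustration of the kernel behaviour, not anything
SyncDataDirectory() literally does): once fsync() has reported an error,
Linux may mark the affected dirty pages clean and drop the error state,
so a retry can return success even though the data never reached stable
storage. That's why we now PANIC rather than retry.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Illustration of the writeback error-handling hazard on Linux: after
 * fsync() reports EIO, the kernel typically marks the dirty pages clean
 * and clears the error, so retrying can "succeed" while the data is lost.
 */
static int
naive_fsync_retry(int fd)
{
    if (fsync(fd) == 0)
        return 0;               /* genuinely flushed */

    fprintf(stderr, "fsync failed: %d, retrying\n", errno);

    /* WRONG: a successful retry does not mean the earlier writes made it. */
    return fsync(fd);
}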

What would a precise version of this look like? Maybe we really only
need to fsync relation files that recovery modifies (as we already
do), plus those that it would have touched but didn't because of the
page LSN (a new behaviour to catch segment files that were written by
the last run but not yet flushed, which I guess in practice would only
happen with full_page_writes=off)? (If you were paranoid enough to
believe that the buffer cache were actively trying to trick you and
marked dirty pages clean and lost the error state so you'll never hear
about it, you might even want to rewrite such pages once.)
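
To make the second part concrete, here is a minimal sketch of replaying
one block reference with that extra step (the type and function names are
invented for illustration, not the real xlog.c/bufmgr.c interfaces):

#include <stdint.h>

typedef struct
{
    uint64_t    page_lsn;       /* LSN stamped on the on-disk page */
    /* ... page contents ... */
} Page;

typedef struct
{
    uint64_t    lsn;            /* LSN of the WAL record being replayed */
    /* ... which relation segment and block it touches ... */
} RedoRecord;

/* Stand-ins for the real redo and sync-request machinery. */
static void
apply_change(Page *page, const RedoRecord *rec)
{
    page->page_lsn = rec->lsn;  /* redo the change, stamp the new LSN */
}

static void
register_fsync_for_segment(const RedoRecord *rec)
{
    (void) rec;                 /* hand the segment to the checkpointer's queue */
}

static void
redo_block(Page *page, const RedoRecord *rec)
{
    if (page->page_lsn < rec->lsn)
    {
        /* Page predates the record: apply it; the dirtied buffer already
         * guarantees an fsync before the next checkpoint, as today. */
        apply_change(page, rec);
    }
    else
    {
        /*
         * Page already carries this change, so today we skip it entirely.
         * The proposed extra step: still register the segment for fsync,
         * because the previous run may have written the page to the kernel
         * without ever flushing it to stable storage.
         */
        register_fsync_for_segment(rec);
    }
}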
