Re: should crash recovery ignore checkpoint_flush_after ?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Subject: Re: should crash recovery ignore checkpoint_flush_after ?
Date: 2020-01-18 23:32:02
Message-ID: 20200118233202.ax27prmsvvxqaytx@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-01-19 09:52:21 +1300, Thomas Munro wrote:
> On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > As I understand, the first thing that happens is syncing every file in the
> > data dir, like in initdb --sync. These instances were both 5+TB on zfs, with
> > compression, so that's slow, but tolerable, and at least understandable, and
> > with visible progress in ps.
> >
> > The 2nd stage replays WAL. strace shows it's occasionally running
> > sync_file_range, and I think recovery might've been several times faster if
> > we'd just dumped the data to the OS ASAP, fsyncing once per file. In fact,
> > I've just kill -9'd the recovery process and edited the config to disable
> > this lest it spend all night in recovery.
>
> Does sync_file_range() even do anything for non-mmap'd files on ZFS?

Good point. Next time it might be worthwhile to use strace -T to see
whether the sync_file_range calls actually take meaningful time.

> Non-mmap'd ZFS data is not in the Linux page cache, and I think
> sync_file_range() works at that level. At a guess, there'd need to be
> a new VFS file_operation so that ZFS could get a callback to handle
> data in its ARC.

Yea, it requires the pages to be in the pagecache to do anything:

int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
                    unsigned int flags)
{
        ...

        if (flags & SYNC_FILE_RANGE_WRITE) {
                int sync_mode = WB_SYNC_NONE;

                if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
                            SYNC_FILE_RANGE_WRITE_AND_WAIT)
                        sync_mode = WB_SYNC_ALL;

                ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
                                                 sync_mode);
                if (ret < 0)
                        goto out;
        }

and then

int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
                               loff_t end, int sync_mode)
{
        int ret;
        struct writeback_control wbc = {
                .sync_mode = sync_mode,
                .nr_to_write = LONG_MAX,
                .range_start = start,
                .range_end = end,
        };

        if (!mapping_cap_writeback_dirty(mapping) ||
            !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
                return 0;

which means that if there are no pages in the pagecache for the relevant
range, it'll just finish here. *Iff* there are some, say because
something else mmap()ed a section, it'd potentially call into the
address_space->writepages() callback. So it's possible to emulate
enough state for ZFS or such to still get sync_file_range() to call into
it (by setting up a pseudo map tagged as dirty), but it's not really the
normal path.

Greetings,

Andres Freund
