Re: Postgres, fsync, and OSs (specifically linux)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Date: 2018-04-30 16:09:45
Message-ID: 20180430160945.5s5qfoqryhtmugxl@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2018-04-30 13:03:24 +0800, Craig Ringer wrote:
> Hrm, something else that just came up. On 9.6+ we use sync_file_range.
> It's surely going to eat errors:
>
> rc = sync_file_range(fd, offset, nbytes,
> SYNC_FILE_RANGE_WRITE);
>
> /* don't error out, this is just a performance optimization */
> if (rc != 0)
> {
> ereport(WARNING,
> (errcode_for_file_access(),
> errmsg("could not flush dirty data: %m")));
> }

It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which
seems sensible, because they could be considered data integrity
operations.

fs/sync.c:
SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
unsigned int, flags)
{
...

if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
ret = file_fdatawait_range(f.file, offset, endbyte);
if (ret < 0)
goto out_put;
}

if (flags & SYNC_FILE_RANGE_WRITE) {
ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
WB_SYNC_NONE);
if (ret < 0)
goto out_put;
}

if (flags & SYNC_FILE_RANGE_WAIT_AFTER)
ret = file_fdatawait_range(f.file, offset, endbyte);

where
int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
{
struct address_space *mapping = file->f_mapping;

__filemap_fdatawait_range(mapping, start_byte, end_byte);
return file_check_and_advance_wb_err(file);
}
EXPORT_SYMBOL(file_fdatawait_range);

the critical call is file_check_and_advance_wb_err(). That's the call
that checks and clears errors.

Would be good to add a kernel xfstest (gets used on all relevant FS
these days), to make sure that doesn't change.

> I'm very suspicious about the safety of the msync() path too.

That seems justified however:

SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
{
...
error = vfs_fsync_range(file, fstart, fend, 1);
..

*/
int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
...
return file->f_op->fsync(file, start, end, datasync);
}
EXPORT_SYMBOL(vfs_fsync_range);

int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
...
ret = file_write_and_wait_range(file, start, end);
if (ret)
return ret;
...

STATIC int
xfs_file_fsync(
struct file *file,
loff_t start,
loff_t end,
int datasync)
{
...
error = file_write_and_wait_range(file, start, end);
if (error)
return error;

int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
{
...
err2 = file_check_and_advance_wb_err(file);
if (!err)
err = err2;
return err;
}

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-04-30 17:07:49 Re: "could not reattach to shared memory" on buildfarm member dory
Previous Message Andrew Dunstan 2018-04-30 16:09:12 Re: Intermittent ECPG test failures on Windows buildfarm machines