From: | Paul Guo <guopa(at)vmware(dot)com> |
---|---|
To: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Two patches to speed up pg_rewind. |
Date: | 2021-01-27 09:18:48 |
Message-ID: | 7C1703E7-F3F3-43FA-86EB-177C671BF33C@vmware.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
While reading pg_rewind code I found two things could speed up pg_rewind.
Attached are the patches.
First one: pg_rewind would fsync the whole pgdata directory on the target by default,
but that is a waste since usually just part of the files/directories on
the target are modified. Other files on the target should have been flushed
since pg_rewind requires a clean shutdown before doing the real work. This
would help the scenario that the target postgres instance includes millions of
files, which has been seen in a real environment.
There are several things that may need further discussions:
1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data”,
but now the code guarded by it is just pre_sync_fname() relevant so we might want
to rename it as HAVE_PRE_SYNC kind of name?
2. Pre_sync_fname() implementation
The code looks like this:
#if defined(HAVE_SYNC_FILE_RANGE)
(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
I’m a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED.
I did not check the Linux Kernel code but according to the man
page I suspect that this option might cause the kernel tends to evict the related kernel
pages from the page cache, which might not be something we expect. This is
not a big issue since sync_file_range() should exist on many widely used Linux.
Also I’m not sure how much we could benefit from the pre_sync code. Also note if the
directory has a lot of files or the IO is fast, pre_sync_fname() might slow down
the process instead. The reasons are: If there are a lot of files it is possible that we need
to read the already-synced-and-evicted inode from disk (by open()-ing) after rewinding since
the inode cache in Linux Kernel is limited; also if the IO is faster and kernel do background
dirty page flush quickly, pre_sync_fname() might just waste cpu cycles.
A better solution might be launch a separate pthread and do fsync one by one
when pg_rewind finishes handling one file. pg_basebackup could use the solution also.
Anyway this is independent of this patch.
Second one is use copy_file_range() for the local rewind case to replace read()+write().
This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other
code could use copy_file_range() if needed. copy_file_range() was introduced
In high-version Linux Kernel, in low-version Linux or other Unix-like OS mmap()
might be better than read()+write() but copy_file_range() is more interesting
given that it could skip the data copying in some file systems - this could benefit more
on Linux fs on network-based block storage.
Regards,
Paul
Attachment | Content-Type | Size |
---|---|---|
0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch | application/octet-stream | 8.7 KB |
0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch | application/octet-stream | 4.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Bharath Rupireddy | 2021-01-27 09:27:00 | Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax |
Previous Message | kuroda.hayato@fujitsu.com | 2021-01-27 09:18:28 | RE: ECPG: proposal for new DECLARE STATEMENT |