From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Jan Wieck <jan(at)wi3ck(dot)info>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: WAL Re-Writes |
Date: | 2016-02-08 05:08:55 |
Message-ID: | CAA4eK1JL_rSc5tS13M-aPnqtLBgFJkPX2-hC7vxmuAYZTYp3qw@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Feb 3, 2016 at 7:12 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Feb 3, 2016 at 7:28 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On further testing, it has been observed that misaligned writes could
> > cause reads even when blocks related to file are not in-memory, so
> > I think what Jan is describing is right. The case where there is
> > absolutely zero chance of reads is when we write in OS-page boundary
> > which is generally 4K. However I still think it is okay to provide an
> > option for WAL writing in smaller chunks (512 bytes , 1024 bytes, etc)
> > for the cases when these are beneficial like when wal_level is
> > greater than equal to Archive and keep default as OS-page size if
> > the same is smaller than 8K.
>
> Hmm, a little research seems to suggest that 4kB pages are standard on
> almost every system we might care about: x86_64, x86, Power, Itanium,
> ARMv7. Sparc uses 8kB, though, and a search through the Linux kernel
> sources (grep for PAGE_SHIFT) suggests that there are other obscure
> architectures that can at least optionally use larger pages, plus a
> few that can use smaller ones.
>
> I'd like this to be something that users don't have to configure, and
> it seems like that should be possible. We can detect the page size on
> non-Windows systems using sysconf(_SC_PAGESIZE), and on Windows by
> using GetSystemInfo. And I think it's safe to make this decision at
> configure time, because the page size is a function of the hardware
> architecture (it seems there are obscure systems that support multiple
> page sizes, but I don't care about them particularly). So what I
> think we should do is set an XLOG_WRITESZ along with XLOG_BLCKSZ and
> set it to the smaller of XLOG_BLCKSZ and the system page size. If we
> can't determine the system page size, assume 4kB.
>
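The rule proposed in the quoted text (take the smaller of XLOG_BLCKSZ and the system page size, and assume 4kB if the page size can't be determined) could be sketched roughly as below. This is only an illustration for POSIX systems, not code from any patch: the function names and DEFAULT_PAGE_SIZE are made up for the example, XLOG_BLCKSZ is hard-coded to the usual 8kB default, and the Windows GetSystemInfo path is left out.

```c
#include <assert.h>
#include <unistd.h>

#define XLOG_BLCKSZ        8192   /* usual default WAL block size (assumed here) */
#define DEFAULT_PAGE_SIZE  4096   /* hypothetical fallback when detection fails */

/* Ask the OS for its page size; sysconf() returns -1 if it cannot tell. */
long
detect_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/*
 * Pick the WAL write unit: the smaller of XLOG_BLCKSZ and the page size,
 * falling back to DEFAULT_PAGE_SIZE when the page size is unknown.
 */
long
choose_wal_write_size(long page_size)
{
    if (page_size <= 0)
        page_size = DEFAULT_PAGE_SIZE;

    /* Page sizes are powers of two on the platforms discussed here. */
    assert((page_size & (page_size - 1)) == 0);

    return (page_size < XLOG_BLCKSZ) ? page_size : XLOG_BLCKSZ;
}
```

Calling choose_wal_write_size(detect_page_size()) would give 4096 on a system with 4kB pages and 8192 on one with 8kB or larger pages, given the 8kB XLOG_BLCKSZ assumed above.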
I think deciding it automatically, without requiring the user to
configure it, certainly has merits. But what about cases where users
could benefit from configuring it themselves, such as when we use the
PG_O_DIRECT flag for WAL? (With O_DIRECT, writes bypass the OS buffers
and won't cause misaligned writes even for smaller chunk sizes like
512 bytes or so.) Some googling [1] reveals that other databases also
let the user configure the WAL block/chunk size (as BLOCKSIZE),
although they seem to choose the chunk size based on the disk sector
size.
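To make the effect of the chunk size concrete, here is a small sketch of how many bytes end up being written when a range of new WAL is flushed in chunk-sized units. It is only arithmetic under assumed names (wal_bytes_written and its parameters are hypothetical, not PostgreSQL functions), and it says nothing about whether a sub-page write triggers a read, which is the other half of the trade-off.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes physically written when the new WAL in [start, end) is flushed in
 * units of `chunk` bytes: the start is rounded down and the end rounded up
 * to a chunk boundary. `chunk` must be a power of two.
 */
size_t
wal_bytes_written(size_t start, size_t end, size_t chunk)
{
    assert(chunk > 0 && (chunk & (chunk - 1)) == 0);
    assert(start <= end);

    size_t first = start & ~(chunk - 1);               /* round down */
    size_t last = (end + chunk - 1) & ~(chunk - 1);    /* round up */

    return last - first;
}
```

For example, 300 bytes of new WAL at offsets 1000..1300 would cost 1024 bytes with 512-byte chunks, 4096 bytes with 4kB chunks, and 8192 bytes with 8kB blocks.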

An additional thought, not necessarily related to this patch: if the
user chooses, or we decide, to write in 512-byte chunks, which is
usually the disk sector size, then can't we consider avoiding the CRC
for each record in such cases, because each WAL write will in itself
be atomic? While reading, if we process in WAL-chunk-sized units, then
I think it should be possible to detect end-of-WAL based on the data
read.

[1] - http://docs.oracle.com/cd/E11882_01/server.112/e41084/clauses004.htm#SQLRF52268

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com