From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: WAL replay should fdatasync() segments? |
Date: | 2014-01-22 17:54:28 |
Message-ID: | CAHGQGwHXZ2VuBNGniMcMsg4o1YtBJ-JSPX_JtaR4FY-=o4YCcg@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Jan 23, 2014 at 2:08 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-01-23 02:05:48 +0900, Fujii Masao wrote:
>> On Thu, Jan 23, 2014 at 1:21 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > Hi,
>> >
>> > Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will
>> > write() data before fdatasync()ing them (duh, kinda obvious). But I
>> > think given the current recovery code that leaves a window where we can
>> > get into strange inconsistencies.
>> > Consider what happens if postgres (not the OS!) crashes after writing
>> > WAL data to the OS, but before fdatasync()ing it. Replay will happily
>> > read that record from disk and replay it, which is fine. At the end of
>> > recovery we then will start inserting new records, and those will be
>> > properly fsynced to disk.
>> > But if the *OS* crashes in that moment we might get into the strange
>> > situation where older records might be lost since they weren't
>> > fsync()ed, but newer records and the control file will persist.
>> >
>> > I think for a primary that window is relatively small, but I think it's
>> > a good bit bigger for a standby, especially if it's promoted.
>>
>> In the normal streaming replication case, ISTM that window is not bigger for
>> the standby because basically the standby replays only the WAL data
>> which walreceiver fsync'd to the disk. But if it replays the WAL file which
>> was fetched from the archive, that WAL file might not have been flushed
>> to the disk yet. In this case, that window might become bigger...
>
> Yea, but if the walreceiver receives data and crashes/disconnects before
> fsync(), we'll read it from pg_xlog, right? And if we promote, we'll
> start inserting new records before establishing a new checkpoint.
Yeah, true. Such an unflushed WAL file can be read by subsequent recovery...
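
To make the window concrete, here is a minimal sketch of the kind of step being discussed: flushing a segment file to stable storage before recovery trusts its contents. The function name sync_segment_before_replay() is hypothetical and is not an existing PostgreSQL routine, and the sketch uses plain POSIX fsync() rather than fdatasync() only for portability. It illustrates the idea under those assumptions and is not a proposed patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of PostgreSQL): force a WAL segment that may
 * have been write()n but never synced out to stable storage, so that records
 * replayed from it cannot vanish in a later OS crash.
 *
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 */
int
sync_segment_before_replay(const char *path)
{
    /* Open read-write: some platforms refuse fsync() on a read-only fd. */
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;

    if (fsync(fd) != 0)
    {
        /* Preserve the fsync() error across close(). */
        int saved_errno = errno;

        close(fd);
        errno = saved_errno;
        return -1;
    }

    return close(fd);
}
```

Calling this once per segment before reading records from it would close the gap described above, at the cost of an extra sync for segments that were already durable.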
Regards,
--
Fujii Masao