From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: WAL logging problem in 9.4.3? |
Date: | 2015-07-23 08:52:33 |
Message-ID: | CANP8+jLWXr+uskrHOSdcKXj+rEDWFrKgfYuC4TNhCuv7Po+jbA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 22 July 2015 at 17:21, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> When a WAL-skipping COPY begins, we add an entry for that relation in a
> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
> would normally be WAL-logged, we check if the relation is in the hash
> table, and skip WAL-logging if so.
>
> That was a simplified explanation. In reality, when WAL-skipping COPY
> begins, we also memorize the current size of the relation. Any actions on
> blocks greater than the old size are not WAL-logged, and any actions on
> smaller-numbered blocks are. This ensures that if you did any INSERTs on
> the table before the COPY, any new actions on the blocks that were already
> WAL-logged by the INSERT are also WAL-logged. And likewise if you perform
> any INSERTs after (or during, by trigger) the COPY, and they modify the new
> pages, those actions are not WAL-logged. So starting a WAL-skipping COPY
> splits the relation into two parts: the first part that is WAL-logged as
> usual, and the later part that is not WAL-logged. (there is one loose end
> marked with XXX in the patch on this, when one of the pages involved in a
> cold UPDATE is before the watermark and the other is after)
>
> The actual fsync() has been moved to the end of transaction, as we are now
> skipping WAL-logging of any actions after the COPY as well.
>
> And truncations complicate things further. If we emit a truncation WAL
> record in the transaction, we also make an entry in the hash table to
> record that. All operations on a relation that has been truncated must be
> WAL-logged as usual, because replaying the truncate record will destroy all
> data even if we fsync later. But we still optimize for "BEGIN; CREATE;
> COPY; TRUNCATE; COPY;" style patterns, because if we truncate a relation
> that has already been marked for fsync-at-COMMIT, we don't need to WAL-log
> the truncation either.
>
>
> This is more invasive than I'd like to backpatch, but I think it's the
> simplest approach that works, and doesn't disable any of the important
> optimizations we have.
I didn't like it when I first read this, but I do now. As a by product of
fixing the bug it actually extends the optimization.
You can optimize this approach so we always write WAL unless one of the two
subid fields are set, so there is no need to call smgrIsSyncPending() every
time. I couldn't see where this depended upon wal_level, but I guess its
there somewhere.
I'm unhappy about the call during MarkBufferDirtyHint() which is just too
costly. The only way to do this cheaply is to specifically mark buffers as
being BM_WAL_SKIPPED, so they do not need to be hinted. That flag would be
removed when we flush the buffers for the relation.
>
> And what reason is there to think that this would fix all the problems?
>>>
>>
>> I don't think either suggested fix could be claimed to be a great
>> solution,
>> since there is little principle here, only heuristic. Heikki's solution
>> would be the only safe way, but is not backpatchable.
>>
>
> I can't get too excited about a half-fix that leaves you with data
> corruption in some scenarios.
>
On further consideration, it seems obvious that Andres' suggestion would
not work for UPDATE or DELETE, so I now agree.
It does seem a big thing to backpatch; alternative suggestions?
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Ildus Kurbangaliev | 2015-07-23 09:06:01 | Re: RFC: replace pg_stat_activity.waiting with something more descriptive |
Previous Message | Heikki Linnakangas | 2015-07-23 08:31:38 | Re: extend pgbench expressions with functions |