Re: WAL logging problem in 9.4.3?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-06 15:14:40
Message-ID: 27532.1436195680@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> On Sat, Jul 4, 2015 at 2:26 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> We actually used to use a different relfilenode, but optimized that
>> away: cab9a0656c36739f59277b34fea8ab9438395869
>>
>> commit cab9a0656c36739f59277b34fea8ab9438395869
>> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>> Date: Sun Aug 23 19:23:41 2009 +0000
>>
>> Make TRUNCATE do truncate-in-place when processing a relation that was created
>> or previously truncated in the current (sub)transaction. This is safe since
>> if the (sub)transaction later rolls back, we'd just discard the rel's current
>> physical file anyway. This avoids unreasonable growth in the number of
>> transient files when a relation is repeatedly truncated. Per a performance
>> gripe a couple weeks ago from Todd Cook.
>>
>> to me the reasoning here looks flawed.

> Before this commit, when I ran the above test scenario, no data loss happened.

Actually I think what is broken here is COPY's test to decide whether it
can omit writing WAL:

* Check to see if we can avoid writing WAL
*
* If archive logging/streaming is not enabled *and* either
* - table was created in same transaction as this COPY
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
* If it does commit, we'll have done the heap_sync at the bottom of this
* routine first.

The problem with that analysis is that it supposes that, if we crash and
recover, the WAL replay sequence will not touch the data. What's killing
us in this example is the replay of the TRUNCATE, but that is not the only
possibility. For example consider this modification of Fujii-san's test
case:

BEGIN;
CREATE TABLE test (i int primary key);
INSERT INTO test VALUES(-1);
\copy test from /tmp/num.csv with csv
COMMIT;
SELECT COUNT(*) FROM test;

The COUNT() correctly says 11 rows, but after crash-and-recover,
only the row with -1 is there. This is because the INSERT writes
out an INSERT+INIT WAL record, which we happily replay, clobbering
the data added later by COPY.

We might have to give up on this COPY optimization :-(. I'm not
sure what would be a safe rule for deciding that we can skip WAL
logging in this situation, but I am pretty sure that it would
require keeping information we don't currently keep about what's
happened earlier in the transaction.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-07-06 15:21:23 Re: WAL logging problem in 9.4.3?
Previous Message Andres Freund 2015-07-06 15:01:44 Re: ERROR: unexpected data beyond EOF