Re: [bug fix] PITR corrupts the database cluster

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>,pgsql-hackers(at)postgresql(dot)org
Subject: Re: [bug fix] PITR corrupts the database cluster
Date: 2013-07-24 13:15:59
Message-ID: 8ab4ca23-d18e-4dcd-91f3-684161574c8c@email.android.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>On 2013-07-24 15:45:52 +0300, Heikki Linnakangas wrote:
>> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >On 2013-07-24 12:59:43 +0200, Andres Freund wrote:
>> >> > <Approach 2>
>> >> What we imo could do would be to drop the tablespaces in a
>*separate*
>> >> transaction *after* the transaction that removed the pg_tablespace
>> >> entry. Then an "incomplete actions" logic similar to btree and gin
>> >could
>> >> be used to remove the database directory if we crashed between the
>> >two
>> >> transactions.
>> >>
>> >> SO:
>> >> TXN1 does:
>> >> * remove catalog entries
>> >> * drop buffers
>> >> * XLogInsert(XLOG_DBASE_DROP_BEGIN)
>> >>
>> >> TXN2:
>> >> * remove_dbtablespaces
>> >> * XLogInsert(XLOG_DBASE_DROP_FINISH)
>> >>
>> >> The RM_DBASE_ID resource manager would then grow a rm_cleanup
>> >callback
>> >> (which would perform TXN2 if we failed inbetween) and a
>> >> rm_safe_restartpoint which would prevent restartpoints from
>occuring
>> >on
>> >> standby between both.
>> >>
>> >> The same should probably done for CREATE DATABASE because that
>> >currently
>> >> can result in partially copied databases lying around.
>> >
>> >And CREATE/DROP TABLESPACE.
>> >
>> >Not really related, but CREATE DATABASE's implementation makes me
>itch
>> >everytime I read parts of it...
>>
>> I've been hoping that we could get rid of the rm_cleanup mechanism
>entirely. I eliminated it for gist a while back, and I've been thinking
>of doing the same for gin and btree. The way it works currently is
>buggy - while we have rm_safe_restartpoint to avoid creating a
>restartpoint at a bad moment, there is nothing to stop you from running
>a checkpoint while incomplete actions are pending. It's possible that
>there are page locks or something that prevent it in practice, but it
>feels shaky.
>>
>> So I'd prefer a solution that doesn't rely on rm_cleanup.
>Piggybacking on commit record seems ok to me, though if we're going to
>have a lot of different things to attach there, maybe we need to
>generalize it somehow. Like, allow resource managers to attach
>arbitrary payload to the commit record, and provide a new
>rm_redo_commit function to replay them.
>
>The problem is that piggybacking on the commit record doesn't really
>fix
>the problem that we end up with a bad state if we crash in a bad
>moment.
>
>For CREATE DATABASE you will have to copy the template database
>*before*
>you commit the pg_database insert. Which means if we abort before that
>we have old data in the datadir.
>
>For DROP DATABASE, without something like incomplete actions,
>piggybacking on the commit record doesn't solve the issue of
>CHECKPOINTS
>either, because the commit record you piggybacked on could have
>committed before a checkpoint, while you still were busy deleting all
>the files.

That's no different from CREATE TABLE / INDEX and DROP TABLE / INDEX. E.g. If you crash after CREATE TABLE but before COMMIT, the file is leaked. But it's just a waste of space, everything still works.

It would be nice to fix that leak, for tables and indexes too...

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tim Kane 2013-07-24 13:22:44 Re: Suggestion for concurrent index creation using a single full scan operation
Previous Message Andres Freund 2013-07-24 13:05:30 Re: [bug fix] PITR corrupts the database cluster