Re: WAL scenario valid?

From: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "prakhar jauhari *EXTERN*" <prak840(at)gmail(dot)com>, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>
Cc: Keith Ouellette <Keith(dot)Ouellette(at)airgas(dot)com>, "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: WAL scenario valid?
Date: 2013-06-19 07:41:05
Message-ID: A737B7A37273E048B164557ADEF4A58B17BB4276@ntex2010a.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

prakhar jauhari wrote:
> I am facing a similar kind of problem, but in a two node setup. Streaming replication is being used
> with a high wal_keep_segments, using log shipping to bump up timelines on the standby server to setup
> SR(streaming replication).
> DB1 - master
>
> DB2 - standby
>
>
> When a switchover happens DB2 becomes the new master and when DB1 comes up it will act as the standby
> to the new master (.history files from new master are copied to DB1 to bump up its timeline so as to
> setup SR). DB1 is not recreated from scratch. This runs fine in normal switchover, but there seems to
> be problem in the following situation, leading to database corruption:
>
>
> Current state :
> DB1 - master
> DB2 - standby
>
>
> Now the failing scenario:
>
>
>
> 1. DB2 machine goes down.
>
> 2. After some time DB1 machine also goes down (DB2 is still down).
>
> 3. Now DB2 comes up (it will join the cluster as master as DB1 is still down).
>
> 4. DB2 is started as master postgresql.
>
> 5. Now DB1 comes up (it will join the cluster as standby to DB2)
>
> 6. Now when DB1 attempts to SR with DB2, DB1 gets corrupted.
>
>
> Looked into the issue and found that when DB1 went down initially, it created some WAL's which were
> not synced to DB2 as it was already down.
>
> Now when DB2 started as master it still had not played the last few WAL's created by DB1(when it was
> master). DB2 starts as master properly.
>
> When DB1 came as standby to DB2, it bumped it timeline using history file from DB2, but when SR was
> setup with DB2, DB1 gets corrupted.
>
>
> Now the question is:
>
>
> 1. Is this a theoretically valid approach?
>
> 2. If it is a valid approach, then how can i detect such a scenario (where SR will corrupt the DB)? So
> that i can go for a basebackup in such situation.

If you want to use the old primary as new standby without a new backup,
you have to ascertain that all transactions from the former have
been replayed at the latter.

To figure out where the primary currently is, you can
SELECT pg_current_xlog_location();

To figure how much the standby has replayed, you can
SELECT pg_last_xlog_replay_location();

Of course this only works if both are up.

I think that it would be tricky to automatize that; I'd choose
making a new backup after each failover.

In the event of a controlled failover it might be an option.

I am surprised that the scenario you described leads to
corruption; I would have expected an error message.

Yours,
Laurenz Albe

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message girish R G peetle 2013-06-19 08:32:16 pg_stop_backup is not archiving latest transaction log from pg_xlog directory
Previous Message prakhar jauhari 2013-06-19 06:20:14 Re: WAL scenario valid?