Re: DATA corruption after promoting slave to master

From: Karthik Iyer <karthik(dot)i(at)directi(dot)com>
To: Shaun Thomas <sthomas(at)optionshouse(dot)com>, pgsql-general(at)postgresql(dot)org
Cc: KiritSinh <kirit(dot)p(at)directi(dot)com>
Subject: Re: DATA corruption after promoting slave to master
Date: 2014-06-26 14:44:46
Message-ID: 53AC31DE.7030902@directi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thanks Shaun.

We reindexed all the primary and unique keys of all the tables, But we
did not reindex the tables. You think we should do that also ?

Also, do you think we should do a clean dump restore to eliminate all
data inconsistencies.

One more query :

We managed to get the old server up. We are trying to play the
difference in data by checking the log files(statement logs). You think
there is any other easy alternatives ?

Thanks in advance,

- Karthik

On Thursday 26 June 2014 01:09 AM, Shaun Thomas wrote:
> On 06/25/2014 06:29 AM, Karthik Iyer wrote:
>
>> [2] We also have a daemon process which copies the latest partial WAL
>> log file (which postgres is currently writing to, under pg_xlog/) every
>> 3 secs to a different location.
>
> No. No, no, no, no no. No.
>
> Also, no. Partial WAL files are not valid for recovery. In fact, I'm
> surprised the standby node even applied it at all.
>
>> We are seeing these problems in the newly promoted master now:
>>
>> 1. when we run queries on primary key, we don't get the rows even
>> if it exist in db. However if we force query not to use index, we get
>> those entries.
>> 2. there are duplicate values for primary keys
>
> This is no surprise. Your slave has partial data commits, which means
> your table, index, or possibly both, are corrupt.
>
> The first thing you need to do is back up any tables you've noticed
> are having this behavior. Second, try to reindex the tables that are
> having problems. The errors you are seeing are due to the data and
> indexes being out of sync. If you get an error that says the reindex
> fails due to duplicate values, you can do this:
>
> SET enable_indexscan TO false;
> SET enable_bitmapscan TO false;
>
> SELECT primary_key, count(1)
> FROM broken_table
> GROUP BY 1
> HAVING count(1) > 1;
>
> For any ID that comes back, do this:
>
> SELECT ctid, *
> FROM broken_table
> WHERE primary_key = [value(s) from above];
>
> Then you need to delete one of the bad rows after deciding which. Use
> the CTID of the row you want to delete:
>
> DELETE FROM broken_table
> WHERE ctid = 'whatever';
>
> Then reindex the table so the correct values are properly covered.
> Doing this for all of your corrupt tables may take a while depending
> on how many there are.
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Shaun Thomas 2014-06-26 14:49:31 Re: Fast data, slow data
Previous Message Merlin Moncure 2014-06-26 13:20:18 Re: Alternative to psql -c ?