Re: Corrupted Data ?

From: Melvin Davidson <melvin6925(at)gmail(dot)com>
To: Ioana Danes <ioanadanes(at)gmail(dot)com>
Cc: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Corrupted Data ?
Date: 2016-08-12 13:36:18
Message-ID: CANu8FiyRVyk7SsjHW=k995Nj5ryoFS9UfOhFx4pML9fHMOx0Kg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Fri, Aug 12, 2016 at 9:09 AM, Ioana Danes <ioanadanes(at)gmail(dot)com> wrote:

> Hello Everyone,
>
> I have new information on this case. I also open a post for Bucardo
> because I am still not sure what triggers this problem.
>
> The problem happened again on the same table but on another field. Few
> days ago I started a fourth database called drdb that is a PITR slave from
> db3.
>
> DB1, DB2, DRDB:
>
> -[ RECORD 1 ]-------+--------------------------
> gameplayid | 324455566
> systemuserid | 515151
> gameid | 41
> transactionid | 666556533
> drawid | *318220*
> ....
>
> DB3:
> -[ RECORD 1 ]-------+--------------------------
> gameplayid | 32445556
> systemuserid | 515151
> gameid | 41
> transactionid | 666556533
> drawid | *318216*
>
>
>
> Here are the facts I know:
>
> August 10 @ 11:10
> - The record was created on db1 and replicated to db2 and db3
> August 11 @ 2:30
> - db1, db2 and db3 are in sync (I have a script that compares the data
> for all 3 dbs every night @ 2:30 am)
> August 12 @ 2:30
> - db3 is out of sync because of this field (drawid)
> - drdb (which is PITRed from db3) is in sync with db1 and db2?????
>
>
> Because drdb (PITR slave from db3) is in sync with db1 and db2 and because
> the base backup was taken before the record in case was created, I believe
> that the xlogs are fine and I have a data kind of corruption on db3 on the
> data file for that table that happened after August 11 @ 2:30 (because the
> compare script found the dbs in sync)...
>
> Also the index is correct on db3 as the record in case (with drawid =
> *318216*) is retrieved if I filter by drawid = *318220*
>
> Any help is greatly appreciated,
>
> Thank you
>
>
> On Mon, Aug 8, 2016 at 1:25 PM, Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
> wrote:
>
>> On 08/08/2016 10:06 AM, Ioana Danes wrote:
>>
>>>
>>>
>>> On Mon, Aug 8, 2016 at 12:55 PM, Adrian Klaver> 75315811???
>>>
>>>
>>> Corrupted index on db3?
>>>
>>> yes
>>>
>>>
>>> Might want to look in the db3 logs to see if anything pops out.
>>>
>>> I checked the logs, no traces of errors or corruption.
>>>
>>>
>>> I just do not know enough about Burcardo to be of much help beyond
>>> that.
>>>
>>>
>>>
>>> it is trigger based, it saves the ids of the inserted record in a delta
>>> table and then on sync it creates copy commands to the slave. Even if
>>> there is a bug or corruption in that process I don't see how that
>>> corrupts the index on db3...
>>>
>>
>> It seems to do more then that:
>>
>> https://bucardo.org/wiki/Bucardo/Documentation/Overview
>>
>> That is why I suggested the post to the Burcardo list. Folks there will
>> have a better idea what goes under the hood.
>>
>> There is also this from a previous post:
>>
>> "Only one master is active at one time the other one is in stand by that
>> is a topic for another discussion but in our case that works well."
>>
>> Have no idea how that interaction plays out.
>>
>> At this point what I see is:
>>
>> 1) Data is entered on a master and is correct there.
>>
>> 2) Data is replicated to a single standby from one of two possible
>> sources via Bucardo and is no longer correct.
>>
>> 3) Now Bucardo uses Postgres to do its work so it is possible that
>> something in Postgres is at fault. Still the fact that the data is good on
>> the master but not in the standby tends to indicate that the act of
>> replication is the issue.
>>
>> 4) Exactly how that replication is accomplished is not obvious to me.
>>
>>
>>
>>
>>>
>>> So it is either replication bug + index corruption on db3 or data
>>> corruption on db3...
>>>
>>>
>>> In response to Melvin, the query returns no rows:
>>>
>>>
>>> SELECT n.nspname,
>>> i.relname,
>>> i.indexrelname,
>>> CASE WHEN idx.indisprimary
>>> THEN 'pkey'
>>> WHEN idx.indisunique
>>> THEN 'uidx'
>>> ELSE 'idx'
>>> END AS type,
>>> 'INVALID'
>>> FROM pg_stat_all_indexes i
>>> JOIN pg_class c ON (c.oid = i.relid)
>>> JOIN pg_namespace n ON (n.oid = c.relnamespace)
>>> JOIN pg_index idx ON (idx.indexrelid = i.indexrelid )
>>> WHERE idx.indisvalid = FALSE
>>> ORDER BY 1, 2;
>>>
>>> nspname | relname | indexrelname | type | ?column?
>>> ---------+---------+--------------+------+----------
>>> (0 rows)
>>>
>>>
>>> Thank you for your thoughts,
>>> ioana
>>>
>>>
>>>
>>> Thanks,
>>> ioana
>>>
>>>
>>>
>>> --
>>> Adrian Klaver
>>> adrian(dot)klaver(at)aklaver(dot)com <mailto:adrian(dot)klaver(at)aklaver(dot)com>
>>> <mailto:adrian(dot)klaver(at)aklaver(dot)com
>>> <mailto:adrian(dot)klaver(at)aklaver(dot)com>>
>>>
>>>
>>>
>>>
>>> --
>>> Adrian Klaver
>>> adrian(dot)klaver(at)aklaver(dot)com <mailto:adrian(dot)klaver(at)aklaver(dot)com>
>>>
>>>
>>>
>>
>> --
>> Adrian Klaver
>> adrian(dot)klaver(at)aklaver(dot)com
>>
>
>
Just out of curiosity, have you tried to REINDEX any of the indexes that
might be involved on db3?
Even though PostgreSQL considers them valid, it's possible that their
pointers could be incorrect.

Additionally, I would compare the EXPLAIN query on all 3 db's and see if
there is any difference between the three.

--
*Melvin Davidson*
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Francisco Olarte 2016-08-12 13:36:46 Re: Corrupted Data ?
Previous Message Ioana Danes 2016-08-12 13:09:33 Re: Corrupted Data ?