| From: | Novák, Petr <novakp(at)avast(dot)com> |
|---|---|
| To: | pgsql-bugs(at)postgresql(dot)org |
| Subject: | Data corruption after restarting replica |
| Date: | 2015-02-10 11:49:13 |
| Message-ID: | CA+eEC0rZKgyWmruX-eOtrmqexYZ-PEDHyREOjby_LKP-p2h_RA@mail.gmail.com |
| Lists: | pgsql-bugs pgsql-general |
Hi all,
we're experiencing data corruption after switching a streaming replica to
primary.
This is not the first time I've encountered this issue, so I'll try to
describe it in more detail.
For this particular cluster we have 6 servers in two datacenters (3 in
each). There are two instances running on each server, each with its own
port and datadir. On the first two servers in each datacenter, one instance
is a primary and the other is a replica of the primary on the other server.
The third server holds two offsite replicas from the other datacenter (for
DR purposes).
Each replica was set up by taking a base backup from the primary
(pg_basebackup -h <hostname> -p 5430 -D /data2/basebackup -P -v -U <user> -x -c fast).
Then the directories from initdb were replaced with the ones from the base
backup (only the configuration files remained), and the replica was started
and successfully connected to the primary. It ran with no problems, keeping
up with the primary. We were experiencing some connection problems between
the two datacenters, but replication didn't break.
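For reference, the standby configuration implied by the steps above could look roughly like the sketch below. The message does not show the recovery.conf that was actually used, so the rendered contents are an assumption based on the two standard streaming-replication parameters in 9.1 (`standby_mode` and `primary_conninfo`). `render_recovery_conf` is a hypothetical helper, and the host and user values are placeholders.

```python
def render_recovery_conf(host, port, user):
    """Return recovery.conf text for a 9.1-style streaming standby.

    Assumption: the real file is not shown in the message; this only
    illustrates the two standard parameters such a setup needs.
    Values containing spaces or quotes would need extra escaping,
    which this sketch does not handle.
    """
    conninfo = "host={} port={} user={}".format(host, port, user)
    lines = [
        "standby_mode = 'on'",
        "primary_conninfo = '{}'".format(conninfo),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder host and user; 5430 is the port from the pg_basebackup call.
    print(render_recovery_conf("primary.example.com", 5430, "replication_user"))
```

The file would be placed in the replica's data directory before the replica is started.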
Then we needed to take one datacenter offline due to hardware maintenance.
So I shut the applications down, verified that no more clients were
connected to the primary, then shut the primary down and restarted the
replica without recovery.conf. The applications were then started against
the new db with no problem. The other replica even successfully reconnected
to this new primary.
A few hours after the switch, lines that had not appeared before showed up
in the server log, indicating corruption:
ERROR: index "account_username_key" contains unexpected zero page at block 1112135
ERROR: right sibling's left-link doesn't match: block 476354 links to 1062443 instead of expected 250322 in index "account_pkey"
...and many more, reporting corruption in several other indexes.
The issue was resolved by creating new indexes and dropping the affected
ones, although there were already some duplicates in the data that had to
be resolved, as some of the indexes were unique.
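To get an overview of which indexes are affected, the index names can be pulled out of the server log. The following is only a minimal sketch: it assumes each error message sits on a single log line in the form quoted above, and it emits `REINDEX INDEX` statements as a stand-in for the create-new/drop-old approach described here. The helper names are hypothetical.

```python
import re

# Matches the quoted index name in messages such as:
#   ERROR: index "account_username_key" contains unexpected zero page ...
#   ERROR: ... instead of expected 250322 in index "account_pkey"
INDEX_NAME = re.compile(r'index "([^"]+)"')


def affected_indexes(log_lines):
    """Return the sorted, de-duplicated index names found in ERROR lines."""
    names = set()
    for line in log_lines:
        if "ERROR:" not in line:
            continue  # skip LOG, WARNING and other non-error lines
        names.update(INDEX_NAME.findall(line))
    return sorted(names)


def reindex_statements(index_names):
    """Build one REINDEX statement per index name."""
    return ['REINDEX INDEX "{}";'.format(name) for name in index_names]
```

Rebuilding a unique index fails while duplicate rows remain, so the duplicates have to be cleaned up before the statements for unique indexes can succeed.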
This particular case uses Postgres 9.1.14 on both primary and replica, but
I've experienced similar behavior on 9.2.9. The OS is CentOS 6.6 in all cases.
This may mean that something is wrong with our configuration or the
replication setup steps, but I've set up another instance using the
same steps with no problem.
Fsync-related settings are at their defaults. Data directories are on RAID10
arrays with BBUs. The filesystem is ext4, mounted with the nobarrier option.
The database is fairly large (~120GB), with several 50mil+ row tables, lots
of indexes and FK constraints. It is mostly read; updates/inserts/deletes
amount to only several rows/s.
Any help will be appreciated.
Petr Novak
System Engineer
Avast s.r.o.