Shortest offline window on database migration

From: Haroldo Kerry <hkerry(at)callix(dot)com(dot)br>
To: postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Shortest offline window on database migration
Date: 2019-05-30 15:08:04
Message-ID: CAHxH9rPuQaXz28Qkgpnp0-kfJgNFArjiba3P2m9iC3C7zBQ5_A@mail.gmail.com
Lists: pgsql-performance

Hello,

We are migrating our PostgreSQL 9.6.10 database (with streaming replication
active) to a faster disk array.
We are using this opportunity to enable checksums, so we will have to do a
full backup-restore.
The database is about 500GB. A full backup takes about 2h:30min, a full
restore with checksums enabled on the new array takes about 1h, and
recreating the replica on the old array takes another 2h.

Although all synthetic tests (pgbench) indicate the new disk array is
faster, we will only be 100% confident once we see its performance in
production, so our fallback plan is to keep our replica database on the
older array. If the new array performs poorly during production ramp-up, we
can switch to the replica with little impact on our customers.

The problem is that the offline window needed to back up the database,
restore it with checksums, and recreate the replica is about 5h:30m.

One idea we had to shorten the offline window was to restore the database
to both the master and the replica in parallel (of course, we would
configure the replica as a master to restore the database), which would
shave 1h off the total time. Although this is not documented, we assumed
that restoring the same database to identical servers would result in
binary-identical data files.
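
To make the arithmetic explicit, here is a small sketch. The backup, restore
and replica-recreation times are the ones quoted above. The 2h figure for
restoring onto the old array in parallel is an assumption (it reuses the
replica-recreation time), chosen only because it is consistent with the 1h
saving mentioned above. The function names are illustrative.

```python
# Durations in minutes, taken from the figures quoted in this message.
BACKUP = 150            # full backup: 2h:30min
RESTORE_NEW = 60        # restore with checksums on the new array: 1h
RECREATE_REPLICA = 120  # recreate the replica on the old array: 2h

# ASSUMPTION: a parallel restore onto the old array takes about as long
# as recreating the replica there. This number was not measured.
RESTORE_OLD_PARALLEL = 120


def sequential_window(backup, restore, recreate_replica):
    """Backup, then restore, then recreate the replica, one after another."""
    return backup + restore + recreate_replica


def parallel_window(backup, restore_new, restore_old):
    """Backup, then both restores at once; the slower restore dominates."""
    return backup + max(restore_new, restore_old)


def fmt(minutes):
    """Format a duration in minutes as e.g. '5h:30m'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h:{mins:02d}m"


if __name__ == "__main__":
    seq = sequential_window(BACKUP, RESTORE_NEW, RECREATE_REPLICA)
    par = parallel_window(BACKUP, RESTORE_NEW, RESTORE_OLD_PARALLEL)
    print("sequential:", fmt(seq))    # sequential: 5h:30m
    print("parallel:  ", fmt(par))    # parallel:   4h:30m
    print("saved:     ", fmt(seq - par))
```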

We tried this in the lab. Since this is not a kosher way to create a
replica, we ran a checksum comparison of all data files and found a lot of
differences. Bummer. Both master and replica worked (no errors in the logs),
but the binary differences in the data files left us unsure about this
path.
But in principle it should work, right?
Has anyone been through this type of problem?
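
For reference, a checksum comparison of this kind can be sketched as below.
This is a minimal illustration, not necessarily the exact procedure we ran:
it assumes both data directories are reachable as local paths, and the
function names and the choice of SHA-256 are illustrative.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_trees(left_root, right_root):
    """Return (only_left, only_right, differing) lists of relative paths."""
    left = tree_digests(left_root)
    right = tree_digests(right_root)
    only_left = sorted(left.keys() - right.keys())
    only_right = sorted(right.keys() - left.keys())
    differing = sorted(
        name for name in left.keys() & right.keys()
        if left[name] != right[name]
    )
    return only_left, only_right, differing
```

Calling compare_trees() on the two data directories (with both servers
stopped) lists the files present on only one side and the files whose
contents differ byte for byte.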

Regards,
Haroldo Kerry
