From: | Sean Laurent <sean(at)studyblue(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections |
Date: | 2011-10-06 17:21:52 |
Message-ID: | CAK=aZ=k+QSGZFCE8SX8-KbgYDJZy+-5ebmFs3aTLZEkSBb3LQw@mail.gmail.com |
Lists: | pgsql-general |
We've been running into a particularly strange problem that I'm trying to
better understand. The short version: when I run a backup during periods of
higher load, our application servers lose their connections to the database
and fail to reconnect.
Here's an overview of the setup:
- PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running
CentOS 5.6
- 8 disk RAID-0 array of EBS volumes used for primary data storage
- 4 disk RAID-0 array of EBS volumes used for transaction logs
- Root partition is ext3
- RAID arrays are xfs
Backups are taken using a script that runs the following workflow:
- Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
- Run "xfs_freeze -f" to freeze the primary RAID array
- Tell Amazon to take a snapshot of each EBS volume in the primary array
- Run "xfs_freeze -u" to thaw the primary RAID array
- Run "xfs_freeze -f" to freeze the transaction log RAID array
- Tell Amazon to take a snapshot of each EBS volume in the transaction log array
- Run "xfs_freeze -u" to thaw the transaction log RAID array
- Tell Postgres the backup is finished: SELECT pg_stop_backup();
- Remove old WAL files
The whole process takes roughly 7 seconds on average. The RAID arrays are
frozen for roughly 2 seconds on average.
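For reference, the workflow above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the `snapshot` callable, the mount points, and the volume IDs are hypothetical placeholders, `psql` is assumed to connect with default settings, and the "Remove old WAL files" step is left out because its details are not described here. The `try`/`finally` blocks are an added assumption so that a failed snapshot still thaws the filesystem and still ends the backup.

```python
import subprocess
from typing import Callable, Iterable, Sequence, Tuple

# A runner executes one command line; subprocess.check_call raises on failure.
Runner = Callable[[Sequence[str]], object]
# Hypothetical hook that asks Amazon to snapshot one EBS volume by ID.
Snapshotter = Callable[[str], object]


def psql(sql: str) -> list:
    """Build a psql command line that runs a single SQL statement."""
    return ["psql", "-c", sql]


def snapshot_frozen(run: Runner, snapshot: Snapshotter,
                    mount_point: str, volume_ids: Iterable[str]) -> None:
    """Freeze one XFS mount, snapshot its volumes, and always thaw it."""
    run(["xfs_freeze", "-f", mount_point])
    try:
        for volume_id in volume_ids:
            snapshot(volume_id)
    finally:
        # Thaw even if a snapshot call failed, so writes are not left blocked.
        run(["xfs_freeze", "-u", mount_point])


def run_backup(snapshot: Snapshotter,
               arrays: Sequence[Tuple[str, Sequence[str]]],
               run: Runner = subprocess.check_call) -> None:
    """Run the backup over (mount_point, volume_ids) pairs, in order."""
    run(psql("SELECT pg_start_backup('RAID backup');"))
    try:
        for mount_point, volume_ids in arrays:
            snapshot_frozen(run, snapshot, mount_point, volume_ids)
    finally:
        # End backup mode even if freezing or snapshotting failed.
        run(psql("SELECT pg_stop_backup();"))
```

Passing a `run` function that only records its argument gives a dry run, which makes it easy to check the order of the freeze, snapshot, and thaw steps without touching the filesystem.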
Within a few seconds of the backup, our application servers start throwing
exceptions that indicate the database connection was closed. Meanwhile,
Postgres still shows the connections and we start seeing a really high
number (for us) of locks in the database. The application servers refuse to
recover and must be killed and restarted. Once they're killed off, the
connections actually go away and the locks disappear.
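One way to put numbers on the lock buildup is to summarize the `pg_locks` view while the problem is happening. The sketch below is only an illustration under assumptions: the helper names and the database name passed in are hypothetical, and it relies on psql's unaligned, tuples-only output (`-A -t`), which separates fields with `|` and prints booleans as `t` or `f`.

```python
def lock_summary_command(dbname: str) -> list:
    """Build a psql command that counts locks by mode and granted state."""
    sql = ("SELECT mode, granted, count(*) FROM pg_locks "
           "GROUP BY mode, granted ORDER BY 3 DESC;")
    return ["psql", "-A", "-t", "-d", dbname, "-c", sql]


def parse_lock_summary(output: str) -> list:
    """Parse 'mode|granted|count' lines into (mode, granted, count) tuples."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines in the output
        mode, granted, count = line.split("|")
        rows.append((mode, granted == "t", int(count)))
    return rows
```

Running the command shortly before and shortly after a backup, and comparing the ungranted counts, would show whether requests are piling up behind a lock that is never released.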
What's particularly weird is that this doesn't happen all the time. The
backups were running every hour, but we have only seen the app servers crash
5-10 times over the course of a month.
Has anyone encountered anything like this? Do any of these steps have
ramifications that I'm not considering? Especially something that might
explain the app server failure?
Thanks.
Sean Laurent
Director of Operations
StudyBlue, Inc.