Re: Next steps in debugging database storage problems?

From: Terry Schmitt <terry(dot)schmitt(at)gmail(dot)com>
To: Jacob Bunk Nielsen <jacob(at)bunk(dot)cc>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Next steps in debugging database storage problems?
Date: 2014-08-15 17:00:51
Message-ID: CAOOcysyd6ur=sqmNBiWRO9V6yDKSGGNOxESc-MS70tWHuLTUuw@mail.gmail.com
Lists: pgsql-general

I can't offer a whole lot of detail at this point, but I experienced a
pretty bad caching issue about 2 years ago using XFS.

We were migrating a 1TB+ Oracle database to EDB's Advanced Server 9.1
(close enough to PostgreSQL for this discussion). I normally use ext4, but
decided to try XFS for this build-out.
This was a Red Hat 6.x system using a NetApp SAN for storage. We extensively
leverage FlexClones for creating production "read-only" instances as well
as our development and testing environments. We take snapshots of the
running database storage and create FlexClones from them. The newly cloned
database does a quick recovery on startup and away it goes. This has worked
perfectly with ext4 for years.

The problem I experienced with XFS appeared when I started up a new clone
for the first time. We would start getting various block read errors when
accessing tables and indexes, and knew the database was totally unreliable
at that point. It was super painful to troubleshoot: I could recreate the
problem consistently, but it took a couple of days of loading data and some
creative scripts to do so.

NetApp snapshots are consistent and reliable. It was clear that the data
on disk did not match the data cached by the OS and/or XFS. We worked with
Red Hat, but never arrived at a solution. I finally gave up and switched
back to ext4, and the problem went away.
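
For anyone wanting to test the "disk does not match cache" theory, one
option is to compare the same relation file block by block as seen through
two different views, for example the running clone's mount and a second,
freshly mounted clone of the same snapshot. The following is only a rough
Python sketch of that idea, not what we actually ran. The function names
are made up for illustration, and the 8192-byte block size is the
PostgreSQL default, assumed here.

```python
import hashlib

BLOCK_SIZE = 8192  # PostgreSQL's default block size (assumed)


def block_digests(path, block_size=BLOCK_SIZE):
    """Return one SHA-256 hex digest per block of the file at `path`."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests


def differing_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Return the block numbers at which the two files differ.

    Blocks that exist in only one of the files (a length mismatch)
    are reported as differing too.
    """
    a = block_digests(path_a, block_size)
    b = block_digests(path_b, block_size)
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    diffs.extend(range(min(len(a), len(b)), max(len(a), len(b))))
    return diffs
```

Both copies would need to be quiescent (database stopped) for a difference
to mean anything, since a running server keeps rewriting blocks.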

T

On Fri, Aug 15, 2014 at 12:23 AM, Jacob Bunk Nielsen <jacob(at)bunk(dot)cc> wrote:

> Hi
>
> On the 1st of July 2014 Jacob Bunk Nielsen <jacob(at)bunk(dot)cc> wrote:
>
> > We have a PostgreSQL 9.3.4 running in an LXC container on Debian
> > Wheezy on a Linux 3.10.43 kernel on a Dell R620 server. Data are
> > stored on an XFS file system. We are seeing problems such as:
> >
> > unexpected data beyond EOF in block 2 of relation base/805208133/1238511128
> >
> > and
> >
> > could not read block 5 in file "base/805208348/1259338118": read only
> > 0 of 8192 bytes
> >
> > This seems to occur every few days after the server has been up for
> > 30-40 days. If we reboot the server it'll be another 30-40 days before
> > we see any problems again.
> >
> > The server has been running fine on a Dell R710 for a long time, and was
> > upgraded to a Dell R620 last year, when the problems started. We have
> > tried switching to a different Dell R620, but that did not make a
> > difference. We've seen this with kernels 3.2, 3.4 and 3.10.
>
> This time it took 45 days before this happened:
>
> LOG: unexpected EOF on standby connection
> ERROR: unexpected data beyond EOF in block 140 of relation
> base/805208885/805209852
> HINT: This has been seen to occur with buggy kernels; consider updating
> your system.
>
> It always happens with small tables with lots of inserts and deletes.
> From previous experience we know that it's now going to happen again in
> a few days, so we'll probably try to schedule a reboot to give us
> another 30-40 days.
>
> Is anyone else seeing problems with PostgreSQL on XFS filesystems?
>
> Any hints on how to debug what goes wrong here would still be greatly
> appreciated.
>
> > We have multiple other PostgreSQL servers running in a similar setup
> > without causing any problems, but this server is probably the busiest of
> > our PostgreSQL servers.
>
> This is still the case.
>
> Best regards
>
> Jacob

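A note on the arithmetic behind the second quoted error: with the default
8192-byte block size, block 5 starts at byte offset 5 * 8192 = 40960, so
"read only 0 of 8192 bytes" means the read at that offset hit end of file,
i.e. the file looked no longer than 40960 bytes to that read. A small
Python sketch for checking what the file system reports for a relation
file is below. The helper names are hypothetical, and it assumes the
default block size and that the file is the relation's first segment.

```python
import os

BLOCK_SIZE = 8192  # PostgreSQL's default block size (assumed)


def block_offset(block_no, block_size=BLOCK_SIZE):
    """Byte offset at which block `block_no` starts within its file."""
    return block_no * block_size


def blocks_in_file(path, block_size=BLOCK_SIZE):
    """Return (whole_blocks, trailing_bytes) based on the size from stat()."""
    size = os.stat(path).st_size
    return divmod(size, block_size)


def can_read_block(path, block_no, block_size=BLOCK_SIZE):
    """True if the file is long enough to contain all of block `block_no`."""
    whole_blocks, _ = blocks_in_file(path, block_size)
    return block_no < whole_blocks
```

Running this against the file named in the error, before and after a
reboot, would show whether the size reported by the file system changes
while the server believes the relation is longer.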