Re: Database corruption event, unlockable rows, possibly bogus virtual xids? (-1/4444444444)

From: Ned Wolpert <ned(dot)wolpert(at)imemories(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: Database corruption event, unlockable rows, possibly bogus virtual xids? (-1/4444444444)
Date: 2013-02-21 19:47:34
Message-ID: CAFehBFmHXFr_x9xK8F2YeNVmMNKAAoMxSRV==Ce2b0n0s0j7bw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

My bad, I'll re-post this to pgsql-general with more data on my running
environment....

On Thu, Feb 21, 2013 at 11:16 AM, Ned Wolpert <ned(dot)wolpert(at)imemories(dot)com>wrote:

> Folks-
>
> I'm doing a postmortem on a corruption event we had. I have an idea on
> what happened, but not sure. I figure I'd share what happened and see if
> I'm close to right here.
>
> Event: Running 9.1.6 with hot-standby, archiving 4 months of wal files,
> and even a nightly pg_dump all. 50G database. Trying to update or delete a
> row in a small (21 row, but heavily used table) would lock up completely.
> Never finish. Removed all clients, restarted the db instance, no joy. Check
> pg_stat_activity, and nothing that wasn't idle.... run the delete, locked
> up.
>
> Ran (SELECT*FROM pg_class JOIN pg_locks ON pg_locks.relation =
> pg_class.oid;) with no clients touching this row, fresh restart of the db,
> and saw virtualtransactions against this same table where the values would
> be -1/nnnnn were nnnnn was a huge number. Turns out we had about 159
> entries from different tables in the database. Checked hot-standby and, of
> course, no locks or anything. Switched to hot-standby.
>
> Hot-standby instantly gained these locks, Also noticed that 2 prepared
> transactions migrated to the hot-standby. Binary upgraded to 9.1.8, locks
> still existed. Ended up executing the one thing we knew would work. Take
> the site down, pg_dumpall to fresh instance. Everything is fine.
>
> A little more background: We were running 9.1.4 back when 9.1.6 came out.
> We saw there was possible corruption issues and did a binary upgrade and
> reindexing. Everything seemed to be fine, but I wonder if we really had
> problems back then. We rebuilt the hot-standby after the binary upgrade via
> normal restore and wal-file replays. I should also note that this row that
> had the lock on it that would not go away, was created by an app server
> that was killed (via -9) since it was non-responsive, and the row 'create
> date' (in db and logs) is the exact time the app server was killed.
>
> I was wondering if a) these virtualtransactions that start with '-1/'
> indicate a problem, b) if this could have happened from pre 9.1.6
> corruption that was fixed in 9.1.6. Or, could this have occurred when we
> killed that app server? Or.... am I looking in the wrong place.
>
> I do still have the old data directories so I can start them up and check
> out the dataset. Any advice?
>
> --
> Virtually, Ned Wolpert
>
> "Settle thy studies, Faustus, and begin..." --Marlowe
>

--
Virtually, Ned Wolpert

"Settle thy studies, Faustus, and begin..." --Marlowe

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Albe Laurenz 2013-02-22 08:15:50 Re: Database corruption event, unlockable rows, possibly bogus virtual xids? (-1/4444444444)
Previous Message Kevin Grittner 2013-02-21 18:23:25 Re: Pg 9.1 master-slave replication