Re: PANIC during VACUUM

From: German Becker <german(dot)becker(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>, "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: PANIC during VACUUM
Date: 2013-04-30 12:26:07
Message-ID: CALyjCLvoWybyrcmvH65J=rOYPp1bO6fUbMSBEOUjh8mQd3uv1Q@mail.gmail.com
Lists: pgsql-admin

OK, I apologise for the lack of clarity in my first message. Let
me summarize the steps that led me to the error.
I have 2 servers running Ubuntu 12.04 on which I am testing Postgres 9.1.9.
I set up streaming replication between them (asynchronous, not synchronous).
Both servers have 4 SATA hard drives with ext3 file systems, set up as follows:

sda --> / (main OS and the database files, except for the ones listed below)
sdb --> pg_xlog directory
sdc --> one tablespace where heavy transaction tables are stored
sdd --> another tablespace where big historic tables are stored

Archive mode is on and the archive location is on sda (and from there to the
hot-standby server).
For testing, I populate the database with the data currently in production
(currently on Postgres 8.3).
Then I run several load tests.
To tune the archiving process I needed to generate a large
amount of WAL. To do so I just deleted the contents of one big table and
then VACUUMed it, like this:

DELETE FROM bigtable;
VACUUM bigtable;

And I got the error I reported.
I repeated the whole process (creating a new cluster, populating it with
data, always the same data, and setting up replication) a couple of times
after that and I hit the error again about 90% of the time. I tried
deleting a big portion of the table and the error did not appear. It
only appears after deleting ALL rows. Also, in some cases I didn't run the
VACUUM command manually, and the error occurred during autovacuum.
My last test, in case there was a hardware problem on the primary, was to
trigger the standby server and try the vacuum there, with the same results.
Here is a chunk of the log:

2013-04-29 17:02:21 ART [12024]: [32-1] PANIC: XX001: corrupted item pointer: offset = 8128, size = 80
2013-04-29 17:02:21 ART [12024]: [33-1] LOCATION: PageIndexMultiDelete, bufpage.c:779
2013-04-29 17:02:21 ART [12024]: [34-1] STATEMENT: VACUUM callshopcdrs ;
2013-04-29 17:02:21 ART [23787]: [8-1] LOG: server process (PID 12024) was terminated by signal 6: Aborted
2013-04-29 17:02:21 ART [23787]: [9-1] LOG: terminating any other active server processes
2013-04-29 17:02:21 ART [7300]: [2-1] WARNING: terminating connection because of crash of another server process
2013-04-29 17:02:21 ART [7300]: [3-1] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2013-04-29 17:02:21 ART [7300]: [4-1] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2013-04-29 17:02:21 ART [30304]: [1-1] FATAL: the database system is in recovery mode
2013-04-29 17:02:21 ART [23787]: [10-1] LOG: archiver process (PID 7301) exited with exit code 1
2013-04-29 17:02:21 ART [23787]: [11-1] LOG: all server processes terminated; reinitializing
2013-04-29 17:02:21 ART [30305]: [1-1] LOG: database system was interrupted; last known up at 2013-04-29 16:59:01 ART
2013-04-29 17:02:21 ART [30305]: [2-1] LOG: database system was not properly shut down; automatic recovery in progress
2013-04-29 17:02:21 ART [30305]: [3-1] LOG: redo starts at 11/497D4338
2013-04-29 17:02:21 ART [30305]: [4-1] LOG: invalid magic number 0000 in log file 17, segment 73, offset 8216576
2013-04-29 17:02:21 ART [30305]: [5-1] LOG: redo done at 11/497D4440
2013-04-29 17:02:22 ART [30308]: [1-1] LOG: autovacuum launcher started
2013-04-29 17:02:22 ART [23787]: [12-1] LOG: database system is ready to accept connections
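
A side note on the recovery messages in the log: the "invalid magic number" line gives its position as log file / segment / offset, while "redo done at" uses the X/X notation, so the two are hard to compare by eye. The sketch below converts one into the other. It assumes the default 16 MB WAL segment size and the default 8 kB WAL page size (neither is stated in this message), and the function names are made up for illustration.

```python
# Assumptions (not stated in the log): default 16 MB WAL segments and
# default 8 kB WAL pages, with the (log file, segment, offset) numbering
# that appears in the "invalid magic number" message.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
WAL_PAGE_SIZE = 8192


def to_lsn(log_id, segment, offset):
    """Format (log file, segment, byte offset) as an 'X/X' WAL location."""
    return "%X/%X" % (log_id, segment * WAL_SEGMENT_SIZE + offset)


def next_page_start(lsn):
    """Return the start of the WAL page following the one containing lsn."""
    high, low = lsn.split("/")
    page_index = int(low, 16) // WAL_PAGE_SIZE
    return "%s/%X" % (high, (page_index + 1) * WAL_PAGE_SIZE)


# Position reported by "invalid magic number 0000 in log file 17, ..."
print(to_lsn(17, 73, 8216576))         # 11/497D6000

# First page after the one holding the "redo done at" record
print(next_page_start("11/497D4440"))  # 11/497D6000
```

Under those assumptions both values come out as 11/497D6000, i.e. the page with the invalid magic number is the one right after the last replayed record. That would be consistent with recovery simply reaching the end of the written WAL, although this is an interpretation and not something the log states.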

A core file was generated; it is 7 GB in size:

$ file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'postgres: postgres tvoip3 [local] VACUUM'

Many thanks for your help, and let me know if any extra information would
be useful.

--

German

On Tue, Apr 30, 2013 at 8:51 AM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:

> [please don't top-post]
>
> German Becker <german(dot)becker(at)gmail(dot)com> wrote:
> > Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at> wrote:
> >> German Becker wrote:
>
> >>> I am testing version 9.1.9 before putting it in production. One
> >>> of my tests involved deleting the contents of a big table ( ~
> >>> 13 GB size) and then VACUUMing it. During VACUUM PANICS.
>
> >> If you mess with the database files, errors like this are to be
> >> expected.
>
> > Thanks for your reply. In which sense did I mess with the
> > database files?
>
> You didn't say how you deleted the contents of that big table, and
> it appears that Albe assumed you deleted or truncated the
> underlying disk file rather than using the DELETE or TRUNCATE SQL
> statement.
>
> In any event, more details would help people come up with ideas on
> what might be wrong.
>
> http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
>
> --
> Kevin Grittner
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
