From: | Amit Kapila <amit(dot)kapila(at)huawei(dot)com> |
---|---|
To: | "'Merlin Moncure'" <mmoncure(at)gmail(dot)com> |
Cc: | "'Ben Chobot'" <bench(at)silentmedia(dot)com>, "'PostgreSQL General'" <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: bug, bad memory, or bad disk? |
Date: | 2013-02-15 14:42:35 |
Message-ID: | 00a701ce0b8a$b22b2f10$16818d30$@kapila@huawei.com |
Lists: | pgsql-general |
> -----Original Message-----
> From: Merlin Moncure [mailto:mmoncure(at)gmail(dot)com]
> Sent: Friday, February 15, 2013 7:56 PM
> To: Amit Kapila
> Cc: Ben Chobot; PostgreSQL General
> Subject: Re: [GENERAL] bug, bad memory, or bad disk?
>
> On Fri, Feb 15, 2013 at 8:08 AM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
> wrote:
> > On Friday, February 15, 2013 1:33 AM Ben Chobot wrote:
> >
> >> 2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1] ERROR: invalid memory alloc request size 1968078400
> >> 2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1] ERROR: invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
> >> 2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1] ERROR: could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> >> 2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1] ERROR: could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> >> 2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1] ERROR: could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> >> 2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1] ERROR: invalid memory alloc request size 1968078400
> >> 2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1] ERROR: could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> >> 2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1] ERROR: invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
> >> 2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1] ERROR: invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945
> >
> >> There didn't seem to be much correlation to which files were affected, and
> >> this was a critical server, so once we realized a simple reindex wasn't
> >> going to solve things, we shut it down and brought up a slave as the new
> >> master db.
> >
> >> While that seemed to fix these issues, we soon noticed problems with
> >> missing clog files. The missing clogs were outside the range of the
> >> existing clogs, so we tried using dummy clog files. It didn't help, and
> >> running pg_check we found that one block of one table was definitely
> >> corrupt. Worse, that corruption had spread to all our replicas.
> >
> > Can you check whether the corrupted block is from one of the relations
> > mentioned in your errors? This is just to reconfirm.
> >
> >> I know this is a little sparse on details, but my questions are:
> >
> >> 1. What kind of fault should I be looking to fix? Because it spread to all
> >> the replicas, both those that stream and those that replicate by replaying
> >> wals in the wal archive, I assume it's not a storage issue. (My
> >> understanding is that streaming replicas stream their changes from memory,
> >> not from wals.)
> >
> > Streaming replication streams its changes from the WAL.
>
> Yeah. This smells like disk corruption to me, but it really could be
> anything. Unfortunately it can spread to the replicas especially if
> you're not timely about taking the master down. page checksums (a
> proposed feature) are a way of dealing with this problem.
Yes, that could be one of the reasons.

One more thing I observed is that the block number being accessed in the file
is very large (see the log message below). Is it really feasible to have such
a large block number?
2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1] ERROR:
could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1"
(target block 3936767042): No such file or directory
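
Just to put that number in perspective, here is a rough back-of-the-envelope
check (my own sketch, assuming the default 8 kB block size and 1 GB relation
segments; not something from the original report):

# Sketch: how large a relation would have to be to contain that block,
# and which segment file the block would live in.
BLCKSZ = 8192          # default PostgreSQL block size
RELSEG_SIZE = 131072   # blocks per 1 GB segment file (default build)

target_block = 3936767042
print(target_block * BLCKSZ / 2**40)    # ~29.3 TiB implied relation size
print(target_block // RELSEG_SIZE)      # segment ~30035 would hold this block

So a target block of ~3.9 billion would imply a relation of roughly 29 TB,
yet the error only refers to segment .1. To me that suggests the block number
itself is garbage (possibly read from a corrupted pointer) rather than a
genuinely missing segment.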
Also, about the first error message:
"invalid memory alloc request size 1968078400"
Why, and for which operation, would such a large allocation be requested?
I am not sure whether there is any chance of memory corruption here.
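
For reference, that message comes from palloc's sanity check: any single
allocation request larger than MaxAllocSize (0x3fffffff, just under 1 GB) is
rejected with exactly this error. A tiny illustration of the check (my own
sketch, only the constant is taken from the PostgreSQL sources):

MAX_ALLOC_SIZE = 0x3FFFFFFF        # MaxAllocSize, 1073741823 bytes
request = 1968078400               # size from the log message
print(request > MAX_ALLOC_SIZE)    # True -> "invalid memory alloc request size"

As far as I know, one common way to hit it is reading a garbage length word
(for example a corrupted varlena header) from a damaged page, which would fit
the corruption theory.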
> The biggest issue is the missing clog files -- did you have more than
> one replica? Were they missing on all of them?
Could it be that the replica simply has fewer clog files? I am not sure
whether that can happen in normal scenarios.
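
For what it's worth, the mapping from transaction IDs to clog segment files is
straightforward, so it should be possible to check whether the "missing"
segments correspond to XIDs the cluster ever reached. A small sketch (my own,
assuming the default 8 kB page size, i.e. 32768 transaction status entries per
page and 32 pages per segment file):

CLOG_XACTS_PER_PAGE = 32768        # 2 status bits per xact on an 8 kB page
SLRU_PAGES_PER_SEGMENT = 32        # pages per pg_clog segment file

def clog_segment_for_xid(xid):
    pageno = xid // CLOG_XACTS_PER_PAGE
    return "%04X" % (pageno // SLRU_PAGES_PER_SEGMENT)

print(clog_segment_for_xid(123456789))   # hypothetical XID -> '0075'
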
With Regards,
Amit Kapila.