Re: Memory Errors

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Sam Nelson <samn(at)consistentstate(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Memory Errors
Date: 2010-09-08 20:32:36
Message-ID: AANLkTi=E8itOJPzs-3kCADm+e3ww9J29GSjQ=zO4auJa@mail.gmail.com
Lists: pgsql-general

On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson <samn(at)consistentstate(dot)com> wrote:
> It figures I'd have an idea right after posting to the mailing list.
> Yeah, running COPY foo TO stdout; gets me a list of data before erroring
> out, so I did a copy (select * from foo order by id asc) to stdout; to see
> if I could make some kind of guess as to whether this was related to a
> single row or something else.
> I got the id of the last row the COPY TO command was able to grab normally
> and tried to figure out the next id.  The following started to make me think
> along the lines of some kind of bad corruption (even before getting responses
> that agreed with that):
> Assuming that the last id copied was 1500:
> 1) select * from foo where id = (select min(id) from foo where id > 1500);
> Results in 0 rows
> 2) select min(id) from foo where id > 1500;
> Results in, for example, 200000
> 3) select max(id) from foo where id > 1500;
> Results in, for example, 90000 (a much lower number than returned by min)
> 4) select id from foo where id > 1500 order by id asc limit 10;
> Results in (for example):
> 200000
> 202000
> 210273
> 220980
> 15005
> 15102
> 15104
> 15110
> 15111
> 15113
> So ... yes, it seems that those four ids are somehow part of the problem.
> They're on Amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
> either), so memtest isn't available, but no new corruption has cropped up
> since they stopped killing the waiting queries (I just double-checked: they
> were getting corrupted rows constantly, and we haven't gotten one since that
> script stopped killing queries).

That's actually a startling indictment of EC2 -- how exactly were you
killing your queries? You say this is repeatable? What's your
setting of full_page_writes?
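
You can check that last one with:

show full_page_writes;

Also, the min() > max() inconsistency you're seeing is what you'd
expect from a corrupt btree index rather than (or on top of) heap
corruption. One quick way to test that theory -- just a sketch, run
in a throwaway session -- is to force sequential scans and rerun the
query:

set enable_indexscan = off;
set enable_bitmapscan = off;
select min(id), max(id) from foo where id > 1500;

If the numbers come back sane with the index out of the picture, a
reindex of foo's id index may clear that part of it up.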

One way to identify and potentially nuke bad records of this kind is
to do something like:

select length(field1) from foo order by 1 desc limit 5;

where field1 is the first varlena field (text, bytea, etc.) in
left-to-right order. Look for bogusly high values and move on to the
next field if you don't find any. Once you hit a bad value, try
deleting the record by its key.
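
If you want the offending keys along with the suspect lengths in one
pass, something like this works (a sketch -- it assumes foo's key
column is id and that field1 is the varlena column under test):

select id, length(field1) as len
from foo
order by len desc nulls last
limit 5;

-- then, once a row is clearly bogus:
-- delete from foo where id = <bad id>;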

Once you've found and deleted them all, immediately pull off a dump,
then rebuild the table. As always, take a filesystem-level backup
before doing this type of surgery...
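
For the dump, something along these lines (database and file names
are placeholders):

pg_dump -Fc -t foo yourdb > foo.dump

and for the rebuild, one option is to copy the surviving rows into a
fresh table and swap it in:

begin;
create table foo_new as select * from foo;
-- recreate foo's indexes and constraints on foo_new here
alter table foo rename to foo_old;
alter table foo_new rename to foo;
commit;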

merlin
