From: | Tambet Matiisen <tambet(dot)matiisen(at)gmail(dot)com> |
---|---|
To: | Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342 |
Date: | 2011-03-16 19:08:07 |
Message-ID: | 4D810A97.3000602@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
On 16.03.2011 17:09, Kevin Grittner wrote:
> Tambet Matiisen<tambet(dot)matiisen(at)gmail(dot)com> wrote:
>
>> Pre-live database is restored from live database dump every night.
>
> How is that done? A single pg_dump of the entire live database
> restored using psql? Are both database servers at the same
> PostgreSQL version?
Yes, I use pg_dump on live server and the result is rdiff-backupped into
development server. Whole SQL dump is 12G without compression and the
rdiff delta is about 10-20MB every day. Then I drop pre-live database on
development server and recreate it using createdb and psql.
For a while development server was running 8.4 and live server 8.1. Now
both are 8.4, but this shouldn't matter, as I do backup and restore via SQL.
>
>> So far the errors have been in pre-live database,
>
> You're running pg_dump against a database you just restored from a
> pg_dump image?
Hmm, yeah. This sounds rather dumb, but haven't got to that yet.
Development server contains some additional databases as well, that do
not exist on live server.
>
>> Usually the next day error was gone. I mostly blamed badly timed
>> backup and restore scripts, although this shouldn't result in
>> errors.
>
> No it shouldn't -- if you're following any of the documented backup
> and restore techniques. I have a suspicion that you're just doing a
> file copy without stopping the live database or properly following
> the documented PITR backup and recovery techniques.
No, I don't do any advanced backup tricks. Just plain pg_dump and psql.
>
> This time the error is not in pre-live database and therefore it
>> doesn't go away.
>
> If I understand you, this sounds like corruption in the live
> database; nothing on the pre-live database is part of causing this
> problem.
This would be the case when I do filesystem level copy, but I do not.
>
>> The server is also running [...] Samba [...]
>
> I hope you're not trusting Samba too far. For a while we were using
> it in backups across our WAN, and it mangled at least one file
> almost every day. We had to take to running md5sum against both
> ends for each file to ensure we didn't get garbage (until we
> converted everything to use TCP communications, which have never
> mangled anything for us).
As I said, I'm using rdiff-backup to transfer pure SQL files.
>
>> Both fsync and full_page_writes are on.
>
> Good. Without those an OS or hardware crash can corrupt your
> database.
Actually they are commented out, but I suppose this means "on".
>
>> OK, I don't have UPS for this machine, but power has been stable.
>> Current uptime is 32 days, which I bet is from the last kernel
>> update.
>
> OK. A power outage wouldn't be too likely to matter if you have
> fsync and full_page_writes on.
That's a relief :).
>
>> Currently I blame either faulty memory or faulty software RAID
>> driver. I can easily eliminate the memory cause by running
>> memtest86 for few hours
>
> Is this ECC memory? If not, even a good test doesn't prove that a
> RAM problem didn't cause the corruption.
It's not ECC memory.
>
>> Now, off to buy UPS...
>
> Not a bad idea, but it doesn't sound like lack of that is likely to
> have caused the corruption in your live database, based on the
> settings you mentioned. (Assuming those settings are in use on the
> live server.)
Checked live server, it has also fsync=on and full_page_writes=on. But
it shouldn't matter, because backup of live server doesn't give any errors.
It is possible, that restore of pre-live database using psql lasts so
long, that backup of the same database using pg_dump is already kicking
in. But again, this shouldn't matter and it doesn't explain why the last
error is in another database, that hasn't changed for months.
Now I have to find time to run memtest.
Tambet
From | Date | Subject | |
---|---|---|---|
Next Message | Kevin Grittner | 2011-03-16 20:29:35 | Re: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342 |
Previous Message | Robert Brewer | 2011-03-16 18:14:31 | Re: SELECT '(1, nan, 3)'::cube; |