From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Accidental removal of a file causing various problems |
Date: | 2018-08-24 18:46:39 |
Message-ID: | 23318.1535136399@sss.pgh.pa.us |
Lists: | pgsql-hackers |
Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
> 1. The user soon found out that they can no longer connect to any database
> in the cluster. Not just the one to which the affected table belonged, but
> no other database in the cluster. The affected table is a regular user
> table (actually a toast table).
Please define "can no longer connect". What happened *exactly*?
How long did it take to start failing like that (was this perhaps a
shutdown-because-of-impending-wraparound situation)?
> 2. So they restarted the database server. While that fixed the connection
> problem, they started seeing toast errors on the table to which the missing
> file belonged. The missing file was recreated at the database restart,
> but of course it was filled in with all zeroes, causing data corruption.
Doesn't seem exactly surprising, if some toast data went missing.
> 3. To make things worse, the corruption then got propagated to the standbys
> too. We don't know if the original file removal was replicated to the
> standby, but it seems unlikely.
This is certainly unsurprising.
> I've a test case that reproduces all of these effects if a backend file is
> forcefully removed,
Let's see it.
Note that this:
> WARNING: could not write block 27094010 of base/56972584/56980980
> DETAIL: Multiple failures --- write error might be permanent.
> ERROR: could not open file "base/56972584/56980980.69" (target block
> 27094010): previous segment is only 12641 blocks
> CONTEXT: writing block 27094010 of relation base/56972584/56980980
does not say that the .69 file is missing. It says that .68 (or, maybe,
some even-earlier segment) was smaller than 1GB, which is a different
matter. Still data corruption, but I don't think I believe it was a
stray "rm".
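
To make the segment arithmetic behind that reading concrete, here is a minimal sketch. It assumes the default build settings of 8 kB blocks and 1 GB segments, which the message does not state; the helper names are hypothetical and are not PostgreSQL functions.

```python
# Assumed defaults (not stated in the message): 8 kB blocks, 1 GB segments.
BLCKSZ = 8192
RELSEG_SIZE = (1024 ** 3) // BLCKSZ  # 131072 blocks per full segment


def locate_block(blkno):
    """Return (segment number, block offset within that segment)."""
    return divmod(blkno, RELSEG_SIZE)


def segment_file(relpath, segno):
    """Segment 0 has no suffix; later segments are named relpath.N."""
    return relpath if segno == 0 else f"{relpath}.{segno}"


def short_segments(sizes_in_blocks):
    """Given per-segment block counts in file order, return the indexes of
    non-final segments that hold fewer blocks than a full segment."""
    return [i for i, n in enumerate(sizes_in_blocks[:-1]) if n < RELSEG_SIZE]


seg, off = locate_block(27094010)
print(segment_file("base/56972584/56980980", seg), off)
```

Under those assumptions a full segment holds 131072 blocks, so a predecessor reported as "only 12641 blocks" is far short of full, and the target block itself would map to a segment well beyond .69. Both observations fit a truncated earlier segment better than a missing .69 file, though a non-default block or segment size would change the numbers.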
Oh, and what PG version are we talking about?
regards, tom lane