Re: emergency outage requiring database restart

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-10-26 17:43:29
Message-ID: CAHyXU0z7O7TqXOk9mj7A9KxT6r5aczjCBZQbvgP7vsrjfDeH6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> Confirmation of problem re-occurrence will come in a few days. I'm
>> much more likely to believe 6+sigma occurrence (storage, freak bug,
>> etc) should it prove the problem goes away post rebuild.
>
> ok, no major reported outage yet, but just got:
>
> 2016-10-26 11:27:55 CDT [postgres(at)castaging]: ERROR: invalid page in
> block 12 of relation base/203883/1259

*) still on 9.5.2 :( not in control of the package update process

*) getting more of these:
[root(at)rcdylsdbmpf001 data]# cat
/var/lib/pgsql/9.5/data_checksum/pg_log/postgresql-26.log | grep
"invalid page"
2016-10-26 11:26:42 CDT [postgres(at)castaging]: ERROR: invalid page in
block 10 of relation base/203883/1259
2016-10-26 11:27:55 CDT [postgres(at)castaging]: ERROR: invalid page in
block 12 of relation base/203883/1259
2016-10-26 12:16:44 CDT [postgres(at)castaging]: ERROR: invalid page in
block 13 of relation base/203883/1259
2016-10-26 12:18:58 CDT [postgres(at)castaging]: ERROR: invalid page in
block 15 of relation base/203883/1259
2016-10-26 12:19:12 CDT [postgres(at)castaging]: ERROR: invalid page in
block 7 of relation base/203883/2662

castaging=# select relname from pg_class where oid in(1259, 2662);
relname
────────────────────
pg_class_oid_index
pg_class

*) only one database in the cluster is damaged. I do not suspect
storage at this point

*) Currently I can execute pg_dump with no problems. So far impact
is not severe but soon I have to do dump/restore

*) age(xmin) from pg_class gives all reasonable values (approx 757k)
that aren't 2147483647

*) All cases of statement executing getting this error in a pl/pgsql
routine that does the following:
BEGIN
...
<several postgres_fdw querying same 1 record 1 column tabel>

DROP TABLE IF EXISTS foo;
CREATE TEMP TABLE foo ON COMMIT DROP AS ...
SELECT <sqsh function push sql script to sql sever> ... INTO Result;

DROP TABLE IF EXISTS bar;
CREATE TEMP TABLE bar ON COMMIT DROP AS ...

PERFORM <sqsh function removing script>;

EXCEPTION
WHEN OTHERS THEN
PERFORM <sqsh function removing script>;
RAISE;
END;

*) page verification miss is numerically close
[root(at)rcdylsdbmpf001 ~]# cat
/var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page
verification"
2016-10-26 11:26:42 CDT [postgres(at)castaging]: WARNING: page
verification failed, calculated checksum 37251 but expected 37244
2016-10-26 11:27:55 CDT [postgres(at)castaging]: WARNING: page
verification failed, calculated checksum 37249 but expected 37244
2016-10-26 12:16:44 CDT [postgres(at)castaging]: WARNING: page
verification failed, calculated checksum 44363 but expected 44364
2016-10-26 12:18:58 CDT [postgres(at)castaging]: WARNING: page
verification failed, calculated checksum 49525 but expected 49539
2016-10-26 12:19:12 CDT [postgres(at)castaging]: WARNING: page
verification failed, calculated checksum 37345 but expected 37340

*) sample log leading into the first error (all queries are logged)
2016-10-26 11:26:40 CDT [postgres(at)castaging]: LOG: execute <unnamed>:
SELECT PushMarketSample($1::TEXT) AS published
2016-10-26 11:26:40 CDT [postgres(at)castaging]: DETAIL: parameters: $1 = '8840'
2016-10-26 11:26:40 CDT [postgres(at)mpf2]: LOG: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
2016-10-26 11:26:40 CDT [postgres(at)mpf2]: LOG: statement: SAVEPOINT s2
2016-10-26 11:26:40 CDT [postgres(at)mpf2]: LOG: execute <unnamed>:
DECLARE c1 CURSOR FOR
SELECT period FROM public.dataentryperiod
2016-10-26 11:26:40 CDT [postgres(at)mpf2]: LOG: statement: FETCH 100 FROM c1
2016-10-26 11:26:41 CDT [postgres(at)mpf2]: LOG: statement: CLOSE c1
2016-10-26 11:26:41 CDT [postgres(at)mpf2]: LOG: execute <unnamed>:
DECLARE c2 CURSOR FOR
SELECT period FROM public.dataentryperiod
2016-10-26 11:26:41 CDT [postgres(at)mpf2]: LOG: statement: FETCH 100 FROM c2
2016-10-26 11:26:41 CDT [postgres(at)mpf2]: LOG: statement: CLOSE c2
2016-10-26 11:26:41 CDT [rms(at)mpf2]: ERROR: relation "tblagent" does
not exist at character 15
2016-10-26 11:26:41 CDT [rms(at)mpf2]: STATEMENT: select * from tblagent
2016-10-26 11:26:41 CDT [rms(at)mpf2]: LOG: execute <unnamed>: SELECT
CURRENT_SCHEMA()
2016-10-26 11:26:42 CDT [postgres(at)mpf2]: LOG: execute <unnamed>:
DECLARE c3 CURSOR FOR
SELECT period FROM public.dataentryperiod
2016-10-26 11:26:42 CDT [postgres(at)mpf2]: LOG: statement: FETCH 100 FROM c3
2016-10-26 11:26:42 CDT [postgres(at)mpf2]: LOG: statement: CLOSE c3
2016-10-26 11:26:42 CDT [postgres(at)mpf2]: LOG: execute <unnamed>:
DECLARE c4 CURSOR FOR
SELECT period FROM public.dataentryperiod
2016-10-26 11:26:42 CDT [postgres(at)mpf2]: LOG: statement: FETCH 100 FROM c4
2016-10-26 11:26:42 CDT [postgres(at)mpf2]: LOG: statement: CLOSE c4
2016-10-26 11:26:42 CDT [postgres(at)castaging]: WARNING: page
verification failed, calculated checksum 37251 but expected 37244
2016-10-26 11:26:42 CDT [postgres(at)castaging]: CONTEXT: SQL statement
"CREATE TEMP TABLE ApartmentPublishBySubmkt ON COMMIT DROP AS
SELECT
ApartmentId,
TotalUnits,
SubmarketId,
SUM(Rent * UnitCount) / NULLIF(SUM(UnitCount), 0) AS Rent,
SUM(Occupancy * UnitCount) / NULLIF(SUM(UnitCount), 0) AS Occupancy
FROM AptSample
GROUP BY 1, 2, 3"
PL/pgSQL function pushmarketsample(text,date,integer) line 103
at SQL statement
2016-10-26 11:26:42 CDT [postgres(at)mpf2]: LOG: statement: ROLLBACK TO
SAVEPOINT s2; RELEASE SAVEPOINT s2
2016-10-26 11:26:42 CDT [postgres(at)castaging]: ERROR: invalid page in
block 10 of relation base/203883/1259
2016-10-26 11:26:42 CDT [postgres(at)castaging]: CONTEXT: SQL statement
"CREATE TEMP TABLE ApartmentPublishBySubmkt ON COMMIT DROP AS
SELECT
ApartmentId,
TotalUnits,
SubmarketId,
SUM(Rent * UnitCount) / NULLIF(SUM(UnitCount), 0) AS Rent,

merlin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-10-26 17:49:58 Re: Improving RLS planning
Previous Message Robert Haas 2016-10-26 17:39:01 Re: Improving RLS planning