Re: emergency outage requiring database restart

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-10-20 21:53:42
Message-ID: CAHyXU0y5ukBJ8U5naYiMm5nD1QF9B=LXfKPo0Ttn2WSCsmUPRg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Oct 20, 2016 at 3:16 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Merlin Moncure wrote:
>
>> single user mode dumps core :(
>>
>> bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
>> LOG: 00000: could not change directory to "/root": Permission denied
>> LOCATION: resolve_symlinks, exec.c:293
>> Segmentation fault (core dumped)
>>
>> Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
>> Program terminated with signal 11, Segmentation fault.
>> #0 0x0000000000797d6f in ?? ()
>> Missing separate debuginfos, use: debuginfo-install
>> postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
>> (gdb) bt
>> #0 0x0000000000797d6f in ?? ()
>> #1 0x000000000079acf1 in RelationCacheInitializePhase3 ()
>> #2 0x00000000007b35c5 in InitPostgres ()
>> #3 0x00000000006b9b53 in PostgresMain ()
>> #4 0x00000000005f30fb in main ()
>
> Maybe
> rm global/pg_internal.init
> and try again?

Will do when I can do that had to do emergency restore + some unfun
data reconstruction from the query log.

Notably there is a much larger database in the same cluster which is
undamaged. This server is new to production usage, maybe 2 months.

Here is contents of pg_extension
plpgsql
dblink
hstore
postgres_fdw
plsh * not used
pg_trgm * not used
plr * not used
tablefunc * not used
adminpack * not used
plpythonu * not used
postgis * not used
postgis_topology * not used

Short term plan is to separate the database to it's own cluster,
install replication and checksums. All queries to this database are
logged. Here is the contents of the log leading into and after the
the crash:

oct 17 crash:
2016-10-17 12:12:24 CDT [rms(at)castaging]: DETAIL: parameters: $1 =
'21121', $2 = '8', $3 = '2016-10-13', $4 = NULL, $5 = NULL, $6 = NULL,
$7 = NULL, $8 = NULL, $9 = NULL, $10 = NULL, $11 = 't', $12
2016-10-17 12:12:24 CDT [rms(at)castaging]: LOG: execute <unnamed>:
SELECT NULL AS PROCEDURE_CAT, n.nspname AS PROCEDURE_SCHEM, p.proname
AS PROCEDURE_NAME, NULL, NULL, NULL, d.description AS REMARKS
2016-10-17 12:12:24 CDT [rms(at)castaging]: LOG: execute <unnamed>:
SELECT n.nspname,p.proname,p.prorettype,p.proargtypes,
t.typtype,t.typrelid , p.proargnames, p.proargmodes, p.proallargtypes
, p.o
2016-10-17 12:12:24 CDT [rms(at)castaging]: LOG: execute <unnamed>:
select * from checkin($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11,
$12) as result
2016-10-17 12:12:24 CDT [rms(at)castaging]: DETAIL: parameters: $1 =
'114333', $2 = 'rrosillo', $3 = 'CALLER', $4 = 'Survey', $5 = 'Happy',
$6 = 'Callback', $7 = 'OTHER', $8 = '2016-10-18 01:05:00',
2016-10-17 12:12:24 CDT [rms(at)castaging]: LOG: execute S_3: COMMIT
2016-10-17 12:12:25 CDT [(at)]: ERROR: could not open relation with OID
1203933 <-- first sign of damage
2016-10-17 12:12:25 CDT [(at)]: CONTEXT: automatic analyze of table
"castaging.public.apartment"

oct 20 crash:
2016-10-20 12:46:38 CDT [postgres(at)castaging]: LOG: statement: SELECT
CallsByUser() AS byuser
2016-10-20 12:46:40 CDT [postgres(at)castaging]: LOG: statement: SELECT
CallCenterOverviewJSON() AS overview
2016-10-20 12:46:41 CDT [postgres(at)castaging]: LOG: statement: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:41 CDT [postgres(at)castaging]: LOG: statement: SELECT
MarketOverviewJSON() AS market
2016-10-20 12:46:42 CDT [postgres(at)castaging]: LOG: execute <unnamed>:
SELECT SubMarketOverviewJSON($1::TEXT) AS submkt
2016-10-20 12:46:42 CDT [postgres(at)castaging]: DETAIL: parameters: $1 = '640'
2016-10-20 12:46:44 CDT [postgres(at)castaging]: LOG: statement: SELECT
CallsByUser() AS byuser
2016-10-20 12:46:46 CDT [postgres(at)castaging]: LOG: statement: SELECT
CallCenterOverviewJSON() AS overview
2016-10-20 12:46:47 CDT [postgres(at)castaging]: LOG: statement: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres(at)castaging]: ERROR:
"pg_description_o_c_o_index" is an index <-- first sign of damage
2016-10-20 12:46:47 CDT [postgres(at)castaging]: CONTEXT: SQL function
"callcenterusertrackingjson" during startup
2016-10-20 12:46:47 CDT [postgres(at)castaging]: STATEMENT: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres(at)castaging]: WARNING: leaking
still-referenced relcache entry for "pg_class_oid_index"

CallCenterUserTrackingJSON() and friends are not particularly
interesting except that they are making use of of json_agg(). They
were also called basically all day long in 5 second intervals. I
guess this isn't saying very much, but I'm starting to smell a rat
here.

merlin

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-10-20 23:20:07 Re: Improve output of BitmapAnd EXPLAIN ANALYZE
Previous Message Tom Lane 2016-10-20 21:22:21 Re: WIP: Fix invalid XML explain plans for track_io_timing