From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Merlin Moncure <mmoncure(at)gmail(dot)com>,Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>,Bruce Momjian <bruce(at)momjian(dot)us>,PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: emergency outage requiring database restart |
Date: | 2016-10-26 18:34:30 |
Message-ID: | 147977C4-6107-47C6-9628-475EC6263E2C@anarazel.de |
Lists: | pgsql-hackers |
On October 26, 2016 8:57:22 PM GMT+03:00, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>On Wed, Oct 26, 2016 at 12:43 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>>> Confirmation of problem re-occurrence will come in a few days. I'm
>>>> much more likely to believe 6+sigma occurrence (storage, freak bug,
>>>> etc) should it prove the problem goes away post rebuild.
>>>
>>> ok, no major reported outage yet, but just got:
>>>
>>> 2016-10-26 11:27:55 CDT [postgres(at)castaging]: ERROR: invalid page in block 12 of relation base/203883/1259
>
>*) I've now strongly correlated this routine with the damage.
>
>[root(at)rcdylsdbmpf001 ~]# cat /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep -i pushmarketsample | head -5
>2016-10-26 11:26:27 CDT [postgres(at)castaging]: LOG: execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
>2016-10-26 11:26:40 CDT [postgres(at)castaging]: LOG: execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
>PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
>PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
>2016-10-26 11:26:42 CDT [postgres(at)castaging]: STATEMENT: SELECT PushMarketSample($1::TEXT) AS published
>
>*) First invocation was 11:26:27 CDT
>
>*) Second invocation was 11:26:40 and gave checksum error (as noted
>earlier 11:26:42)
>
>*) Routine attached (if interested)
>
>My next step is to set up test environment and jam this routine
>aggressively to see what happens.
Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly be one way to get into odd corruption issues.
We probably really should use O_CLOEXEC for the majority of our file handles.
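
To make the inheritance concern concrete, here is a minimal sketch in Python (not PostgreSQL code, and not a claim about what plsh actually does). It assumes a POSIX system; the helper name `child_can_clobber`, the scratch file `datafile`, and the tiny child program are all hypothetical. The child stands in for an external script that writes to a descriptor number it was never explicitly handed. With close-on-exec cleared the write lands in the parent's file; with it set, the descriptor is gone after exec.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical child program standing in for an external script: it blindly
# writes to a descriptor number it was never explicitly given.
CHILD = (
    "import os, sys\n"
    "try:\n"
    "    os.write(int(sys.argv[1]), b'XXXX')\n"
    "except OSError:\n"
    "    pass\n"
)


def child_can_clobber(close_on_exec: bool) -> bool:
    """Return True if a spawned child managed to overwrite the parent's file."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "datafile")
        with open(path, "wb") as f:
            f.write(b"page")

        # Python's os.open() already creates non-inheritable descriptors;
        # O_CLOEXEC is spelled out here only to mirror the C idiom.
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
        try:
            # Clearing the flag models a descriptor opened without O_CLOEXEC.
            os.set_inheritable(fd, not close_on_exec)
            # close_fds=False lets descriptors follow their inheritable flag.
            subprocess.run(
                [sys.executable, "-c", CHILD, str(fd)],
                close_fds=False,
                check=True,
            )
        finally:
            os.close(fd)

        with open(path, "rb") as f:
            return f.read() != b"page"


if __name__ == "__main__":
    print("without close-on-exec:", child_can_clobber(close_on_exec=False))
    print("with close-on-exec:   ", child_can_clobber(close_on_exec=True))
```

The sketch only shows the mechanism: any descriptor left inheritable is writable by whatever the backend execs, which is why setting the flag at open time is the safer default.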
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.