From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alex F <phoedos16(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16833: postgresql 13.1 process crash every hour
Date: 2021-05-17 18:17:12
Message-ID: CAH2-Wzn2p__Y+u0PsjXobMBUsmnFfWHuPzR3NQB2MRa_nP7Pww@mail.gmail.com
Lists: pgsql-bugs
On Mon, May 17, 2021 at 2:30 AM Alex F <phoedos16(at)gmail(dot)com> wrote:
> Is it possible to extend the error log so that it helps to understand what exactly went wrong?
> For example, if the error log looked like this:
> 2021-05-14 06:10:54 UTC [22258]: user=,db=,app=,client= LOG: server process (PID 22273) was terminated by signal 11: Segmentation fault
> 2021-05-14 06:10:54 UTC [22258]: user=,db=,app=,client= DETAIL: Failed process was running: REFRESH MATERIALIZED VIEW CONCURRENTLY project.product_master_mv
> ***CAUSED BY violated for index "name_original_idx_s"***
> e.g. a trace like the one marked with *** symbols could really help the user understand the issue's root cause and significantly decrease database recovery time.
> In my case I had to create a separate VM, create the database from scratch, and recover it from pg_dump. Unfortunately those steps meant significant downtime.
Once the database is corrupt it's more or less impossible to provide
hard guarantees about anything. We can only try our best to avoid the
worst consequences, such as a hard crash. This is guided by practical
experience and feedback from users. While this failure is clearly very
unfriendly, there is no getting around the fact that the real problem
began before there was any crash or error. Perhaps *long* before the
first crash, even.
> In the case of a master-standby configuration, WAL replication does not save standby servers from broken objects (a broken index in the described case).
> Please advise: is it possible to use logical replication here? From my understanding, logical replication shouldn't push broken objects to the standby.
Unfortunately there are no simple answers. There is no reason to
believe that the corruption is limited to the indexes. I'd even say
that it's unlikely to be limited to indexes. What we see from amcheck
looks very much like storage level inconsistencies.
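As an aside, a quick way to see how far the index corruption goes is to
run amcheck against every B-tree index, along these lines (an
illustrative sketch only, not something from your report; run it as a
superuser on the affected database, and expect an ERROR at the first
corrupt index it finds):

    -- amcheck ships in contrib with PostgreSQL 13
    CREATE EXTENSION IF NOT EXISTS amcheck;

    -- heapallindexed => true also verifies that every heap tuple has a
    -- matching index entry, which is the kind of index/heap disagreement
    -- that seems to be involved here
    SELECT c.relname AS index_name,
           bt_index_check(index => c.oid, heapallindexed => true)
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
    JOIN pg_am am ON am.oid = c.relam
    WHERE am.amname = 'btree'
      AND i.indisvalid
      AND c.relpersistence = 'p';  -- skip temporary and unlogged indexes

That only covers the indexes, though; it says nothing about the heap
itself.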
You'll need to do some kind of root cause analysis, so that you can
find the underlying issue and systematically eliminate it.
--
Peter Geoghegan