Re: Why corruption memory in one database affects all the cluster?

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Ru Devel <rudevel(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Why corruption memory in one database affects all the cluster?
Date: 2014-07-14 20:20:06
Message-ID: CAMkU=1wAfm4UTGCwwydhPX-X0DkHvPXrAZ4JXDMdnrMSFC30sg@mail.gmail.com
Lists: pgsql-general

On Sun, Jul 13, 2014 at 12:07 PM, Ru Devel <rudevel(at)gmail(dot)com> wrote:

> Hello,
>
> I have PostgreSQL 9.3.4 running on Linux, with ~20 databases in the cluster.
>
> The whole cluster was migrated from 9.2 using pg_upgradecluster.
>
> After the migration, autovacuum started to fail in one database, causing
> the entire cluster to crash:
>
>
> 2014-07-13 21:16:24 MSK [5665]: [1-1] db=,user= PANIC: corrupted item
> pointer: offset = 5292, size = 24
> 2014-07-13 21:16:24 MSK [29131]: [417-1] db=,user= LOG: server process
> (PID 5665) was terminated by signal 6: Aborted
> 2014-07-13 21:16:24 MSK [29131]: [418-1] db=,user= DETAIL: Failed process
> was running: autovacuum: VACUUM public.postfix_stat0 (to prevent wraparound)
> 2014-07-13 21:16:24 MSK [29131]: [419-1] db=,user= LOG: terminating any
> other active server processes
> 2014-07-13 21:16:24 MSK [29597]: [1-1] db=,user= WARNING: terminating
> connection because of crash of another server process
>
> I have two questions:
>
> 1) Why, when there is a problem with only one database, in only one spot
> of memory, do we get a whole-server problem? The affected database is not
> important, but the corruption inside it leads to frequent cluster-wide
> restarts, so the whole server suffers from this local problem. Why should
> the postmaster restart all backends when only one dies?
>

In general, this means the error occurred inside a "critical section". The
backend has taken a lock to protect a part of shared memory, and has
(possibly) made changes that leave that shared memory in an inconsistent
state, but it can no longer complete the operation that would bring the
shared memory back to a consistent state. It cannot simply release the lock
it holds, because that would allow other processes to see the inconsistent
state. So any error raised inside a critical section is escalated to a
PANIC, which aborts the process (the "signal 6: Aborted" in your log) and
makes the postmaster restart the entire cluster. This is drastic, and that
is why the developers try to keep critical sections as small as possible.

It is possible that this code does not really need to be in a critical
section, and it is just that no one has done the work of rearranging the
code to take it out of one.
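
If you are curious where that message comes from, you can find it in a 9.3
source tree (a quick sketch, assuming a checkout of the PostgreSQL source
in the current directory):

    # locate the ereport() call that produced the message
    grep -rn "corrupted item pointer" src/backend/storage/page/

It is raised in bufpage.c's page-item deletion routines as a plain ERROR;
it is the critical section around the caller that promotes it to the PANIC
you saw.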

> 2) What is the best modern way to analyze and fix such an issue?
>

Is the problem reproducible? That is, if you restore the last physical
backup of your pre-upgrade database to a test server and run pg_upgrade on
that, do you get the same problem?
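
For example, something along these lines on a scratch machine (a rough
sketch; "main" and the database name are placeholders, assuming a
Debian-style setup since you used pg_upgradecluster):

    # upgrade the restored 9.2 cluster in place, as before
    pg_upgradecluster 9.2 main
    # then try to trigger the failing vacuum by hand
    psql -d the_affected_db -c 'VACUUM public.postfix_stat0;'

If the PANIC reappears, you have a reproducible test case you can
experiment on without touching production.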

Did you get a core dump out of the panic that you can load into gdb to get
a backtrace? (If so, you should probably take it to the pgsql-hackers
mailing list.)
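
Roughly like this (a sketch; the binary and data-directory paths are
Debian-style guesses, and core dumps have to be enabled in the
postmaster's environment before the next crash):

    # allow core files, then restart the cluster so backends inherit it
    ulimit -c unlimited
    pg_ctlcluster 9.3 main restart
    # after the next PANIC, the core should land in the data directory
    gdb /usr/lib/postgresql/9.3/bin/postgres \
        /var/lib/postgresql/9.3/main/core
    (gdb) bt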

If your top concern is getting all the other databases back as soon as
possible, you should be able to just drop the corrupted database (after
making a full backup). Then you can worry about recovering that database
and rejoining it at your leisure.
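
For instance (a sketch; names and paths are placeholders, and since the
database is corrupted, a cold file-level copy is safer than pg_dump, which
may itself trip over the corruption):

    # cold file-level backup of the whole cluster
    pg_ctlcluster 9.3 main stop
    tar czf /backup/pgdata-9.3-main.tar.gz /var/lib/postgresql/9.3/main
    pg_ctlcluster 9.3 main start
    # drop the corrupted database to stop the PANIC cycle
    dropdb the_affected_db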

Cheers,

Jeff
