Re: [ADMIN]openvz and shared memory trouble

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Willy-Bas Loos <willybas(at)gmail(dot)com>
Cc: lst_hoe02(at)kwsoft(dot)de, pgsql-admin <pgsql-admin(at)postgresql(dot)org>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: [ADMIN]openvz and shared memory trouble
Date: 2014-03-31 14:38:31
Message-ID: 53397DE7.2070903@aklaver.com
Lists: pgsql-admin pgsql-general

On 03/31/2014 04:12 AM, Willy-Bas Loos wrote:
>
> On Sat, Mar 29, 2014 at 6:17 PM, Adrian Klaver
> <adrian(dot)klaver(at)aklaver(dot)com> wrote:
>
> On 03/29/2014 08:19 AM, Willy-Bas Loos wrote:
>
> The error that shows up is a Bus error.
> That's on the replication slave.
> Here's the log about it:
> 2014-03-29 12:41:33 CET db: ip: us: FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> cp: cannot stat `/data/postgresql/9.1/main/__wal_archive/__00000001000000720000000A': No such file or directory
> 2014-03-29 12:41:33 CET db: ip: us: LOG: unexpected pageaddr 71/E9DA0000 in log file 114, segment 10, offset 14286848
> cp: cannot stat `/data/postgresql/9.1/main/__wal_archive/__00000001000000720000000A': No such file or directory
> 2014-03-29 12:41:33 CET db: ip: us: LOG: streaming replication successfully connected to primary
> 2014-03-29 12:41:48 CET db: ip: us: LOG: startup process (PID 17452) was terminated by signal 7: Bus error
> 2014-03-29 12:41:48 CET db: ip: us: LOG: terminating any other active server processes
> 2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos WARNING: terminating connection because of crash of another server process
> 2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos HINT: In a moment you should be able to reconnect to the database and repeat your command.
>
>
> Well what I am seeing are WAL log errors. One saying no file is
> present, the other pointing at a possible file corruption.
>
> Those are normal notices, nothing to worry about.

Well, other than that they cause the standby to reconnect to the primary,
and it is during that reconnect that the crash occurs.

>
> Shared memory problems are offered as a possible cause only. Right
> now I would say we are seeing only half the picture. The Postgres
> logs from the same time period for the primary server, as well as
> the system logs for the openvz container would help fill in the
> other half of the picture.
>
>
> Here's the log from the primary postgres server:
> 2014-03-29 12:41:29 CET db:wbloos ip:[local] us:wbloos NOTICE: ALTER TABLE will create implicit sequence "test_x_seq" for serial column "test.x"
> 2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: SSL renegotiation failure
> 2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: SSL error: unexpected record
> 2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: could not send data to client: Connection reset by peer
> 2014-03-29 12:41:48 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: could not receive data from client: Connection reset by peer
> 2014-03-29 12:41:48 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: unexpected EOF on standby connection
>
> (the SSL renegotiation failure happens all the time, without the crash)
>
> And here's the syslog from the container:
> Mar 29 12:41:01 mycontainer snmpd[8819]: Connection from UDP: [xxx.xxx.xxx.xxx]:59090->[xxx.xxx.xxx.xxx]
> Mar 29 12:42:30 mycontainer snmpd[8819]: Connection from UDP: [xxx.xxx.xxx.xxx]:35949->[xxx.xxx.xxx.xxx]
>
> The log on the host doesn't say anything interesting either.
>
> A cursory look at memory management in openvz shows it is different
> from other virtualization software and physical machines. Whether
> that is a problem would seem to be dependent on where you are on the
> learning curve:)
>
> That sounds like "there is a solution to the problem, all you have to do
> is find out what it is". There doesn't seem to be a variable in the
> beancounters or anywhere else that can prevent the bus error from happening.
> There seems to be no separate way of guaranteeing shared memory.
> There's no OOM killer active either, nor is host or server running short
> of memory.

At this point I am not sure it is even clear what is causing the
error, so finding a solution would be a hit-or-miss affair at best.

>
> I'm still worried that it's like Tom Lane said in another discussion:"So
> basically, you've got a broken kernel here: it claimed to give PG circa
> (135MB) of memory, but what's actually there is only about (128MB). I
> don't see any connection between those numbers and the shmmax/shmall
> settings, either --- so I think this must be some busted implementation
> of a VM-level limitation."
> (here:
> http://www.postgresql.org/message-id/CAK3UJREBcyVBtr8D7vMfU=uDdkjXkrPnGcuy8erYB0tMfKe1LA@mail.gmail.com)
>
> And it makes me wonder what else may be issues that arise from that. But
> especially, what i can do about it.

I do not use openvz so I do not have a test bed to try out, but this
page seems to be related to your problem:

http://openvz.org/Resource_shortage

or, if you want more detail and a link to what looks to be a replacement
for beancounters:

http://openvz.org/Setting_UBC_parameters
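
As far as I understand those pages, the basic check is to look at the
failcnt column of /proc/user_beancounters inside the container: a non-zero
value means the kernel refused a request for that resource. A minimal
Python sketch of that check follows. Untested on a real openvz box, since I
do not have one; the SAMPLE text and its numbers are made up for
illustration, and the helper names are my own, not part of any openvz tool.

```python
from typing import Iterator, List, NamedTuple


class Counter(NamedTuple):
    uid: str
    resource: str
    held: int
    maxheld: int
    barrier: int
    limit: int
    failcnt: int


def parse_beancounters(text: str) -> Iterator[Counter]:
    """Yield one Counter per resource row of user_beancounters-style text."""
    uid = ""
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the "Version:" line and the column header.
        if not fields or fields[0] in ("Version:", "uid"):
            continue
        # The first row of each container starts with "<uid>:".
        if fields[0].endswith(":"):
            uid = fields[0].rstrip(":")
            fields = fields[1:]
        # Expect: resource held maxheld barrier limit failcnt
        if len(fields) != 6:
            continue
        resource, *numbers = fields
        yield Counter(uid, resource, *(int(n) for n in numbers))


def failed_counters(text: str) -> List[Counter]:
    """Return only the rows whose failcnt is greater than zero."""
    return [c for c in parse_beancounters(text) if c.failcnt > 0]


# Illustrative sample only; these values do NOT come from the thread.
SAMPLE = """\
Version: 2.5
       uid  resource           held    maxheld    barrier      limit    failcnt
      101:  kmemsize        2000000    2500000   11055923   11377049          0
            privvmpages       60000      65536      65536      69632         12
            shmpages           8000      21504      21504      21504          3
"""

# On a real container, read /proc/user_beancounters instead of SAMPLE.
for c in failed_counters(SAMPLE):
    print(f"uid {c.uid}: {c.resource} failcnt={c.failcnt} "
          f"(held={c.held}, barrier={c.barrier}, limit={c.limit})")
```

If failcnt for something like shmpages or privvmpages goes up at the moment
the startup process dies, that would at least point at a container limit
rather than at Postgres itself.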

>
> Cheers,
>
> WBL
>
> --
> "Quality comes from focus and clarity of purpose" -- Mark Shuttleworth

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com
