***** ********** <zlobnynigga(at)yandex(dot)ru> wrote:
> 17.06.2011, 00:28, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>:
>> ***** ********** <zlobnynigga(at)yandex(dot)ru> wrote:
>>
>>> [4-1] 2011-06-16 17:40:27 UTC LOG: startup process (PID 15292)
>>> was terminated by signal 7: Bus error
>>> Signal 7 usually means a hardware problem. But all 10 replicas
>>> crashed within 10 minutes, say from 13:35 to 13:45.
>>> One important thing: all replicas and the master are running on
>>> OpenVZ.
>> On the face of it, the most likely cause would seem to be
>> hardware or the virtual environment.
> I noticed that the crash takes place when shared buffers are almost
> full, i.e. SELECT SUM(size) FROM adm.buffercache() returns 11670
> about one minute before the crash. Furthermore, last night I set
> shared_buffers to 11 GB, and it is working: no crash, and all
> buffers are used (11120).
Well then, in a pinch you could always fall back to using what
works.
> I still do not believe that this is a hardware problem.
How would an application cause a bus error?
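For what it's worth, about the only way ordinary user code sees a
SIGBUS is when the kernel cannot supply a page behind a mapping it has
already handed out -- which is why it points at the memory/
virtualization layer rather than at anything a query does. A minimal
sketch, purely illustrative and not taken from your setup: map a
shared page with no backing store behind it and then touch it.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
	/* Empty temporary file as the mapping's backing store. */
	char	path[] = "/tmp/sigbus-demo-XXXXXX";
	int		fd = mkstemp(path);
	long	pagesz = sysconf(_SC_PAGESIZE);
	char   *p;

	if (fd < 0)
	{
		perror("mkstemp");
		return 1;
	}
	unlink(path);

	/*
	 * Map one page, but leave the file at length zero, so there is no
	 * block behind the page when it is first touched.
	 */
	p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	printf("touching a page with no backing store...\n");
	fflush(stdout);
	p[0] = 'x';			/* kernel cannot fault the page in -> SIGBUS */

	printf("no SIGBUS here\n");
	return 0;
}

That prints the first message and then dies with "Bus error" -- the
failure has to come from the layer that backs the memory, not from the
application's own logic.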
> Each replica and the master run on their own dedicated servers; no
> hardware is shared.
OK. If they had been on the same blade chassis or something I would
have suspected hardware.
> There is only PostgreSQL on each server, no other software (just
> crond, zabbix, atop). Actually OpenVZ is used only for portability
> (to easily add new replicas or migrate one of them to a new
> server).
Still, it sits between PostgreSQL and the hardware, so you have to
consider it a suspect for any hardware-level problem -- at least if
you want to solve that problem.
> The master did not crash.
Ah, that wasn't clear from the earlier post. I'm not sure how
significant it is, but it's good to know.
> I think that is because it processes fewer SELECT queries, so its
> buffers do not reach the limit.
In your shoes I would now be trying to construct a test program to
exercise progressively larger allocations of shared memory, and test
them both under openvz and without it. Well, first I would probably
try loading the master with queries to drive it to use the full
shared_buffers space, *then* move on to the test program.
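Something along these lines might do as a starting point -- just a
rough sketch off the top of my head, untested: it uses the same
System V shared memory calls PostgreSQL uses, grabs progressively
larger segments, and writes to every page. The 1 GB step and 16 GB
ceiling are arbitrary, and kernel.shmmax/shmall must already be large
enough, which they presumably are since PostgreSQL starts with your
big shared_buffers.

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int
main(void)
{
	/* Start at 1 GB and grow in 1 GB steps up to 16 GB. */
	size_t		step = (size_t) 1 << 30;
	size_t		max = (size_t) 16 << 30;
	size_t		size;

	for (size = step; size <= max; size += step)
	{
		int			shmid;
		void	   *addr;

		shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
		if (shmid < 0)
		{
			perror("shmget");
			return 1;
		}

		addr = shmat(shmid, NULL, 0);
		if (addr == (void *) -1)
		{
			perror("shmat");
			shmctl(shmid, IPC_RMID, NULL);
			return 1;
		}

		/* Mark for removal now so a crash doesn't leave it behind. */
		shmctl(shmid, IPC_RMID, NULL);

		printf("touching every page of a %zu MB segment...\n",
			   size >> 20);
		fflush(stdout);

		/* A bus error here implicates the memory/virtualization layer. */
		memset(addr, 0xAB, size);

		shmdt(addr);
		printf("%zu MB ok\n", size >> 20);
	}
	return 0;
}

Run it inside one of the openvz containers and again directly on the
host; if only the container run blows up around the size where
PostgreSQL crashes, you have your answer.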
The relevant question here is why others can successfully use large
shared_buffers settings while you can't. Something is different in
your environment. What?
-Kevin