From: | Thomas O'Connell <tfo(at)monsterlabs(dot)com> |
---|---|
To: | pgsql-admin(at)postgresql(dot)org |
Subject: | Re: performance tuning: shared_buffers, sort_mem; swap |
Date: | 2002-08-13 21:23:26 |
Message-ID: | tfo-650CFB.16232613082002@news.hub.org |
Lists: | pgsql-admin |
In article <1101(dot)1029272567(at)sss(dot)pgh(dot)pa(dot)us>,
tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane) wrote:
> Hmm. That's definitely a startup-time error. The only way that code
> could be executed later than postmaster startup is if you suffer a
> database crash and the postmaster is trying to reinitialize the system
> with a fresh shared-memory arena. That would say that this isn't your
> primary problem, but a consequence of a crash that'd already occurred.
Interesting. Particularly interesting because postgres actually
restarts itself intelligently after a crash under duress. We've gotten
this error every time, and postgres has always been running properly
again after a minute or two of downtime. I'd always assumed this message
explained why it died in the first place, but I guess it instead
reflects a startup failure after the first crash.
> I am curious why you'd get "Invalid argument" (EINVAL), as presumably
> these are the same arguments that the kernel accepted on the previous
> cycle of life. But that's probably not the issue to focus on.
Right. I think this is related to your speculation, below.
> If it happens to select
> a database backend to kill, the postmaster will interpret the backend's
> unexpected exit as a crash, and will force a database restart.
I guess this is what we're seeing, then. Right before the IPC error,
there are usually several of these:
"NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query."
I had always thought this just meant that postmaster children were
dying. Does it instead mean that the main backend server is dying
repeatedly? I.e., is this the forced database restart you mention above?
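Tom's explanation can be pictured as a small supervisor loop. The sketch below is a hypothetical, much-simplified model: the `Supervisor` class, its action strings, and the rule that only a signal death counts as a crash are illustrative assumptions, not PostgreSQL code. It shows why a single backend killed by the kernel makes every other session receive the "died abnormally" notice: the postmaster can no longer trust shared memory, so it ends all sessions and rebuilds the arena, while the postmaster process itself keeps running.

```python
import signal
from dataclasses import dataclass, field


@dataclass
class Supervisor:
    """Toy stand-in for a postmaster: tracks child PIDs and records what it would do."""
    children: set[int] = field(default_factory=set)
    actions: list[str] = field(default_factory=list)

    def child_exited(self, pid: int, returncode: int) -> None:
        self.children.discard(pid)
        # Assumption: follow Python's subprocess convention, where a negative
        # returncode means "terminated by that signal". Only signal deaths are
        # treated as crashes here; the real postmaster's rules differ in detail.
        if returncode < 0:
            sig = signal.Signals(-returncode).name
            self.actions.append(f"child {pid} killed by {sig}: treat as crash")
            # Shared memory may now be corrupt, so every surviving backend is
            # told to exit (the source of the NOTICE quoted above) ...
            for other in sorted(self.children):
                self.actions.append(f"tell child {other} to exit")
            self.children.clear()
            # ... and the shared-memory arena is recreated from scratch.
            self.actions.append("reinitialize shared memory")
        else:
            self.actions.append(f"child {pid} exited with status {returncode}")
```

For example, `Supervisor(children={101, 102, 103}).child_exited(102, -signal.SIGKILL)` records a crash for 102, exit requests for 101 and 103, and a shared-memory reinitialization. SIGKILL is used because it is what the Linux out-of-memory killer typically sends.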
> Perhaps
> when the postmaster tries to reallocate the shmem segment a few
> milliseconds later, the kernel still thinks it's under load and rejects
> a shmem request that it'd normally have accepted. (That last bit is
> just speculation though.)
I think this is pretty good speculation, considering that after things
settle down a bit, it perks right up. Wow, this is all great stuff to
know.
> Possible solutions: (a) buy more RAM and/or increase available swap
> space (I'm not sure whether more swap, without more physical RAM,
> actually helps; anyone know?); (b) reduce peak load by reducing
> max_connections and/or scaling back your other servers; (c) switch to
> another OS --- I don't think the *BSD kernels have this brain-damaged
> idea about how to cope with low memory...
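The "memory math" behind this thread's subject can be written down as a rough worst case: the shared buffer pool, plus, for every allowed connection, its sort memory and some fixed per-backend overhead. This is a minimal sketch, assuming the 7.2-era units (`shared_buffers` counted in 8 KB buffers, `sort_mem` in KB per sort operation); `sorts_per_backend` and `backend_overhead_mb` are hypothetical parameters with made-up defaults, not PostgreSQL settings. The point it illustrates is that the per-connection term scales linearly with `max_connections`, which is why option (b) lowers peak demand.

```python
BLOCK_SIZE_KB = 8  # default PostgreSQL page size (BLCKSZ), in kilobytes


def peak_memory_mb(shared_buffers: int, max_connections: int, sort_mem_kb: int,
                   sorts_per_backend: int = 1,
                   backend_overhead_mb: float = 2.0) -> float:
    """Rough worst-case memory estimate in megabytes.

    shared_buffers: number of 8 KB shared buffers.
    sort_mem_kb: memory each sort may use before spilling to disk; a complex
        query can run several sorts at once, hence sorts_per_backend.
    sorts_per_backend, backend_overhead_mb: hypothetical assumptions, not
        real configuration parameters.
    """
    shared_mb = shared_buffers * BLOCK_SIZE_KB / 1024
    per_backend_mb = sorts_per_backend * sort_mem_kb / 1024 + backend_overhead_mb
    return shared_mb + max_connections * per_backend_mb
```

With these illustrative inputs, `peak_memory_mb(4096, 64, 4096)` comes to 416.0 MB, and halving `max_connections` to 32 brings it to 224.0 MB. Comparing such a figure against physical RAM, minus what the other servers on the box need, is the sanity check.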
Well, our solution for the time being has been to add saner
rate-limiting so that the web server can't pound the database as hard.
In essence, we were experiencing DoS attacks: requests were coming in
several times a minute from the same IP. We still accept a reasonable
number of requests, as befits a public web application server, but we've
managed to stop the crashing, for now.
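The message doesn't say how the rate limiting was implemented. As one possible approach, here is a sketch of a per-IP sliding-window limiter that a web tier could consult before opening a database connection. The class name, the limits, and the injectable clock are illustrative assumptions.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window log: accept at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behavior can be tested deterministically
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Discard timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # reject before the request ever reaches the database
        hits.append(now)
        return True
```

For example, with `limit=3` and `window=60.0`, a fourth request from the same address within a minute is refused. Requests are accepted again once the oldest timestamps are more than 60 seconds old. Keeping timestamps in a `deque` makes pruning cheap. A long-running version would also need to evict entries for IPs that have gone idle.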
Still, all of this is a great addition to my quest for better tuning.
I was under the mistaken impression that my bad memory math was
somehow responsible for postgres being the point of failure under
stress. Lucky me, as a DBA, to learn otherwise!
Thanks!
-tfo