Re: How to cripple a postgres server

From: Stephen Robert Norris <srn(at)commsecure(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: How to cripple a postgres server
Date: 2002-05-30 01:11:36
Message-ID: 1022721096.6066.36.camel@ws12
Lists: pgsql-general

On Thu, 2002-05-30 at 01:52, Tom Lane wrote:
> I spent some time this morning trying to reproduce your problem, with
> not much luck. I used the attached test program, in case anyone else
> wants to try --- it fires up the specified number of connections
> (issuing a trivial query on each one, just so that the backend is not
> completely virgin) and goes to sleep. I ran that in one window and did
> manual "vacuum full"s in psql in another window. I was doing the
> vacuums in the regression database which has about 150 tables, so there
> was an SI overrun event (resulting in SIGUSR2) every third or so vacuum.
>
> Using stock Red Hat Linux 7.2 (kernel 2.4.7-10) on a machine with 256M
> of RAM, I was able to run up to about 400 backends without seeing much
> of any performance problem. (I had the postmaster running with
> postmaster -i -F -N 1000 -B 2000 and defaults in postgresql.conf.)
> Each SI overrun fired up all the idle backends, but they went back to
> sleep after a couple of kernel calls and not much computation.

Similar setup here, but with 1GB of RAM. If this problem is some sort of
O(n^2) thing, it could well be the case that it only happens above (for
example) 600 backends, and is fine at 400...
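To put a rough number on that guess, here is a minimal sketch of how much more work a purely quadratic cost would imply at 600 backends than at 400. The quadratic model is only the hypothesis above, not something measured, and the function name is made up for illustration.

```python
def relative_quadratic_cost(n_base, n_new):
    """Ratio of work at n_new backends to work at n_base backends,
    ASSUMING the cost grows as n^2 (a hypothesis, not a measurement)."""
    return (n_new / n_base) ** 2

# Going from 400 to 600 backends under this assumption:
print(relative_quadratic_cost(400, 600))  # 2.25
```

So under that assumption the per-overrun work would be a bit more than double, which could be enough to tip a machine that is only just coping at 400.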

I also wonder if SMP has any impact - if there's lots of semops going
on, and the memory is being thrashed between CPU caches, that won't be
nice...

> Above 500 backends the thing went into swap hell --- it took minutes of
> realtime to finish out the SI overrun cycle, even though the CPU was
> idle (waiting for swap-in) most of the time.

I never swap.

Some more data from this end: with the lines you asked me to remove
yesterday taken out, I have managed to reproduce the problem only once in
about 2 hours. With the lines still in, the problem happens after a
minute or two pretty much every time.

I still see the high numbers of processes in the run queue, and the load
rises, but neither postgres nor the machine stalls.

> What does your strace look like?
>
> regards, tom lane

In "normal" SI overruns, about the same:

--- SIGUSR2 (User defined signal 2) ---
gettimeofday({1022719053, 355014}, NULL) = 0
close(7) = 0
close(6) = 0
close(4) = 0
close(3) = 0
close(9) = 0
semop(6258745, 0xbfffeb04, 1) = 0
semop(6193207, 0xbfffeb04, 1) = 0
open("/var/lib/pgsql/data/base/504592641/1259", O_RDWR) = 3
open("/var/lib/pgsql/data/base/504592641/16429", O_RDWR) = 4
semop(6258745, 0xbfffe8e4, 1) = 0
semop(6225976, 0xbfffe8e4, 1) = 0
open("/var/lib/pgsql/data/base/504592641/1249", O_RDWR) = 6
open("/var/lib/pgsql/data/base/504592641/16427", O_RDWR) = 7
open("/var/lib/pgsql/data/base/504592641/16414", O_RDWR) = 9
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, {it_interval={0, 0}, it_value={0, 0}}) = 0
semop(6258745, 0xbfffea24, 1) = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, {it_interval={0, 0}, it_value={0, 870000}}) = 0
lseek(9, 0, SEEK_END) = 0
semop(4456450, 0xbfffeac4, 1) = 0
sigreturn() = ? (mask now [])
recv(8, 0x839a0a0, 8192, 0) = ? ERESTARTSYS (To be restarted)

That is the typical case, although I see anywhere up to 9 or even 15
semop() calls and file close/open pairs.
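For anyone who wants to count these rather than eyeball them, a rough sketch of a script that tallies semop() calls per completed SIGUSR2 handler in a saved strace log. It assumes the output format shown in this message ("--- SIGUSR2" to start, "sigreturn(" to finish); newer strace versions print these lines differently. The function name is hypothetical.

```python
def semops_per_sigusr2(strace_text):
    """Return one semop() count per SIGUSR2 handler that ran to completion.

    Assumes the strace format shown in this message: a handler starts at a
    line beginning '--- SIGUSR2' and ends at the matching 'sigreturn('.
    Signals that arrive inside the handler (e.g. SIGALRM) are treated as
    nested, so their sigreturn() does not end the SIGUSR2 handler.
    """
    counts = []
    depth = 0    # 0 means we are not inside a SIGUSR2 handler
    semops = 0
    for raw in strace_text.splitlines():
        line = raw.strip()
        if line.startswith("--- SIG"):
            if depth > 0:
                depth += 1              # nested signal inside the handler
            elif line.startswith("--- SIGUSR2"):
                depth, semops = 1, 0    # new SIGUSR2 handler begins
        elif depth > 0 and line.startswith("semop("):
            semops += 1
        elif depth > 0 and line.startswith("sigreturn("):
            depth -= 1
            if depth == 0:
                counts.append(semops)   # handler finished
    return counts
```

Usage would be something like `semops_per_sigusr2(open("backend.strace").read())`, where the file name is just a placeholder. A handler that has not reached its own sigreturn() by the end of the log is not reported.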

When it went mad, this happened:

--- SIGUSR2 (User defined signal 2) ---
gettimeofday({1022720979, 494838}, NULL) = 0
semop(10551353, 0xbfffeb04, 1) = 0
close(7) = 0
close(6) = 0
close(4) = 0
close(3) = 0
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
close(9) = 0
open("/var/lib/pgsql/data/base/504592641/1259", O_RDWR) = 3
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
open("/var/lib/pgsql/data/base/504592641/16429", O_RDWR) = 4
open("/var/lib/pgsql/data/base/504592641/1249", O_RDWR) = 6
open("/var/lib/pgsql/data/base/504592641/16427", O_RDWR) = 7
open("/var/lib/pgsql/data/base/504592641/16414", O_RDWR) = 9
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, {it_interval={0, 0}, it_value={0, 0}}) = 0
semop(10551353, 0xbfffea24, 1) = -1 EINTR (Interrupted system call)
--- SIGALRM (Alarm clock) ---
semop(10551353, 0xbfffe694, 1) = 0
semop(8716289, 0xbfffe694, 1) = 0
sigreturn() = ? (mask now [USR2])

However, the strace stopped just before the ) on the first semop, which
I think means it hadn't completed. The whole thing (postgres, vmstat and
all) stopped for about 10 seconds, then it went on.

This was only a short version of the problem (it can lock up for 20-30
seconds), but I think it's the same thing.

Stephen
