Re: Debugging shared memory issues on CentOS

From: Mack Talcott <mack(dot)talcott(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Debugging shared memory issues on CentOS
Date: 2013-12-11 08:23:29
Message-ID: CAPZbZCDEGY9nPtFOrXqYynt4oOk9RRsSiR+G_4BUNQmK5QK6pA@mail.gmail.com
Lists: pgsql-performance

On Tue, Dec 10, 2013 at 8:54 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Mack Talcott <mack(dot)talcott(at)gmail(dot)com> writes:
>> I am trying to debug some shared memory issues with Postgres 9.3.1 and
>> CentOS release 6.3 (Final). I have a database machine that probably has
>> some misconfigured shared memory settings. It's getting into 2+ GB of
>> swap. Restarting postgres frees all of the memory, but after a few hours
>> of normal usage it will go back into swap.
>
> Are you sure the kernel isn't just swapping out some idle processes
> because it feels like it? These numbers don't exactly look like a
> machine under stress:
>
>> top - 09:38:16 up 1 day, 21:21, 3 users, load average: 0.40, 0.54, 0.45
>> Tasks: 253 total, 2 running, 251 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 0.7%us, 0.2%sy, 0.0%ni, 97.8%id, 1.2%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 6998260k total, 6849048k used, 149212k free, 248k buffers
>> Swap: 440478516k total, 1981912k used, 438496604k free, 1541356k cached
>
> In particular, you've got 1.5 gig of filesystem cache, so you're hardly
> out of memory. I don't know where the other 5.5 gig of RAM went, but
> it doesn't look like postgres is eating it; what else is running on
> this box?
>
> These lines look absolutely normal, assuming that you've configured
> shared_buffers somewhere in the neighborhood of 1GB:
>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 3534 postgres 20 0 2330m 1.4g 1.1g S 0.0 20.4 1:06.99 postgres: deploy mtalcott 10.222.154.172(53495) idle
>> 9143 postgres 20 0 2221m 1.1g 983m S 0.0 16.9 0:14.75 postgres: deploy mtalcott 10.222.154.167(35811) idle
>> 6026 postgres 20 0 2341m 1.1g 864m S 0.0 16.4 0:46.56 postgres: deploy mtalcott 10.222.154.167(37110) idle
>> 18538 postgres 20 0 2327m 1.1g 865m S 0.0 16.1 2:06.59 postgres: deploy mtalcott 10.222.154.172(47796) idle
>> 1575 postgres 20 0 2358m 1.1g 858m S 0.0 15.9 1:41.76 postgres: deploy mtalcott 10.222.154.172(52560) idle
>
> The key thing to realize about that is that the SHR column is *shared*
> memory, ie all these processes are referencing the same chunk of about 1GB
> worth of memory. The process-specific memory is RES minus SHR, and none
> of those processes seem tremendously out of line on that measure. (Note:
> the fact that the SHR values aren't all exactly the same is because top
> doesn't count a shared page until the process has physically touched that
> page. Even the guy with 1.1g of SHR might not have touched all of the
> shared storage yet.)
>
> I'm not sure you have a problem here. If you do, these figures aren't
> showing it. Having some stuff shoved out to swap is not a problem unless
> you have a problem with the swap I/O rate. You might try watching "vmstat
> 1" for awhile to see if the si/so columns show significant activity.
>
> regards, tom lane

Thanks for your reply. I've included the rest of the top output
below. This is a dedicated postgres box, so nothing else is running.

shared_buffers is set to 1.8g to accommodate some of our larger
operations. It looks like it could be lowered a bit, since the largest
SHR value among the backends is only 1.1g.

The pattern I'm seeing is that postgres processes keep growing not only
in shared memory (which makes sense as they touch more of the shared
buffers, as you've pointed out) but also in process-specific memory as
they run more queries. The largest ones are using around 300MB of
process-specific memory (RES minus SHR), even when they're idle and
outside of any transaction.

As for CentOS keeping 1.5g of disk cache while pushing process memory
out to swap, I'm at a loss. I've played with vm.swappiness, lowering it
from the default of 60 to 10 with sysctl, but it didn't have any effect.
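
For reference, this is what I ran:

    # check the current value (this box showed 60)
    sysctl vm.swappiness
    # lower it for this test; vm.swappiness=10 in /etc/sysctl.conf would persist it
    sysctl -w vm.swappiness=10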

Once memory usage reaches 70-80%, the machine starts using swap, and
swap usage keeps growing. At first, queries become slightly slower. Then
some larger selects start taking 10, then 20, then 30 seconds. During
this, vmstat shows 5-20 procs waiting on both CPU and I/O. All of a
sudden, generally after some large transaction, about 1g of swap is
released and the number of blocked procs jumps to 50-80. Everything
grinds to a halt for a few minutes. Sometimes my app can recover, and
sometimes it needs a little kick.

As expected, resetting the connection clears the process-specific
memory. The same number of connections on the same machine only use
20% of memory (with 0 swap) when I periodically reconnect. What kind
of information are these processes holding on to? I would expect
long-running, idle postgres processes to have similar memory usage to
brand new, idle ones.
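
One way to cross-check the private (non-shared) portion for a single
backend is to sum the Private_* fields from /proc/<pid>/smaps; the PID
here is just an example taken from the top output below:

    # sum the private (non-shared) resident memory of one backend
    awk '/^Private/ {sum += $2} END {printf "%.1f MB private\n", sum/1024}' /proc/3534/smaps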

One thing worth mentioning is that I am heavily using schemas. On
every request, I am setting and resetting search_path.
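
Concretely, each request does something along these lines (the schema
name is just a placeholder):

    SET search_path TO some_tenant_schema, public;
    -- ... run the request's queries ...
    RESET search_path;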

This top output was captured just before swap was released:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3534 postgres 20 0 2330m 1.4g 1.1g S 0.0 20.4 1:06.99 postgres: deploy mtalcott 10.222.155.179(53495) idle
9143 postgres 20 0 2221m 1.1g 983m S 0.0 16.9 0:14.75 postgres: deploy mtalcott 10.222.155.164(35811) idle
6026 postgres 20 0 2341m 1.1g 864m S 0.0 16.4 0:46.56 postgres: deploy mtalcott 10.222.155.164(37110) idle
18538 postgres 20 0 2327m 1.1g 865m S 0.0 16.1 2:06.59 postgres: deploy mtalcott 10.222.155.179(47796) idle
1575 postgres 20 0 2358m 1.1g 858m S 0.0 15.9 1:41.76 postgres: deploy mtalcott 10.222.155.179(52560) idle
17931 postgres 20 0 2343m 1.1g 834m S 0.0 15.8 2:04.61 postgres: deploy mtalcott 10.222.155.164(54439) idle
18286 postgres 20 0 2363m 1.0g 797m S 1.3 15.6 1:54.97 postgres: deploy mtalcott 10.222.155.179(47588) idle
4541 postgres 20 0 2343m 1.0g 783m S 0.0 15.2 1:20.75 postgres: deploy mtalcott 10.222.155.179(53938) idle
18763 postgres 20 0 2347m 1.0g 772m S 0.0 14.9 1:49.83 postgres: deploy mtalcott 10.222.155.164(32853) idle
1088 postgres 20 0 2336m 1.0g 778m S 0.3 14.9 1:35.40 postgres: deploy mtalcott 10.222.155.179(52312) idle
17933 postgres 20 0 2343m 996m 800m S 0.0 14.6 2:11.68 postgres: deploy mtalcott 10.222.155.164(54443) idle
1089 postgres 20 0 2310m 970m 776m S 1.7 14.2 1:18.34 postgres: deploy mtalcott 10.222.155.164(46130) idle
3535 postgres 20 0 2354m 950m 779m S 0.0 13.9 1:18.44 postgres: deploy mtalcott 10.222.155.164(33599) idle
1708 postgres 20 0 2308m 940m 760m S 0.0 13.8 1:08.72 postgres: deploy mtalcott 10.222.155.164(49552) idle
18540 postgres 20 0 2337m 932m 784m S 0.7 13.6 1:50.66 postgres: deploy mtalcott 10.222.155.164(59856) idle
8471 postgres 20 0 2312m 683m 429m S 0.0 10.0 0:54.35 postgres: deploy mtalcott 10.222.155.179(57867) idle
5931 postgres 20 0 2327m 589m 340m S 0.0 8.6 0:40.07 postgres: deploy mtalcott 10.222.155.179(55092) idle
6070 postgres 20 0 2306m 568m 358m S 0.0 8.3 0:42.56 postgres: deploy mtalcott 10.222.155.179(55307) idle
9135 postgres 20 0 2235m 523m 341m S 0.0 7.7 0:19.65 postgres: deploy mtalcott 10.222.155.164(35140) idle
10996 postgres 20 0 2103m 229m 169m S 0.0 3.4 0:01.65 postgres: deploy mtalcott 10.222.155.179(60798) idle
11001 postgres 20 0 2062m 163m 144m S 0.7 2.4 0:01.90 postgres: deploy mtalcott 10.222.155.164(44039) idle
17697 postgres 20 0 2038m 151m 150m S 0.0 2.2 0:09.82 postgres: checkpointer process
10869 postgres 20 0 2045m 82m 76m S 3.3 1.2 0:12.19 postgres: deploy mtalcott 10.197.52.158(43556) idle in transaction
10994 postgres 20 0 2052m 61m 50m S 0.0 0.9 0:00.77 postgres: deploy mtalcott 10.222.155.179(60757) idle
17680 postgres 20 0 2037m 37m 37m S 0.0 0.6 0:03.34 /usr/local/pgsql9.3/bin/postgres -D /db/pgsql/9.3/data
17698 postgres 20 0 2038m 36m 35m S 0.0 0.5 0:02.85 postgres: writer process
10993 postgres 20 0 2045m 29m 22m S 0.0 0.4 0:00.26 postgres: deploy mtalcott 10.222.155.164(42908) idle
17701 postgres 20 0 134m 21m 272 S 0.0 0.3 1:21.61 postgres: stats collector process
4905 postgres 20 0 2045m 13m 8408 S 0.0 0.2 0:00.44 postgres: deploy mtalcott 10.222.155.164(47193) idle
5041 postgres 20 0 2044m 13m 8124 S 0.0 0.2 0:00.54 postgres: deploy mtalcott 10.222.155.164(49813) idle
5036 postgres 20 0 2044m 12m 7808 S 0.0 0.2 0:00.50 postgres: deploy mtalcott 10.222.155.164(49380) idle
6452 postgres 20 0 2044m 10m 6112 S 0.0 0.2 0:00.26 postgres: deploy mtalcott 10.222.155.164(44313) idle
5023 postgres 20 0 2044m 10m 5868 S 0.0 0.2 0:00.50 postgres: deploy mtalcott 10.222.155.164(47882) idle
5029 postgres 20 0 2045m 10m 6732 S 0.0 0.1 0:00.81 postgres: deploy mtalcott 10.222.155.164(48498) idle
5808 postgres 20 0 2044m 9408 7040 S 0.0 0.1 0:00.30 postgres: deploy mtalcott 10.222.155.164(33987) idle
17700 postgres 20 0 2039m 4728 4432 S 0.0 0.1 0:00.71 postgres: autovacuum launcher process
10567 deploy 20 0 97820 1372 432 S 0.0 0.0 0:00.02 sshd: deploy(at)pts/2
10564 root 20 0 97820 1192 284 S 0.0 0.0 0:00.04 sshd: deploy [priv]
10998 deploy 20 0 15168 1044 604 R 0.7 0.0 0:00.59 top -c
