Re: Two Necessary Kernel Tweaks for Linux Systems

From:	Henri Philipps <henri(dot)philipps(at)gmail(dot)com>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Two Necessary Kernel Tweaks for Linux Systems
Date:	2013-01-10 08:51:26
Message-ID:	CABvEAQs7cRUH9PLHyTo-F+0t8=iKgX8N3ZSfCuzeSWXBJ9hM3w@mail.gmail.com
Lists:	pgsql-performance

Hi,

We also hit this performance barrier a while ago when migrating a
database on a big server (48-core Opteron, 512 GB RAM) from kernel
2.6.32 to 3.2 (both kernels from Debian packages). The system load
became very high, as you also observed (I don't have the exact numbers
at hand right now).

After some investigation I found that the high system load was caused
by the postgresql processes migrating from core to core at very high
rates. So the behaviour of the CFS scheduler must have changed in this
regard between the 2.6.32 and 3.2 kernels.

You can easily see this by looking at how much CPU time the migration
kernel threads have accumulated (ps ax | grep migration). A look into
/proc/sched_debug can also give you more insight into the scheduler
behaviour.
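
The check above can also be scripted. The following is only a minimal
sketch: it assumes a procps-style ps that accepts "-eo comm,time" and
prints TIME as [DD-]HH:MM:SS, and the function names are made up for
illustration.

```python
import subprocess


def cputime_to_seconds(text):
    """Convert a ps TIME value of the form [DD-]HH:MM:SS to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    # Pad short forms such as MM:SS to three fields.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def migration_cpu_seconds(ps_output):
    """Map each migration/N kernel thread to its accumulated CPU seconds."""
    result = {}
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].startswith("migration/"):
            result[fields[0]] = cputime_to_seconds(fields[1])
    return result


def live_migration_cpu_seconds():
    """Run ps on the local machine and parse its output (assumes procps ps)."""
    out = subprocess.run(
        ["ps", "-eo", "comm,time"],
        capture_output=True, text=True, check=True,
    ).stdout
    return migration_cpu_seconds(out)
```

Calling live_migration_cpu_seconds() twice, a minute or so apart, and
comparing the two results shows how fast the migration threads are
accumulating CPU time, which is more telling than the absolute totals.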

On NUMA systems the scheduler tries to migrate processes to the nodes
on which they have the best memory locality. But on a big database a
single process typically reads randomly from a dataset that is spread
across all nodes. On newer kernels the CFS scheduler seems to try more
aggressively to migrate processes to other cores. I don't know whether
this is for better load balancing or for better memory locality, but
process migrations consume a lot of resources.

I had to change sched_migration_cost from 500000 (0.5ms) to 100000000
(100ms); the value is given in nanoseconds. This means the scheduler
only considers a task for migration if the task has been running for
at least 100ms instead of 0.5ms. This solved the problem for us - the
migration kernel threads no longer had much work to do, and the system
load went down again.
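
For reference, here is a small sketch of how the tunable could be
located and converted. The three candidate paths are an assumption on
my part: the name and location of this setting differ between kernel
versions, so check what your kernel actually exposes. The helper names
are hypothetical, and writing the value requires root.

```python
import os

# Candidate locations (assumption: which one exists depends on the kernel).
CANDIDATE_PATHS = (
    "/proc/sys/kernel/sched_migration_cost",
    "/proc/sys/kernel/sched_migration_cost_ns",
    "/sys/kernel/debug/sched/migration_cost_ns",
)


def find_tunable(paths=CANDIDATE_PATHS, exists=os.path.exists):
    """Return the first candidate path that exists, or None."""
    for path in paths:
        if exists(path):
            return path
    return None


def ns_to_ms(value_ns):
    """Convert nanoseconds to milliseconds."""
    return value_ns / 1_000_000


def read_ns(path):
    """Read the current value in nanoseconds."""
    with open(path) as handle:
        return int(handle.read().strip())


def write_ns(path, value_ns):
    """Write a new value in nanoseconds (needs root privileges)."""
    with open(path, "w") as handle:
        handle.write(str(value_ns))
```

With these helpers, ns_to_ms(500000) gives 0.5 and ns_to_ms(100000000)
gives 100.0, matching the two values mentioned above. A change written
this way does not survive a reboot; to make it permanent it would have
to go into the sysctl configuration instead.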

A general problem is that the CFS scheduler changes a lot from one
kernel version to the next, so it is really hard to predict which
regressions you might hit when moving to another kernel version.
Scheduling on NUMA systems is also very complex.

An interesting dissertation showing the inconsistent behaviour of the
CFS scheduler:
http://research.cs.wisc.edu/adsl/Publications/meehean-thesis11.pdf

Some other parameters that could be considered for systematic benchmarking are

sched_latency_ns
sched_min_granularity_ns

I guess that higher values could also improve performance on systems
with many cores and many connections.
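
To make such a benchmark systematic, the combinations to test could be
enumerated up front. This is only a sketch: the function name is
hypothetical, and the values in the usage note are placeholders, not
recommendations or measured results.

```python
import itertools


def benchmark_grid(latency_ns_values, min_granularity_ns_values):
    """Yield one settings dict per combination of the two tunables."""
    for latency, granularity in itertools.product(
        latency_ns_values, min_granularity_ns_values
    ):
        yield {
            "sched_latency_ns": latency,
            "sched_min_granularity_ns": granularity,
        }
```

For example, list(benchmark_grid([6000000, 24000000], [750000, 3000000]))
produces four settings dicts. Each one would be applied before running
the same workload, so that the results stay comparable.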

Thanks for starting this interesting thread!

Henri

In response to

Re: Two Necessary Kernel Tweaks for Linux Systems at 2013-01-08 19:32:14 from Shaun Thomas

Responses

Re: Two Necessary Kernel Tweaks for Linux Systems at 2013-01-10 15:53:25 from Shaun Thomas
autovacuum fringe case? at 2013-01-23 16:53:57 from AJ Weber
