From: | Florian Pflug <fgp(at)phlo(dot)org> |
---|---|
To: | Sergey Koposov <koposov(at)ast(dot)cam(dot)ac(dot)uk> |
Cc: | Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net> |
Subject: | Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile |
Date: | 2012-05-31 00:00:59 |
Message-ID: | 246CDA37-6A93-4F60-9F48-F0B43DC06AC4@phlo.org |
Lists: | pgsql-hackers |
On May31, 2012, at 01:16 , Sergey Koposov wrote:
> On Wed, 30 May 2012, Florian Pflug wrote:
>>
>> I wonder if the huge variance could be caused by non-uniform synchronization costs across different cores. That's not all that unlikely, because at least some cache levels (L2 and/or L3, I think) are usually shared between all cores on a single die. Thus, a cache line bouncing between cores on the same die might very well be faster than it bouncing between cores on different dies.
>>
>> On Linux, you can use the taskset command to explicitly assign processes to cores. The easiest way to check whether that makes a difference is to assign the postmaster one core per connection before launching your test. Assuming that CPU assignments are inherited by child processes, that should then spread your backends out over exactly the cores you specify.
>
> Wow, thanks! This seems to be working to some extent. I've found that distributing each thread x (0<x<7) to CPU 1+3*x (reminder: I have HT disabled, and in total I have 4 CPUs with 6 proper cores each) gives quite good results. And after a few runs, I seem to be getting more or less stable results for multiple threads, with the multithreaded runs taking from 6 to 11 seconds depending on the thread. (Another reminder: 5-6 seconds is roughly the time my queries take in a single thread.)
Wait, so performance *increased* by spreading the backends out over as many dies as possible, not by using as few as possible? That'd be exactly the opposite of what I'd have expected. (I'm assuming that cores on one die have ascending IDs on Linux. If you could post the contents of /proc/cpuinfo, we could verify that.)
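To make that check concrete, here is a minimal Python sketch (not part of the original exchange) that groups logical CPUs by the `physical id` field of /proc/cpuinfo. It assumes an x86 Linux layout in which that field identifies the package, which on this machine would presumably coincide with a die; the function names are hypothetical.

```python
from collections import defaultdict


def cpus_by_package(cpuinfo_text):
    """Group logical CPU ids by the 'physical id' field of /proc/cpuinfo text."""
    packages = defaultdict(list)
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "physical id" and cpu is not None:
            packages[int(value)].append(cpu)
    return dict(packages)


def ids_ascend_per_package(packages):
    """True if every package holds one contiguous run of logical CPU ids."""
    return all(
        sorted(cpus) == list(range(min(cpus), max(cpus) + 1))
        for cpus in packages.values()
    )
```

Feeding it the contents of /proc/cpuinfo would show whether the ascending-ID assumption holds, and whether CPUs 4, 7, 10, 13, 16 and 19 (1+3*x, taking 0<x<7 literally) really land on different packages.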
> So to some extent one can say that the problem is partially solved (i.e. it is probably understood)
Not quite, I think. We still don't really know why there's that much spinlock contention AFAICS. But what we've learned is that the actual spinning on a contested lock is only part of the problem. The cache-line bouncing caused by all those lock acquisitions is the other part, and it's pretty expensive too - otherwise, moving the backends around wouldn't have helped.
> But the question now is whether there is a *PG* problem here or not, or is it Intel's or Linux's problem ?
Neither Intel nor Linux can do much about this, I fear. Synchronization will always be expensive, and the more so the larger the number of cores. Linux could maybe pick a better process-to-core assignment, but it probably won't be able to pick the optimal one for every workload. So unfortunately, this is a Postgres problem, I'd say.
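Until the scheduler does better on its own, the explicit assignment discussed earlier in this thread can be sketched in Python. This is only an illustration, assuming Linux (where `os.sched_setaffinity` is available); the helper names and the stride/offset defaults are hypothetical and simply mirror the 1+3*x scheme quoted above.

```python
import os


def strided_cpus(xs, stride=3, offset=1):
    """Map each thread index x to CPU offset + stride * x (the 1+3*x scheme)."""
    return [offset + stride * x for x in xs]


def pin_current_process(cpus):
    """Restrict the calling process to `cpus` and return the resulting mask.

    On Linux, children created afterwards via fork() inherit this mask.
    """
    os.sched_setaffinity(0, set(cpus))
    return os.sched_getaffinity(0)
```

From the shell, `taskset -c <cpu-list> <command>` has the same effect on the launched command. Whether a given list actually spreads processes across dies still depends on how the kernel numbers the cores on the machine at hand.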
> Because the slowdown was still caused by locking. If there were no locking, there wouldn't be any problems (as demonstrated a while ago by just cat'ting the files in multiple threads).
Yup, we'll have to figure out a way to reduce the locking overhead. 9.2 already scales much better to a large number of cores than previous versions did, but your test case shows that there's still room for improvement.
best regards,
Florian Pflug