From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Fast DSM segments |
Date: | 2020-04-09 05:45:25 |
Message-ID: | CA+hUKGLAE2QBv-WgGp+D9P_J-=yne3zof9nfMaqq1h3EGHFXYQ@mail.gmail.com |
Lists: | pgsql-hackers |
Hello PostgreSQL 14 hackers,
FreeBSD is much faster than Linux (and probably Windows) at parallel
hash joins on the same hardware, primarily because its DSM segments
run in huge pages out of the box. There are various ways to convince
recent-ish Linux to put our DSMs on huge pages (see below for one),
but that's not the only problem I wanted to attack.
The attached highly experimental patch adds a new GUC,
dynamic_shared_memory_main_size. If you set it > 0, it creates a
fixed-size shared memory region that supplies memory for "fast" DSM
segments. When there isn't enough free space, dsm_create() falls back
to the traditional approach using e.g. shm_open(). This allows parallel
queries to run faster, because:
* no more expensive system calls
* no repeated VM allocation (whether explicit posix_fallocate() or first-touch)
* can be in huge pages on Linux and Windows
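To try it out, something like this should be all that's needed (a
sketch only; I'm treating the new GUC as a normal memory-unit,
restart-only setting, since it carves out a fixed region at postmaster
startup, and the values here are arbitrary):
-- sketch: reserve a 1GB region for "fast" DSM segments, backed by huge pages
-- (both settings are assumed to require a server restart)
alter system set dynamic_shared_memory_main_size = '1GB';
alter system set huge_pages = 'try';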
This makes lots of parallel queries measurably faster, especially
parallel hash join. To demonstrate with a very simple query:
create table t (i int);
insert into t select generate_series(1, 10000000);
select pg_prewarm('t');
set work_mem = '1GB';
select count(*) from t t1 join t t2 using (i);
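To double-check that this really runs as a Parallel Hash Join, and to
see how many workers it got, plain EXPLAIN ANALYZE is enough (nothing
patch-specific here):
-- confirm the plan shape and worker count for the test query
explain (analyze, costs off)
select count(*) from t t1 join t t2 using (i);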
Here are some quick and dirty results from a Linux 4.19 laptop. The
first column is the new GUC, and the last column is from "perf stat -e
dTLB-load-misses -p <backend>".
size | huge_pages | time | speedup | TLB misses |
---|---|---|---|---|
0 | off | 2.595s | | 9,131,285 |
0 | on | 2.571s | 1% | 8,951,595 |
1GB | off | 2.398s | 8% | 9,082,803 |
1GB | on | 1.898s | 37% | 169,867 |
You can get some of this speedup unpatched on a Linux 4.7+ system by
putting "huge=always" in your /etc/fstab options for /dev/shm (= where
shm_open() lives). For comparison, that gives me:
size | huge_pages | time | speedup | TLB misses |
---|---|---|---|---|
0 | on | 2.007s | 29% | 221,910 |
That still leaves the other 8% on the table, and in fact that 8%
explodes to a much larger number as you throw more cores at the
problem (here I was using defaults, 2 workers). Unfortunately, dsa.c
-- used by parallel hash join to allocate vast amounts of memory
really fast during the build phase -- holds a lock while creating new
segments, as you'll soon discover if you test very large hash join
builds on a 72-way box. I considered allowing concurrent segment
creation, but as far as I could see that would lead to terrible
fragmentation problems, especially in combination with the geometric
growth policy for segment sizes that our limited number of slots
forces on us. I think this is
the main factor that causes parallel hash join scalability to fall off
around 8 cores. The present patch should really help with that (more
digging in that area needed; there are other ways to improve that
situation, possibly including something smarter than a stream of
dsa_allocate(32kB) calls).
A competing idea would be to keep freelists of lingering DSM segments for
reuse. Among other problems, you'd probably have fragmentation
problems due to their differing sizes. Perhaps there could be a
hybrid of these two ideas, putting a region for "fast" DSM segments
inside many OS-supplied segments, though it's obviously much more
complicated.
As for what a reasonable setting would be for this patch, well, erm,
it depends. Obviously that's RAM that the system can't use for other
purposes while you're not running parallel queries, and if it's huge
pages, it can't be swapped out; if it's not huge pages, then it can be
swapped out, and that'd be terrible for performance next time you need
it. So you wouldn't want to set it too large. If you set it too
small, it falls back to the traditional behaviour.
One argument I've heard in favour of creating fresh segments every
time is that NUMA systems configured to prefer local memory allocation
(as opposed to interleaved allocation) probably avoid cross-node
traffic. I haven't looked into that topic yet; I suppose one way to
deal with it in this scheme would be to have one such region per node,
and prefer to allocate from the local one.
Attachment | Content-Type | Size |
---|---|---|
0001-Support-DSM-segments-in-the-main-shmem-area.patch | text/x-patch | 16.1 KB |