From: | Maksim Milyutin <milyutinma(at)gmail(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, sk(at)zsrv(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: BUG #15585: infinite DynamicSharedMemoryControlLock waiting in parallel query |
Date: | 2019-01-16 14:23:01 |
Message-ID: | 458c691a-1b2d-7a70-1c04-58361b894011@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
On 1/11/19 8:16 AM, Thomas Munro wrote:
> On Fri, Jan 11, 2019 at 10:04 AM Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> I am trying to reproduce it.
> I ran squillions of concurrent parallel queries today, making sure
> they would allocate and free entire segments a lot (based on the
> assumption that something along those lines must be required). I made
> sure to use Linux, GCC, -O2, no asserts (since both reports came from
> that environment, and I'd previously failed to reproduce this on my
> usual tool chain/platforms), and I used a multi-socket box (in case
> some cache coherency effect was not occurring on my development
> laptops). I did learn some interesting things about parallel query
> performance that I plan to follow up on, but I had no luck in
> reproducing this error. Rats.
>
> One observation is that the other report involved a fairly simple
> Parallel Hash Join query (on 11), and this report involves Parallel
> Index Scan and Parallel Bitmap Index Scan (on 10), so that suggests
> that it's probably not a bug in the parallel executor code (for
> example an access-after-free, whose undefined behaviour could in
> theory look like this with unlucky timing, I speculate) but rather
> something lower level.
Recently one of our customers encountered the same problem on PG 11.1.
The query structure and error messages are similar to one that was
described in [1].
OS/PG settings:
Ubuntu 16.04
mem 128 GB
cpu 16 cores
huge_pages is turned off
# sudo cat /proc/sys/kernel/shmmax
18446744073692774399
max_connections = 200
shared_buffers = 33GB
dynamic_shared_memory_type = posix
max_worker_processes = 16
max_parallel_maintenance_workers = 6
max_parallel_workers_per_gather = 6
max_parallel_workers = 12
The query was:
SELECT
ev.date,
r.indicator_id,
r.service_reg_id,
count(r.value),
count(CASE WHEN r.value::int > 3 THEN 1 ELSE NULL END),
avg(r.value::int)
FROM base.hershel_events AS ev
INNER JOIN base.hershel_rates AS r ON r.event_id = ev.id
WHERE ev.date between '2018-09-15' AND '2018-12-15'
GROUP BY ev.date, r.indicator_id, r.service_reg_id
ORDER BY ev.date ASC
Its plan with record statistics:
Finalize GroupAggregate (actual rows=81332937 loops=1)
Group Key: ev.date, r.indicator_id, r.service_reg_id
-> Gather Merge (actual rows=81332937 loops=1)
Workers Planned: 7
Workers Launched: 7
-> Partial GroupAggregate (actual rows=10166617 loops=8)
Group Key: ev.date, r.indicator_id, r.service_reg_id
-> Sort (actual rows=10166617 loops=8)
Sort Key: ev.date, r.indicator_id, r.service_reg_id
Sort Method: external merge Disk: 549392kB
Worker 0: Sort Method: external merge Disk: 516104kB
Worker 1: Sort Method: external merge Disk: 533200kB
Worker 2: Sort Method: external merge Disk: 540680kB
Worker 3: Sort Method: external merge Disk: 540760kB
Worker 4: Sort Method: external merge Disk: 530512kB
Worker 5: Sort Method: external merge Disk: 472240kB
Worker 6: Sort Method: external merge Disk: 539256kB
-> Parallel Hash Join (actual rows=10166617 loops=8)
Hash Cond: (r.event_id = ev.id)
-> Parallel Seq Scan on hershel_rates r
(actual rows=15000000 loops=8)
-> Parallel Hash (actual rows=9319175 loops=8)
Buckets: 4194304 Batches: 32 Memory
Usage: 124064kB
-> Parallel Seq Scan on
hershel_events ev (actual rows=9319175 loops=8)
Filter: ((date >=
'2018-09-15'::date) AND (date <= '2018-12-15'::date))
Rows Removed by Filter: 4430825
At some moment one of parallel workers emitted to log the error message:
dsa_area could not attach to segment. After terminating all workers (the
leader also bumped into the same error) the last one began to emit to
log the following type of messages: LOG: temporary file: path
"base/pgsql_tmp/pgsql_tmp5027.0.sharedfileset/i55of128.p1.0", size
458752. The summary size of temp files that was mentioned in the log was
about *14GB*. And eventually that last worker got the fatal error:
cannot unpin a segment that is not pinned. After that the
DynamicSharedMemoryControlLock left unreleased and soon new backends
hanged on it.
Another time some backend (more likely, it was some utility worker)
bumped into the error: could not open shared memory segment
"/PostgreSQL.1948558832": No such file or directory. After that other
backends stopped on DynamicSharedMemoryControlLock. Before this issue
there were parallel aggregate queries.
Unfortunately our customer doesn't want to catch up this error on
production system again. He turned off the query parallelism. And this
issue doesn't reproduce on test environment. AFAIK this is heisenbug and
is reproducible on heavy load. I'll try to run some stress tests to
catch up stack trace.
[1]
https://www.postgresql.org/message-id/flat/20181231221734.GB25379%40telsasoft.com
<https://www.postgresql.org/message-id/flat/20181231221734(dot)GB25379(at)telsasoft(dot)com>
--
Regards, Maksim Milyutin
Postgres Professional:http://www.postgrespro.ru
The Russian Postgres Company
From | Date | Subject | |
---|---|---|---|
Next Message | PG Bug reporting form | 2019-01-16 14:56:34 | BUG #15594: Unstable tests in contrib/test_decoding/output_iso/ |
Previous Message | Amit Langote | 2019-01-16 09:30:02 | Re: BUG #15587: Partitions with ALTER TABLE ADD CONSTRAINT |