From: Andres Freund <andres(at)anarazel(dot)de>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gather performance analysis
Date: 2021-09-07 21:38:44
Message-ID: 20210907213844.wyln22fasij54ltz@alap3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2021-08-06 14:00:48 +0530, Dilip Kumar wrote:
> --Setup
> SET parallel_tuple_cost TO 0 -- to test parallelism in the extreme case
> CREATE TABLE t (a int, b varchar);
> INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
> ANALYZE t;
> Test query: EXPLAIN ANALYZE SELECT * FROM t;
>
> Perf analysis: Gather Node
> - 43.57% shm_mq_receive
> - 78.94% shm_mq_receive_bytes
> - 91.27% pg_atomic_read_u64
> - pg_atomic_read_u64_impl
> - apic_timer_interrupt
> smp_apic_timer_interrupt
>
> Perf analysis: Worker Node
> - 99.14% shm_mq_sendv
> - 74.10% shm_mq_send_bytes
> + 42.35% shm_mq_inc_bytes_written
> - 32.56% pg_atomic_read_u64
> - pg_atomic_read_u64_impl
> - 86.27% apic_timer_interrupt
> + 17.93% WaitLatch
>
> From the perf results and also from the code analysis I can think of
> two main problems here
Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should
be inlined. And while perf can reconstruct inlined functions when using
--call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.
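For context, pg_atomic_read_u64() is only a thin static inline wrapper around the _impl function, so with optimizations enabled neither should normally show up as a frame of its own. A simplified sketch of that shape (not the actual port/atomics.h code; the my_ names are invented here):

#include <stdint.h>

/* Simplified illustration of the wrapper layering; the real code lives in
 * src/include/port/atomics.h and has platform-specific variants. */
typedef struct my_atomic_u64 { volatile uint64_t value; } my_atomic_u64;

static inline uint64_t
my_atomic_read_u64_impl(volatile my_atomic_u64 *ptr)
{
    return ptr->value;
}

static inline uint64_t
my_atomic_read_u64(volatile my_atomic_u64 *ptr)
{
    /* at -O2 this collapses into the caller; dwarf-based unwinding can
     * still reconstruct it, but marks it "(inlined)" */
    return my_atomic_read_u64_impl(ptr);
}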
FWIW, I see times like this:
postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Gather (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1) │
│ Workers Planned: 2 │
│ Workers Launched: 2 │
│ -> Parallel Seq Scan on t (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │
│ Planning Time: 0.043 ms │
│ Execution Time: 24954.012 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)
Looking at a profile, the biggest bottleneck in the leader (which becomes the
overall bottleneck as soon as the worker count is increased) is reading the
length word of the message. I do see shm_mq_receive_bytes() in the profile,
but the costly part there is the "read % (uint64) ringsize" - divisions are
slow. We could just compute a mask instead of the size.
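To make that concrete, something along these lines (just a sketch, not the actual shm_mq code; it assumes the ring size gets rounded up to a power of two so that a precomputed mask is valid):

#include <stdint.h>

/* Sketch: with a power-of-two ringsize, the hot-path
 * "read % (uint64) ringsize" becomes a bitwise AND with a mask that is
 * computed once at queue creation, avoiding the integer division. */
static inline uint64_t
ring_offset(uint64_t read, uint64_t ring_mask)
{
    /* ring_mask == ringsize - 1; only valid when ringsize is a power of two */
    return read & ring_mask;
}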
We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly the
ringsize (or mask).
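Roughly the layout change I have in mind (field names invented for illustration, not the real shm_mq definition; assumes 64-byte cache lines and a cache-line-aligned allocation):

#include <stdint.h>

#define CACHELINE_SIZE 64       /* assumed cache line size */

/* Illustration only: keep fields that are effectively constant after queue
 * creation on a different cache line than the counters the sender/receiver
 * update constantly, so the leader's reads of the read-mostly data don't
 * ping-pong with the workers' writes. */
typedef struct hypothetical_shm_mq
{
    /* read-mostly after initialization */
    uint64_t    mq_ring_mask;   /* ringsize - 1 */
    uint64_t    mq_ring_offset;
    void       *mq_receiver;
    void       *mq_sender;
    int         mq_detached;

    char        pad[CACHELINE_SIZE];    /* push hot counters onto their own line */

    /* frequently written by sender/receiver */
    uint64_t    mq_bytes_read;
    uint64_t    mq_bytes_written;
} hypothetical_shm_mq;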
Greetings,
Andres Freund