From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why JIT speed improvement is so modest?
Date: 2019-11-27 16:05:04
Message-ID: 20191127160504.uxw7tzbd2gwzwsov@development
Lists: pgsql-hackers
On Wed, Nov 27, 2019 at 06:38:45PM +0300, Konstantin Knizhnik wrote:
>
>
>On 25.11.2019 18:24, Merlin Moncure wrote:
>>On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik
>><k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>>>Why was JIT not able to significantly (several times) increase the
>>>speed of the Q1 query?
>>>The experiment with VOPS shows that the aggregation algorithm itself
>>>is not the bottleneck.
>>>The profile also gives no answer to this question.
>>>Any ideas?
>>Well, in the VOPS variant around 2/3 of the time is spent in routines
>>that are obviously aggregation. In the JIT version, it's around 20%.
>>So this suggests that the replacement execution engine is more
>>invasive. I would also guess (!) that the VOPS engine optimizes fewer
>>classes of query plan. ExecScan, for example, looks to be completely
>>optimized out in VOPS but is still utilized in the JIT engine.
>
>The difference in the fraction of time spent in aggregate calculation
>is not so large (2 times vs. 10 times).
>I suspected that a lot of time is spent in relation traversal code,
>tuple unpacking and visibility checks.
>To check this hypothesis I have implemented an in-memory table access
>method which stores tuples in unpacked form and
>doesn't perform any visibility checks at all.
>The results were not so exciting. I had to disable parallel execution
>(because it is not possible for tuples stored in backend-private
>memory).
>Results are the following:
>
>lineitem: 13736 msec
>inmem_lineitem: 10044 msec
>vops_lineitem: 1945 msec
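
(As an aside for anyone following along: I assume the inmem access
method is plugged in through the usual tableam handler hook, roughly
like the sketch below. The inmem_* names are taken from the profile
further down; everything else is illustrative and omits the many
callbacks a real access method has to provide.)

    /* Minimal sketch of registering an in-memory table AM, assuming the
     * PostgreSQL 12+ tableam interface.  Only the handler boilerplate is
     * shown; the required scan/insert/etc. callbacks are omitted. */
    #include "postgres.h"
    #include "access/tableam.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;
    PG_FUNCTION_INFO_V1(inmem_tableam_handler);

    static const TableAmRoutine inmem_methods = {
        .type = T_TableAmRoutine,
        /* .scan_getnextslot = inmem_getnextslot,  -- would hand back
         * already-unpacked tuples from backend-private memory and skip
         * visibility checks entirely */
    };

    Datum
    inmem_tableam_handler(PG_FUNCTION_ARGS)
    {
        PG_RETURN_POINTER(&inmem_methods);
    }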
>
>The profile of inmem_lineitem is the following:
>
> 16.79% postgres postgres [.] float4_accum
> 12.86% postgres postgres [.] float8_accum
> 5.83% postgres postgres [.] TupleHashTableHash.isra.8
> 4.44% postgres postgres [.] lookup_hash_entries
> 3.37% postgres postgres [.] check_float8_array
> 3.11% postgres postgres [.] tuplehash_insert
> 2.91% postgres postgres [.] hash_uint32
> 2.83% postgres postgres [.] ExecScan
> 2.56% postgres postgres [.] inmem_getnextslot
> 2.22% postgres postgres [.] FunctionCall1Coll
> 2.14% postgres postgres [.] LookupTupleHashEntry
> 1.95% postgres postgres [.] TupleHashTableMatch.isra.9
> 1.76% postgres postgres [.] pg_detoast_datum
> 1.58% postgres postgres [.] AggCheckCallContext
> 1.57% postgres postgres [.] tts_minimal_clear
> 1.35% postgres perf-3054.map [.] 0x00007f558db60010
> 1.23% postgres postgres [.] fetch_input_tuple
> 1.15% postgres postgres [.] SeqNext
> 1.06% postgres postgres [.] ExecAgg
> 1.00% postgres postgres [.] tts_minimal_store_tuple
>
>So now the fraction of time spent in aggregation has increased to 30%
>(vs. 20% for lineitem and 42% for vops_lineitem).
>It looks like the main bottleneck now is hashagg. VOPS accesses the
>hash table about 10 times less often (because it accumulates values
>for the whole tile),
>which explains the still-large difference between vops_lineitem and
>inmem_lineitem.
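
To make the per-tile argument concrete, here is a rough sketch (not
VOPS or PostgreSQL code; TILE_SIZE and the array-based group lookup are
assumptions for illustration only) of why tile-at-a-time aggregation
touches the shared group state far less often than row-at-a-time
hashagg:

    #include <stddef.h>

    #define TILE_SIZE 64   /* assumed tile size, purely illustrative */
    #define NGROUPS    4   /* Q1 groups by (l_returnflag, l_linestatus) */

    typedef struct { double sum; long count; } AggState;

    /* row-at-a-time: one group-state lookup and update per input row */
    static void
    agg_per_row(AggState groups[NGROUPS], const int *grp,
                const double *val, size_t nrows)
    {
        for (size_t i = 0; i < nrows; i++)
        {
            groups[grp[i]].sum += val[i];   /* probe for every row */
            groups[grp[i]].count++;
        }
    }

    /* tile-at-a-time: accumulate the whole tile locally, then update the
     * shared group state once, i.e. TILE_SIZE times fewer probes */
    static void
    agg_per_tile(AggState groups[NGROUPS], int tile_grp,
                 const double tile[TILE_SIZE])
    {
        double local_sum = 0.0;

        for (int i = 0; i < TILE_SIZE; i++)  /* tight, vectorizable loop */
            local_sum += tile[i];

        groups[tile_grp].sum += local_sum;   /* one probe per tile */
        groups[tile_grp].count += TILE_SIZE;
    }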
>
>If we remove the grouping (hash aggregation) and rewrite the Q1 query as:
>select
> avg(l_quantity) as sum_qty,
> avg(l_extendedprice) as sum_base_price,
> avg(l_extendedprice*(1-l_discount)) as sum_disc_price,
> avg(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
> avg(l_quantity) as avg_qty,
> avg(l_extendedprice) as avg_price,
> avg(l_discount) as avg_disc,
> count(*) as count_order
>from
> inmem_lineitem
>where
> l_shipdate <= '1998-12-01';
>
>then results are the following:
>lineitem: 9805 msec
>inmem_lineitem: 6257 msec
>vops_lineitem: 1865 msec
>
>and now profile of inmem_lineitem is:
>
> 25.27% postgres postgres [.] float4_accum
> 21.86% postgres postgres [.] float8_accum
> 5.49% postgres postgres [.] check_float8_array
> 4.57% postgres postgres [.] ExecScan
> 2.61% postgres postgres [.] AggCheckCallContext
> 2.30% postgres postgres [.] pg_detoast_datum
> 2.10% postgres postgres [.] inmem_getnextslot
> 1.81% postgres postgres [.] SeqNext
> 1.73% postgres postgres [.] fetch_input_tuple
> 1.61% postgres postgres [.] ExecAgg
> 1.23% postgres postgres [.] MemoryContextReset
>
>But there is still a more than 3x difference with VOPS!
>Something is wrong here...
>
I have no idea what VOPS does, but IIRC one of the bottlenecks compared
to various column stores is our iterative execution model, which makes
it difficult/impossible to vectorize operations. That's likely why the
accum functions are so high in the CPU profile.
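
To illustrate what I mean, a rough sketch (my assumptions, not actual
executor or float8_accum code) of the per-row transition-function calls
our model does versus a batched transition a compiler can
auto-vectorize:

    #include <stddef.h>

    /* simplified transition state, loosely modeled on what the float
     * aggregates track (count, sum, sum of squares) */
    typedef struct { double N; double sumX; double sumX2; } FloatAccumState;

    /* row-at-a-time: one call (plus fmgr overhead in the real executor)
     * per input value */
    static void
    accum_one(FloatAccumState *st, double x)
    {
        st->N += 1.0;
        st->sumX += x;
        st->sumX2 += x * x;
    }

    /* batch-at-a-time over a column chunk: no per-row call overhead, and
     * the loop is a candidate for SIMD auto-vectorization */
    static void
    accum_batch(FloatAccumState *st, const double *x, size_t n)
    {
        double sx = 0.0, sx2 = 0.0;

        for (size_t i = 0; i < n; i++)
        {
            sx += x[i];
            sx2 += x[i] * x[i];
        }
        st->N += (double) n;
        st->sumX += sx;
        st->sumX2 += sx2;
    }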
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services