Re: Potential performance issues related to group by and covering index

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: "Liu, Xinyu" <liuxy(at)gatech(dot)edu>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Potential performance issues related to group by and covering index
Date: 2021-03-02 09:08:47
Message-ID: CAFj8pRAzBTpzbXr5Tu9p3ftPqUVpy0kCHX4wXiVz1goSJ1yTzw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

út 2. 3. 2021 v 9:53 odesílatel Liu, Xinyu <liuxy(at)gatech(dot)edu> napsal:

>
>
>
>
>
>
>
>
>
>
> * Hello, We have 2 TPC-H queries which fetch the same tuples but have
> significant query execution time differences (4.3 times). We are sharing a
> pair of TPC-H queries that exhibit this performance difference: First
> query: SELECT "ps_comment", "ps_suppkey", "ps_supplycost",
> "ps_partkey", "ps_availqty" FROM "partsupp" WHERE
> "ps_partkey" + 16 < 1 OR "ps_partkey" = 2 GROUP
> BY "ps_partkey", "ps_suppkey", "ps_availqty",
> "ps_supplycost", "ps_comment" Second query: SELECT
> "ps_comment", "ps_suppkey", "ps_supplycost",
> "ps_partkey", "ps_availqty" FROM "partsupp" WHERE
> "ps_partkey" + 16 < 1 OR "ps_partkey" = 2 GROUP BY
> "ps_comment", "ps_suppkey", "ps_supplycost",
> "ps_partkey", "ps_availqty" * Actual Behavior We
> executed both queries on the TPC-H benchmark of scale factor 5: the first
> query takes over 1.7 seconds, while the second query only takes 0.4
> seconds. We think the time difference results from different plans
> selected. Specifically, in the first (slow) query, the DBMS performs an
> index scan on table partsupp using the covering index (ps_partkey,
> ps_suppkey), while the second (fast) query performs a parallel scan on
> (ps_suppkey, ps_partkey). * Query Execution Plan - First query:
> QUERY
> PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------
> Group (cost=0.43..342188.58 rows=399262 width=144) (actual
> time=0.058..1737.659 rows=4 loops=1) Group Key: ps_partkey, ps_suppkey
> Buffers: shared hit=123005 read=98055 -> Index Scan using
> partsupp_pkey on partsupp (cost=0.43..335522.75 rows=1333167 width=144)
> (actual time=0.055..1737.651 rows=4 loops=1) Filter: (((ps_partkey
> + 16) < 1) OR (ps_partkey = 2)) Rows Removed by Filter: 3999996
> Buffers: shared hit=123005 read=98055 Planning Time: 0.926 ms
> Execution Time: 1737.754 ms (9 rows) *
>

In this case there is brutal overestimation. Probably due planner
unfriendly written predicate ps_partkey + 16 < 1) OR ps_parkey = 2. You
can try to rewrite this predicate to ps_parthkey < -15 OR ps_parkey = 2

Regards

Pavel

>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> * - Second query:
> QUERY
> PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------
> Group (cost=250110.68..350438.93 rows=399262 width=144) (actual
> time=400.353..400.361 rows=4 loops=1) Group Key: ps_suppkey, ps_partkey
> Buffers: shared hit=5481 read=24093 -> Gather Merge
> (cost=250110.68..346446.31 rows=798524 width=144) (actual
> time=400.351..406.741 rows=4 loops=1) Workers Planned: 2
> Workers Launched: 2 Buffers: shared hit=15151 read=72144
> -> Group (cost=249110.66..253276.80 rows=399262 width=144)
> (actual time=395.882..395.883 rows=1 loops=3) Group Key:
> ps_suppkey, ps_partkey Buffers: shared hit=15151 read=72144
> -> Sort (cost=249110.66..250499.37 rows=555486 width=144)
> (actual time=395.880..395.881 rows=1 loops=3) Sort
> Key: ps_suppkey, ps_partkey Sort Method: quicksort
> Memory: 25kB Worker 0: Sort Method: quicksort
> Memory: 25kB Worker 1: Sort Method: quicksort
> Memory: 25kB Buffers: shared hit=15151 read=72144
> -> Parallel Seq Scan on partsupp
> (cost=0.00..116363.88 rows=555486 width=144) (actual time=395.518..395.615
> rows=1 loops=3) Filter: (((ps_partkey + 16) < 1)
> OR (ps_partkey = 2)) Rows Removed by Filter:
> 1333332 Buffers: shared hit=15065 read=72136
> Planning Time: 0.360 ms Execution Time: 406.880 ms (22 rows) *Expected
> Behavior Since these two queries are semantically equivalent, we were
> hoping that PostgreSQL would evaluate them in roughly the same amount of
> time. It looks to me that different order of group by clauses triggers
> different plans: when the group by clauses (ps_partkey, ps_suppkey) is the
> same as the covering index, it will trigger an index scan on associated
> columns; however, when the group by clauses have different order than the
> covering index (ps_suppkey, ps_partkey), the index scan will not be
> triggered. Given that the user might not pay close attention to this subtle
> difference, I was wondering if it is worth making these two queries have
> the same and predictable performance on Postgresql. *Test Environment
> Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun
> 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux" PostgreSQL v12.3
> Database: TPC-H benchmark (with scale factor 5) The description of table
> partsupp is as follows: tpch5=# \d partsupp; Table
> "public.partsupp" Column | Type | Collation |
> Nullable | Default
> ---------------+------------------------+-----------+----------+---------
> ps_partkey | integer | | not null |
> ps_suppkey | integer | | not null |
> ps_availqty | integer | | not null |
> ps_supplycost | numeric(15,2) | | not null |
> ps_comment | character varying(199) | | not null | Indexes:
> "partsupp_pkey" PRIMARY KEY, btree (ps_partkey, ps_suppkey) Foreign-key
> constraints: "partsupp_fk1" FOREIGN KEY (ps_suppkey) REFERENCES
> supplier(s_suppkey) "partsupp_fk2" FOREIGN KEY (ps_partkey) REFERENCES
> part(p_partkey) Referenced by: TABLE "lineitem" CONSTRAINT
> "lineitem_fk2" FOREIGN KEY (l_partkey, l_suppkey) REFERENCES
> partsupp(ps_partkey, ps_suppkey) *Here are the steps for reproducing our
> observations: 1. Download the dataset from the link:
> https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing
> <https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing>
> 2. Set up TPC-H benchmark tar xzvf tpch5_postgresql.tar.gz cd
> tpch5_postgresql db=tpch5 createdb $db psql -d $db < dss.ddl for i in `ls
> *.tbl` do echo $i name=`echo $i|cut -d'.' -f1` psql -d $db -c
> "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING 'LATIN1';" done psql -d
> $db < dss_postgres.ri 1. Execute the queries *
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message David Rowley 2021-03-02 09:49:20 Re: Potential performance issues related to group by and covering index
Previous Message Liu, Xinyu 2021-03-02 04:47:00 Performance issues related to left join and order by