Re: Reducing planning time of large IN queries on primary key / unique columns

From: Souvik Bhattacherjee <pgsdbhacker(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reducing planning time of large IN queries on primary key / unique columns
Date: 2022-08-11 22:04:15
Message-ID: CAJCGbZO1FwytN51tdfxJBNyPzuXgCixXio6tyevT09+p=0OjXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

(Re-posting with better formatting)

Hi hackers,

At ServiceNow, we frequently encounter queries with very large IN lists
where the number of elements in the IN list range from a

few hundred to several thousand. For a significant fraction of the queries,
the IN clauses are constructed on primary key columns.

While planning these queries, Postgres query planner loops over every
element in the IN clause, computing the selectivity of each

element and then uses that as an input to compute the total selectivity of
the IN clause. For IN clauses on primary key or unique

columns, it is easy to see that the selectivity of the IN predicate is
given by (number of elements in the IN clause / table cardinality)

and is independent of the selectivity of the individual elements. We use
this observation to avoid computing the selectivities of the

individual elements. This results in an improvement in the planning time
especially when the number of elements in the IN clause

is relatively large.

The table below demonstrates the improvement in planning time (averaged
over 3 runs) for IN queries of the form

SELECT COUNT(*) FROM table_a WHERE sys_id IN
('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925',
...).

Here sys_id is the primary key column of type VARCHAR(32) and the table
cardinality of table_a is around 10M.

Number of IN list elements | Planning time w/o optimization (in ms) | Planning
time w/ optimization (in ms) | Speedup

------------------------------------|---------------------------------------------------
| --------------------------------------------------|--------------

500 | 0.371
| 0.236
| 1.57

5000 | 2.019
| 0.874
| 2.31

50000 | 19.886
| 8.273
| 2.40

Similar to IN clauses, the selectivity of NOT IN clauses on a primary key
or unique column can be computed by not computing the

selectivities of individual elements. The query used is of the form SELECT
COUNT(*) FROM table_a WHERE sys_id NOT IN

('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925',
...).

Number of NOT IN list elements | Planning time w/o optimization (in
ms) | Planning
time w/ optimization (in ms) | Speedup

-------------------------------------------|---------------------------------------------------
| --------------------------------------------------|--------------

500 | 0.380
| 0.248
| 1.53

5000 | 2.534
| 0.854
| 2.97

50000 | 21.316
| 9.009
| 2.36

We also obtain planning time of queries on a primary key column of type
INTEGER with 10M elements for both IN and NOT in queries.

Number of IN list elements | Planning time w/o optimization (in ms) | Planning
time w/ optimization (in ms) | Speedup

------------------------------------|---------------------------------------------------
| --------------------------------------------------|--------------

500 | 0.370
| 0.208
| 1.78

5000 | 1.998
| 0.816
| 2.45

50000 | 18.073
| 6.750
| 2.67

Number of NOT IN list elements | Planning time w/o optimization (in
ms) | Planning
time w/ optimization (in ms) | Speedup

-------------------------------------------|---------------------------------------------------
| --------------------------------------------------|--------------

500 | 0.342
| 0.203
| 1.68

5000 | 2.073
| 0.822
| 3.29

50000 |19.551
| 6.738
| 2.90

We see that the planning time of queries on unique columns are identical to
that we observed for primary key columns.

The resulting patch file for the changes above is small and we are happy to
polish it up and share.

Best,

Souvik Bhattacherjee

(ServiceNow)

On Thu, Aug 11, 2022 at 2:42 PM Souvik Bhattacherjee <pgsdbhacker(at)gmail(dot)com>
wrote:

> Hi hackers,
>
> At ServiceNow, we frequently encounter queries with very large IN lists
> where the number of elements in the IN list range from a few hundred to
> several thousand. For a significant fraction of the queries, the IN clauses
> are constructed on primary key columns. While planning these queries,
> Postgres query planner loops over every element in the IN clause, computing
> the selectivity of each element and then uses that as an input to compute
> the total selectivity of the IN clause. For IN clauses on primary key or
> unique columns, it is easy to see that the selectivity of the IN predicate
> is given by (number of elements in the IN clause / table cardinality) and
> is independent of the selectivity of the individual elements. We use this
> observation to avoid computing the selectivities of the individual
> elements. This results in an improvement in the planning time especially
> when the number of elements in the IN clause is relatively large.
>
>
>
> The table below demonstrates the improvement in planning time (averaged
> over 3 runs) for IN queries of the form SELECT COUNT(*) FROM table_a WHERE
> sys_id IN ('000356e61b568510eabcca2b234bcb08',
> '00035846db2f24101ad7f256b9961925', ...). Here sys_id is the primary key
> column of type VARCHAR(32) and the table cardinality of table_a is around
> 10M.
>
>
>
> Number of IN list elements
>
> Planning time w/o optimization (in ms)
>
> Planning time w/ optimization (in ms)
>
> Speedup
>
> 500
>
> 0.371
>
> 0.236
>
> 1.57
>
> 5000
>
> 2.019
>
> 0.874
>
> 2.31
>
> 50000
>
> 19.886
>
> 8.273
>
> 2.40
>
>
>
> Similar to IN clauses, the selectivity of NOT IN clauses on a primary key
> or unique column can be computed by not computing the selectivities of
> individual elements. The query used is of the form SELECT COUNT(*) FROM
> table_a WHERE sys_id NOT IN ('000356e61b568510eabcca2b234bcb08',
> '00035846db2f24101ad7f256b9961925', ...).
>
>
>
> Number of NOT IN list elements
>
> Planning time w/o optimization (in ms)
>
> Planning time w/ optimization (in ms)
>
> Speedup
>
> 500
>
> 0.380
>
> 0.248
>
> 1.53
>
> 5000
>
> 2.534
>
> 0.854
>
> 2.97
>
> 50000
>
> 21.316
>
> 9.009
>
> 2.36
>
>
>
> We also obtain planning time of queries on a primary key column of type
> INTEGER with 10M elements for both IN and NOT in queries.
>
>
> Number of IN list elements
>
> Planning time w/o optimization (in ms)
>
> Planning time w/ optimization (in ms)
>
> Speedup
>
> 500
>
> 0.370
>
> 0.208
>
> 1.78
>
> 5000
>
> 1.998
>
> 0.816
>
> 2.45
>
> 50000
>
> 18.073
>
> 6.750
>
> 2.67
>
>
>
>
> Number of NOT IN list elements
>
> Planning time w/o optimization (in ms)
>
> Planning time w/ optimization (in ms)
>
> Speedup
>
> 500
>
> 0.342
>
> 0.203
>
> 1.68
>
> 5000
>
> 2.073
>
> 0.822
>
> 3.29
>
> 50000
>
> 19.551
>
> 6.738
>
> 2.90
>
>
>
> We see that the planning time of queries on unique columns are identical
> to that we observed for primary key columns. The resulting patch file for
> the changes above is small and we are happy to polish it up and share.
>
>
> Best,
>
> Souvik Bhattacherjee
>
> (ServiceNow)
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2022-08-11 22:28:31 Re: [PATCH] Expose port->authn_id to extensions and triggers
Previous Message Souvik Bhattacherjee 2022-08-11 21:42:15 Reducing planning time of large IN queries on primary key / unique columns