Reducing planning time of large IN queries on primary key / unique columns

From: Souvik Bhattacherjee <pgsdbhacker(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Reducing planning time of large IN queries on primary key / unique columns
Date: 2022-08-11 21:42:15
Message-ID: CAJCGbZNiViHRTR2evS7ecbj3ihKk-gPxWG8im7Wr9U4xa-31UQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

At ServiceNow, we frequently encounter queries with very large IN lists
where the number of elements in the IN list range from a few hundred to
several thousand. For a significant fraction of the queries, the IN clauses
are constructed on primary key columns. While planning these queries,
Postgres query planner loops over every element in the IN clause, computing
the selectivity of each element and then uses that as an input to compute
the total selectivity of the IN clause. For IN clauses on primary key or
unique columns, it is easy to see that the selectivity of the IN predicate
is given by (number of elements in the IN clause / table cardinality) and
is independent of the selectivity of the individual elements. We use this
observation to avoid computing the selectivities of the individual
elements. This results in an improvement in the planning time especially
when the number of elements in the IN clause is relatively large.

The table below demonstrates the improvement in planning time (averaged
over 3 runs) for IN queries of the form SELECT COUNT(*) FROM table_a WHERE
sys_id IN ('000356e61b568510eabcca2b234bcb08',
'00035846db2f24101ad7f256b9961925', ...). Here sys_id is the primary key
column of type VARCHAR(32) and the table cardinality of table_a is around
10M.

Number of IN list elements

Planning time w/o optimization (in ms)

Planning time w/ optimization (in ms)

Speedup

500

0.371

0.236

1.57

5000

2.019

0.874

2.31

50000

19.886

8.273

2.40

Similar to IN clauses, the selectivity of NOT IN clauses on a primary key
or unique column can be computed by not computing the selectivities of
individual elements. The query used is of the form SELECT COUNT(*) FROM
table_a WHERE sys_id NOT IN ('000356e61b568510eabcca2b234bcb08',
'00035846db2f24101ad7f256b9961925', ...).

Number of NOT IN list elements

Planning time w/o optimization (in ms)

Planning time w/ optimization (in ms)

Speedup

500

0.380

0.248

1.53

5000

2.534

0.854

2.97

50000

21.316

9.009

2.36

We also obtain planning time of queries on a primary key column of type
INTEGER with 10M elements for both IN and NOT in queries.

Number of IN list elements

Planning time w/o optimization (in ms)

Planning time w/ optimization (in ms)

Speedup

500

0.370

0.208

1.78

5000

1.998

0.816

2.45

50000

18.073

6.750

2.67

Number of NOT IN list elements

Planning time w/o optimization (in ms)

Planning time w/ optimization (in ms)

Speedup

500

0.342

0.203

1.68

5000

2.073

0.822

3.29

50000

19.551

6.738

2.90

We see that the planning time of queries on unique columns are identical to
that we observed for primary key columns. The resulting patch file for the
changes above is small and we are happy to polish it up and share.

Best,

Souvik Bhattacherjee

(ServiceNow)

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Souvik Bhattacherjee 2022-08-11 22:04:15 Re: Reducing planning time of large IN queries on primary key / unique columns
Previous Message Andres Freund 2022-08-11 21:12:46 Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock