Allow ILIKE forward matching to use btree index

From: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Allow ILIKE forward matching to use btree index
Date: 2024-12-19 18:22:26
Message-ID: 20241220032226.9a3429207417d28b7e482024@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Currently, btree indexes cannot used for ILIKE (~~*) operator if the pattern
has case-varying characters although LIKE (~~) expression can be converted
to indexable clauses by the planner support function (if the collation
is "C" or operator class 'text_pattern_ops' is used).

For example, "t ~~ '123foo%'" is converted to "(t >= '123foo' AND t < '123fop')"
and index scan can be used for this condition. On the other hand, "t ~~* '123foo'"
cannot be converted and sequential scan is used.

Even in this case, we can use a bitmap index scan for the condition
"(t >= '123f' AND t < '123g') OR "(t >= '123F' AND t < '123G')" followed by
recheck by the original condition "t ~~* '123foo'", and this could be faster
than seqscan.

However, even if the support function would return OR clauses, the current
planner implementation cannot not build bitmap scan paths using them.

The attached patches allow to ILIKE (~~*) forward matching to use btree index.

The patch 0001 enables get_index_clause_from_support to receive OR clauses
from support functions and use them to build bitmap index scan later. OR clauses
returned by support functions are collected in the new argument 'orclauses" of
match_restriction_clauses_to_index(), and they are passed to
generate_bitmap_or_paths() later to build bitmap scan paths.

The patch 0002 modifies the support function to return OR clauses as an example
above when ILIKE's pattern has case-varying characters in forward matching. The
OR clauses contains two index conditions for the upper and the lower letter of
the first case-varying character in the pattern respectively. Although the
subsequent characters are not considered in the index scans, it could enough be
faster then sequent scan.

Here is an example.

1. Create a table with random text records

=# CREATE TABLE tbl (t text);
=# INSERT INTO tbl SELECT CASE WHEN i%2=1 THEN upper(x) ELSE x END
FROM (SELECT i, md5(i::text) x FROM generate_series(1,5000000) i);

2. Create an index
=# CREATE INDEX ON tbl (t text_pattern_ops);

3. Before applying patch: Seq Scan was selected. It takes about 4 sec.

=# EXPLIN ANALYZE SELECT * FROM tbl WHERE t ILIKE '12ab%';

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..58712.97 rows=500 width=33) (actual time=101.096..4110.979 rows=64 loops=1)
Workers Planned: 4
Workers Launched: 4
Buffers: shared hit=10778 read=31260
-> Parallel Seq Scan on tbl (cost=0.00..57662.97 rows=125 width=33) (actual time=440.232..4101.023 rows=13 loops=5)
Filter: (t ~~* '12ab%'::text)
Rows Removed by Filter: 999987
Buffers: shared hit=10778 read=31260
Planning Time: 0.415 ms
Execution Time: 4111.051 ms
(10 rows)

4. After applying patch: Bitmap Index Scan was selected. It takes only 10 ms.

=# EXPLIN ANALYZE SELECT * FROM tbl WHERE t ILIKE '12ab%';
QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
--
Bitmap Heap Scan on tbl (cost=9.38..13.40 rows=500 width=33) (actual time=3.720..9.660 rows=64 loops=1)
Recheck Cond: (((t ~>=~ '12A'::text) AND (t ~<~ '12B'::text) AND (t ~~* '12ab%'::text)) OR ((t ~>=~ '12a'::text) AND (t ~<~ '12b'::text) AND (t ~~* '12ab%'::text))
)
Filter: (t ~~* '12ab%'::text)
Rows Removed by Filter: 1205
Heap Blocks: exact=1254
Buffers: shared hit=1271
-> BitmapOr (cost=9.38..9.38 rows=1 width=0) (actual time=2.370..2.372 rows=0 loops=1)
Buffers: shared hit=17
-> Bitmap Index Scan on tbl_t_idx (cost=0.00..4.57 rows=1 width=0) (actual time=1.192..1.193 rows=604 loops=1)
Index Cond: ((t ~>=~ '12A'::text) AND (t ~<~ '12B'::text))
Buffers: shared hit=8
-> Bitmap Index Scan on tbl_t_idx (cost=0.00..4.57 rows=1 width=0) (actual time=1.174..1.174 rows=675 loops=1)
Index Cond: ((t ~>=~ '12a'::text) AND (t ~<~ '12b'::text))
Buffers: shared hit=9
Planning Time: 1.144 ms
Execution Time: 9.785 ms
(16 rows)

What do you think about it?

I think another application of OR-clause returning support function might be
allowing to use an index for NOT LIKE (!~~) expression because, for example,
"t !~~ '123foo%'" could be converted to "(t < '123foo' OR t >= '123fooz')".
(The second condition is a bit loose but this would be safe and not problem
since tuples are filtered by the original condition after the index scan.)
However, it would not very useful unless the distribution is much skew so that
NOT LIKE expression's selectivity is enough small.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>

Attachment Content-Type Size
0002-Allow-ILIKE-forward-matching-to-use-btree-index.patch text/x-diff 11.3 KB
0001-Allow-planner-support-function-to-return-index-condi.patch text/x-diff 14.5 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Cary Huang 2024-12-19 19:05:44 Re: sslinfo extension - add notbefore and notafter timestamps
Previous Message Fujii Masao 2024-12-19 17:57:02 Re: Doc: clarify the log message level of the VERBOSE option