Re: Allow ILIKE forward matching to use btree index

From: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
To: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Allow ILIKE forward matching to use btree index
Date: 2025-01-14 16:40:46
Message-ID: 20250115014046.980463d8d2d309d040974e63@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 24 Dec 2024 16:04:42 +0900
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:

> On Fri, 20 Dec 2024 03:22:26 +0900
> Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:
>
> > Hi,
> >
> > Currently, btree indexes cannot used for ILIKE (~~*) operator if the pattern
> > has case-varying characters although LIKE (~~) expression can be converted
> > to indexable clauses by the planner support function (if the collation
> > is "C" or operator class 'text_pattern_ops' is used).
> >
> > For example, "t ~~ '123foo%'" is converted to "(t >= '123foo' AND t < '123fop')"
> > and index scan can be used for this condition. On the other hand, "t ~~* '123foo'"
> > cannot be converted and sequential scan is used.
> >
> > Even in this case, we can use a bitmap index scan for the condition
> > "(t >= '123f' AND t < '123g') OR "(t >= '123F' AND t < '123G')" followed by
> > recheck by the original condition "t ~~* '123foo'", and this could be faster
> > than seqscan.
> >
> > However, even if the support function would return OR clauses, the current
> > planner implementation cannot not build bitmap scan paths using them.
> >
> >
> > The attached patches allow to ILIKE (~~*) forward matching to use btree index.
> >
> > The patch 0001 enables get_index_clause_from_support to receive OR clauses
> > from support functions and use them to build bitmap index scan later. OR clauses
> > returned by support functions are collected in the new argument 'orclauses" of
> > match_restriction_clauses_to_index(), and they are passed to
> > generate_bitmap_or_paths() later to build bitmap scan paths.
> >
> > The patch 0002 modifies the support function to return OR clauses as an example
> > above when ILIKE's pattern has case-varying characters in forward matching. The
> > OR clauses contains two index conditions for the upper and the lower letter of
> > the first case-varying character in the pattern respectively. Although the
> > subsequent characters are not considered in the index scans, it could enough be
> > faster then sequent scan.
> >
> > Here is an example.
> >
> > 1. Create a table with random text records
> >
> > =# CREATE TABLE tbl (t text);
> > =# INSERT INTO tbl SELECT CASE WHEN i%2=1 THEN upper(x) ELSE x END
> > FROM (SELECT i, md5(i::text) x FROM generate_series(1,5000000) i);
> >
> > 2. Create an index
> > =# CREATE INDEX ON tbl (t text_pattern_ops);
> >
> > 3. Before applying patch: Seq Scan was selected. It takes about 4 sec.
> >
> > =# EXPLIN ANALYZE SELECT * FROM tbl WHERE t ILIKE '12ab%';
>
> I am sorry, the example in my previous main was wrong. I showed the plan
> with enable_index_scan = off. Indead, if the pattern starts with
> case-insensitive characters like '12', an index scan can be used even with
> ILIKE.
>
> postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------
> Gather (cost=1000.00..69129.61 rows=500 width=33) (actual time=36.317..4034.770 rows=1188 loops=1)
> Workers Planned: 2
> Workers Launched: 2
> Buffers: shared hit=99 read=41939
> -> Parallel Seq Scan on tbl (cost=0.00..68079.61 rows=208 width=33) (actual time=19.908..4023.668 rows=396 loops=3)
> Filter: (t ~~* 'abc%'::text)
> Rows Removed by Filter: 1666271
> Buffers: shared hit=99 read=41939
> Planning Time: 0.726 ms
> Execution Time: 4035.101 ms
> (10 rows)
>
> 4. After applying patch: Bitmap Index Scan was selected.
>
> postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on tbl (cost=12563.66..58314.53 rows=500 width=33) (actual time=156.485..1266.789 rows=1188 loops=1)
> Recheck Cond: (((t ~>=~ 'A'::text) AND (t ~<~ 'B'::text) AND (t ~~* 'abc%'::text)) OR ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text) AND (t ~~* 'abc%'::text)))
> Filter: (t ~~* 'abc%'::text)
> Rows Removed by Filter: 311473
> Heap Blocks: exact=42010
> Buffers: shared hit=1 read=44600
> -> BitmapOr (cost=12563.66..12563.66 rows=297029 width=0) (actual time=136.979..136.980 rows=0 loops=1)
> Buffers: shared hit=1 read=2590
> -> Bitmap Index Scan on tbl_t_idx (cost=0.00..6281.71 rows=148515 width=0) (actual time=80.548..80.549 rows=158375 loops=1)
> Index Cond: ((t ~>=~ 'A'::text) AND (t ~<~ 'B'::text))
> Buffers: shared read=1272
> -> Bitmap Index Scan on tbl_t_idx (cost=0.00..6281.71 rows=148515 width=0) (actual time=56.427..56.427 rows=157042 loops=1)
> Index Cond: ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text))
> Buffers: shared hit=1 read=1318
> Planning Time: 0.592 ms
> Execution Time: 1267.162 ms
> (16 rows)
>
>
> > What do you think about it?
> >
> > I think another application of OR-clause returning support function might be
> > allowing to use an index for NOT LIKE (!~~) expression because, for example,
> > "t !~~ '123foo%'" could be converted to "(t < '123foo' OR t >= '123fooz')".
> > (The second condition is a bit loose but this would be safe and not problem
> > since tuples are filtered by the original condition after the index scan.)
> > However, it would not very useful unless the distribution is much skew so that
> > NOT LIKE expression's selectivity is enough small.

I added a new patch 0003 that enables ILIKE forward matching to use a SP-Gist index.
Similar to 0002, this generates BitmapOr Index Scan for two index clauses that use
"^@" operator for upper letter and lower letter pattern respectively.

* Before applying the patch:

postgres=# explain analyze select* from tbl where t ilike 'abc%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..18899.52 rows=101 width=33) (actual time=41.799..930.382 rows=253 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=96 read=12533
-> Parallel Seq Scan on tbl (cost=0.00..17889.42 rows=42 width=33) (actual time=26.041..917.570 rows=84 loops=3)
Filter: (t ~~* 'abc%'::text)
Rows Removed by Filter: 336582
Buffers: shared hit=96 read=12533
Planning Time: 0.690 ms
Execution Time: 930.545 ms
(10 rows)

* After applying the patch:

postgres=# explain analyze select* from tbl where t ilike 'abc%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=3307.96..16702.11 rows=101 width=33) (actual time=69.449..215.636 rows=253 loops=1)
Recheck Cond: (((t ^@ 'A'::text) AND (t ~~* 'abc%'::text)) OR ((t ^@ 'a'::text) AND (t ~~* 'abc%'::text)))
Filter: (t ~~* 'abc%'::text)
Rows Removed by Filter: 62992
Heap Blocks: exact=12437
Buffers: shared hit=18529
-> BitmapOr (cost=3307.96..3307.96 rows=61212 width=0) (actual time=62.567..62.568 rows=0 loops=1)
Buffers: shared hit=6092
-> Bitmap Index Scan on tbl_t_idx (cost=0.00..1653.96 rows=30606 width=0) (actual time=41.893..41.893 rows=31536 loops=1)
Index Cond: (t ^@ 'A'::text)
Buffers: shared hit=2461
-> Bitmap Index Scan on tbl_t_idx (cost=0.00..1653.96 rows=30606 width=0) (actual time=20.671..20.671 rows=31709 loops=1)
Index Cond: (t ^@ 'a'::text)
Buffers: shared hit=3631
Planning Time: 1.414 ms
Execution Time: 215.903 ms
(16 rows)

--
Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>

Attachment Content-Type Size
v2-0003-Allow-ILIKE-forward-matching-to-use-spgist-index.patch text/x-diff 3.3 KB
v2-0002-Allow-ILIKE-forward-matching-to-use-btree-index.patch text/x-diff 11.3 KB
v2-0001-Allow-planner-support-function-to-return-index-co.patch text/x-diff 14.5 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2025-01-14 16:57:51 Re: [PATCH] Hex-coding optimizations using SVE on ARM.
Previous Message Alvaro Herrera 2025-01-14 16:37:14 Re: refactor AlterDomainAddConstraint (alter domain add constraint)