| From: | Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp> | 
|---|---|
| To: | Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> | 
| Cc: | <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: Allow ILIKE forward matching to use btree index | 
| Date: | 2025-01-14 16:40:46 | 
| Message-ID: | 20250115014046.980463d8d2d309d040974e63@sraoss.co.jp | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On Tue, 24 Dec 2024 16:04:42 +0900
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:
> On Fri, 20 Dec 2024 03:22:26 +0900
> Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:
> 
> > Hi,
> > 
> > Currently, btree indexes cannot used for ILIKE (~~*) operator if the pattern
> > has case-varying characters although LIKE (~~) expression can be converted
> > to indexable clauses by the planner support function (if the collation
> > is "C" or operator class 'text_pattern_ops' is used).
> > 
> > For example, "t ~~ '123foo%'" is converted to "(t >= '123foo' AND t < '123fop')" 
> > and index scan can be used for this condition. On the other hand, "t ~~* '123foo'"
> > cannot be converted and sequential scan is used. 
> > 
> > Even in this case, we can use a bitmap index scan for the condition 
> > "(t >= '123f' AND t < '123g') OR "(t >= '123F' AND t < '123G')" followed by
> > recheck by the original condition "t ~~* '123foo'", and this could be faster
> > than seqscan. 
> > 
> > However, even if the support function would return OR clauses, the current
> > planner implementation cannot not build bitmap scan paths using them. 
> > 
> > 
> > The attached patches allow to ILIKE (~~*) forward matching to use btree index. 
> > 
> > The patch 0001 enables get_index_clause_from_support to receive OR clauses
> > from support functions and use them to build bitmap index scan later. OR clauses
> > returned by support functions are collected in the new argument 'orclauses" of
> > match_restriction_clauses_to_index(), and they are passed to
> > generate_bitmap_or_paths() later to build bitmap scan paths.
> > 
> > The patch 0002 modifies the support function to return OR clauses as an example
> > above when ILIKE's pattern has case-varying characters in forward matching. The
> > OR clauses contains two index conditions for the upper and the lower letter of
> > the first case-varying character in the pattern respectively. Although the
> > subsequent characters are not considered in the index scans, it could enough be
> > faster then sequent scan.
> > 
> > Here is an example. 
> > 
> > 1. Create a table with random text records
> > 
> > =# CREATE TABLE tbl (t text);
> > =# INSERT INTO tbl SELECT CASE WHEN i%2=1 THEN upper(x) ELSE x END
> >                   FROM (SELECT i, md5(i::text) x FROM generate_series(1,5000000) i);
> > 
> > 2. Create an index
> > =# CREATE INDEX ON tbl (t text_pattern_ops);
> > 
> > 3. Before applying patch: Seq Scan was selected. It takes about 4 sec.
> > 
> > =# EXPLIN ANALYZE SELECT * FROM tbl WHERE t ILIKE '12ab%';
> 
> I am sorry, the example in my previous main was wrong. I showed the plan
> with enable_index_scan = off. Indead, if the pattern starts with
> case-insensitive characters like '12', an index scan can be used even with
> ILIKE. 
> 
> postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
>                                                        QUERY PLAN                                                        
> -------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=1000.00..69129.61 rows=500 width=33) (actual time=36.317..4034.770 rows=1188 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    Buffers: shared hit=99 read=41939
>    ->  Parallel Seq Scan on tbl  (cost=0.00..68079.61 rows=208 width=33) (actual time=19.908..4023.668 rows=396 loops=3)
>          Filter: (t ~~* 'abc%'::text)
>          Rows Removed by Filter: 1666271
>          Buffers: shared hit=99 read=41939
>  Planning Time: 0.726 ms
>  Execution Time: 4035.101 ms
> (10 rows)
> 
> 4. After applying patch: Bitmap Index Scan was selected. 
> 
> postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
>                                                                           QUERY PLAN                                                                          
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on tbl  (cost=12563.66..58314.53 rows=500 width=33) (actual time=156.485..1266.789 rows=1188 loops=1)
>    Recheck Cond: (((t ~>=~ 'A'::text) AND (t ~<~ 'B'::text) AND (t ~~* 'abc%'::text)) OR ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text) AND (t ~~* 'abc%'::text)))
>    Filter: (t ~~* 'abc%'::text)
>    Rows Removed by Filter: 311473
>    Heap Blocks: exact=42010
>    Buffers: shared hit=1 read=44600
>    ->  BitmapOr  (cost=12563.66..12563.66 rows=297029 width=0) (actual time=136.979..136.980 rows=0 loops=1)
>          Buffers: shared hit=1 read=2590
>          ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..6281.71 rows=148515 width=0) (actual time=80.548..80.549 rows=158375 loops=1)
>                Index Cond: ((t ~>=~ 'A'::text) AND (t ~<~ 'B'::text))
>                Buffers: shared read=1272
>          ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..6281.71 rows=148515 width=0) (actual time=56.427..56.427 rows=157042 loops=1)
>                Index Cond: ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text))
>                Buffers: shared hit=1 read=1318
>  Planning Time: 0.592 ms
>  Execution Time: 1267.162 ms
> (16 rows)
> 
>  
> > What do you think about it?
> > 
> > I think another application of OR-clause returning support function might be
> > allowing to use an index for NOT LIKE (!~~) expression  because, for example, 
> > "t !~~ '123foo%'" could be converted to "(t < '123foo' OR t >= '123fooz')". 
> > (The second condition is a bit loose but this would be safe and not problem
> > since tuples are filtered by the original condition after the index scan.) 
> > However, it would not very useful unless the distribution is much skew so that
> > NOT LIKE expression's selectivity is enough small.
I added a new patch 0003 that enables ILIKE forward matching to use a SP-Gist index.
Similar to 0002, this generates BitmapOr Index Scan for two index clauses that use
"^@" operator for upper letter and lower letter pattern respectively. 
* Before applying the patch:
postgres=# explain analyze select* from tbl where t ilike 'abc%';
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..18899.52 rows=101 width=33) (actual time=41.799..930.382 rows=253 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=96 read=12533
   ->  Parallel Seq Scan on tbl  (cost=0.00..17889.42 rows=42 width=33) (actual time=26.041..917.570 rows=84 loops=3)
         Filter: (t ~~* 'abc%'::text)
         Rows Removed by Filter: 336582
         Buffers: shared hit=96 read=12533
 Planning Time: 0.690 ms
 Execution Time: 930.545 ms
(10 rows)
* After applying the patch:
postgres=# explain analyze select* from tbl where t ilike 'abc%';
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=3307.96..16702.11 rows=101 width=33) (actual time=69.449..215.636 rows=253 loops=1)
   Recheck Cond: (((t ^@ 'A'::text) AND (t ~~* 'abc%'::text)) OR ((t ^@ 'a'::text) AND (t ~~* 'abc%'::text)))
   Filter: (t ~~* 'abc%'::text)
   Rows Removed by Filter: 62992
   Heap Blocks: exact=12437
   Buffers: shared hit=18529
   ->  BitmapOr  (cost=3307.96..3307.96 rows=61212 width=0) (actual time=62.567..62.568 rows=0 loops=1)
         Buffers: shared hit=6092
         ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..1653.96 rows=30606 width=0) (actual time=41.893..41.893 rows=31536 loops=1)
               Index Cond: (t ^@ 'A'::text)
               Buffers: shared hit=2461
         ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..1653.96 rows=30606 width=0) (actual time=20.671..20.671 rows=31709 loops=1)
               Index Cond: (t ^@ 'a'::text)
               Buffers: shared hit=3631
 Planning Time: 1.414 ms
 Execution Time: 215.903 ms
(16 rows)
-- 
Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
| Attachment | Content-Type | Size | 
|---|---|---|
| v2-0003-Allow-ILIKE-forward-matching-to-use-spgist-index.patch | text/x-diff | 3.3 KB | 
| v2-0002-Allow-ILIKE-forward-matching-to-use-btree-index.patch | text/x-diff | 11.3 KB | 
| v2-0001-Allow-planner-support-function-to-return-index-co.patch | text/x-diff | 14.5 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Nathan Bossart | 2025-01-14 16:57:51 | Re: [PATCH] Hex-coding optimizations using SVE on ARM. | 
| Previous Message | Alvaro Herrera | 2025-01-14 16:37:14 | Re: refactor AlterDomainAddConstraint (alter domain add constraint) |