Re: Performance degradation in Index searches with special characters

From: Shiv Iyer <shiv(at)minervadb(dot)com>
To: Andrey Stikheev <andrey(dot)stikheev(at)gmail(dot)com>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Performance degradation in Index searches with special characters
Date: 2024-10-06 20:02:01
Message-ID: CAALLqt9Cunmhi0TqhjgvV2Z+fgkzcaM8WMUuLTcrOetQn3ze4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi Andrey,

I have tried my best to answer your queries below:

### Performance Degradation with Special Characters in PostgreSQL

#### **Explanation**:
The performance degradation you're experiencing when using special
characters like `<`, `(at)`, `#`, etc., is likely due to how PostgreSQL
handles **collations**. By default, PostgreSQL uses locale-aware collations
(typically UTF-8) for string comparisons, which involve complex character
sorting and encoding. Special characters may be treated differently under
certain locales, resulting in slower comparisons, especially with longer
strings.

#### **Key Observations**:
1. **Alphabetic vs. Special Characters**:
- Queries with alphabetic characters perform significantly faster than
those with special characters.
- This is due to differences in how the collation handles comparisons
for alphabetic and non-alphabetic characters. Special characters often
require more computational resources for comparison, resulting in longer
execution times.

2. **Impact of Collation**:
- When you switched the collation to **"C"**, the performance improved
substantially. This is because the `"C"` collation uses **byte-wise**
comparisons rather than locale-aware sorting, which simplifies the
comparison logic, especially for special characters.

3. **String Length Impact**:
- As the string length increases, the performance degrades exponentially
when using special characters. This is due to the collation’s computational
complexity for each additional character comparison.

#### **Recommendations**:

1. **Use the "C" Collation**:
- The most effective improvement you observed came from using the `"C"`
collation, which performs simple byte-by-byte comparisons. You can apply
the `"C"` collation at the query level or alter the column definition:
```sql
SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
```
or
```sql
ALTER TABLE test ALTER COLUMN value SET DATA TYPE VARCHAR(10) COLLATE
"C";
```

2. **Partial Indexing**:
- Create an index with a specific collation for certain characters. This
allows PostgreSQL to use the optimized comparison method only when dealing
with special characters:
```sql
CREATE INDEX idx_test_special_chars ON test (value COLLATE "C");
```

3. **Reduce String Length**:
- If possible, reduce the string length for comparisons involving
special characters. The performance impact grows with longer strings due to
the nature of locale-aware collation.

4. **Use Functional Indexes**:
- You can use functional indexes that apply transformations (e.g.,
trimming special characters or converting to ASCII) before comparison:
```sql
CREATE INDEX idx_test_transformed ON test (lower(value));
```

5. **Optimize Queries**:
- For repeated queries involving long strings with special characters,
try caching results or using materialized views where the collation
overhead is reduced by pre-computing.

6. **Custom Collations**:
- If you need locale-aware sorting but can compromise on certain
aspects, consider creating a **custom collation** that simplifies special
character handling.

#### **Answers to Your Questions**:
1. **Is the performance degradation expected due to collation handling of
certain special characters?**
- Yes, this behavior is expected. Locale-aware collations can be complex
for special characters, leading to longer comparison times.

2. **Are there any recommendations to improve performance without changing
the column or database collation?**
- Without changing the collation entirely, you can use:
- **Query-level collation adjustments** (using the `"C"` collation in
specific queries).
- **Partial indexes** with the `"C"` collation.
- **Functional indexes** or optimizations like materialized views.

On Sun, Oct 6, 2024 at 3:53 PM Andrey Stikheev <andrey(dot)stikheev(at)gmail(dot)com>
wrote:

> Dear PostgreSQL Community,
>
> I am facing significant performance issues when executing queries that
> involve string comparisons with special characters, such as <, #, !, @,
> etc., especially when dealing with long strings. The query execution time
> increases drastically when these characters are used, whereas queries with
> alphabetic characters do not show such degradation. This behavior is
> observed both on macOS (using the official postgres:17 image via Docker)
> and on an Ubuntu 20.04 server running PostgreSQL in an LXC container.
>
> Here is a minimal example:
>
> testdb=# SELECT version();
>
> version
>
>
>
> ---------------------------------------------------------------------------------------------------------------------------
>
> PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu,
> compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
>
> (1 row)
>
> testdb=# CREATE TABLE test (value VARCHAR(10) NOT NULL);
> CREATE TABLE
> Time: 3.562 ms
>
> testdb=# CREATE INDEX idx_test ON test (value);
> CREATE INDEX
> Time: 3.080 ms
>
> testdb=# INSERT INTO test (value) VALUES ('<');
> INSERT 0 1
> Time: 3.365 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
> ?column?
> ----------
> (0 rows)
> Time: 4454.535 ms (00:04.455)
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('a', 65536);
> ?column?
> ----------
> (0 rows)
> Time: 3.772 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) || 'a';
> ?column?
> ----------
> (0 rows)
> Time: 4.352 ms
>
>
> Time: 9.503 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
>
> ?column?
>
> ----------
>
> (0 rows)
>
>
> Time: 3.299 ms
>
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 8192);
>
> ?column?
>
> ----------
>
> (0 rows)
>
>
> Time: 77.171 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 16384);
>
> ?column?
>
> ----------
>
> (0 rows)
>
>
> Time: 325.190 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 32768);
>
> ?column?
>
> ----------
>
> (0 rows)
>
>
> Time: 1154.850 ms (00:01.155)
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 65536);
>
> ?column?
>
> ----------
>
> (0 rows)
>
>
> Time: 4490.206 ms (00:04.490)
>
>
> testdb=# explain (analyze, verbose, buffers, costs, settings, timing, wal)
> SELECT 1 FROM test WHERE value = repeat('<', 65000);
>
>
> ---------------------------------------------------------------------------------------------
> Bitmap Heap Scan on public.test (cost=4.20..13.67 rows=6 width=4)
> (actual time=4425.459..4425.459 rows=0 loops=1)
> Output: 1
> Recheck Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a
> lot symbols ...
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
> Buffers: shared hit=1
> -> Bitmap Index Scan on idx_test (cost=0.00..4.20 rows=6 width=0)
> (actual time=4425.432..4425.432 rows=0 loops=1)
> Index Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> ... a lot symbols ...<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
> Buffers: shared hit=1
> Planning Time: 1.082 ms
> Execution Time: 4425.602 ms
> (9 rows)
>
> Time: 4433.001 ms (00:04.433)
>
> Observations*:*
>
> - The performance degradation occurs with certain special characters
> like < , !, >, @ , #, ... .
> - Queries using alphabetic characters or appending/prepending
> characters execute much faster.
> - The execution time increases exponentially with the length of the
> string composed of special characters.
> - Changing the collation to 'C' in the query significantly improves
> performance.
>
> Questions*:*
>
>
> - Is this performance degradation expected due to collation handling
> of certain special characters in PostgreSQL?
> - Are there any recommendations to improve performance without
> changing the column or database collation?
>
> --
> Best regards,
> Andrey Stikheev
>

--

*Best Regards *
*Shiv Iyer *

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Thomas Munro 2024-10-06 20:24:15 Re: Performance degradation in Index searches with special characters
Previous Message Tom Lane 2024-10-06 18:13:14 Re: Performance degradation in Index searches with special characters