Re: Performance degradation in Index searches with special characters

From: Andrey Stikheev <andrey(dot)stikheev(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Performance degradation in Index searches with special characters
Date: 2024-10-06 17:28:29
Message-ID: CALM5VP-GPRXcxs5BEfyxmsYDsHb4K4y+VST00wZGzizhEVvUsA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi, Tom!

Thanks for your feedback. After looking into it further, it seems the
performance issue is indeed related to the default collation settings,
particularly when handling certain special characters like < in the glibc
strcoll_l function. This was confirmed during my testing on Debian 12 with
glibc version 2.36 (this OS and glibc are being used in our office's Docker
image: https://hub.docker.com/_/postgres)

My test database settings:

test_db=# \d+ test

Table "public.test"

Column | Type | Collation | Nullable | Default | Storage |
Compression | Stats target | Description

--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------

value | character varying(10) | | not null | | extended
| | |

Indexes:

"idx_test" btree (value)

Access method: heap

test_db=# \l test_db

List of databases

Name | Owner | Encoding | Locale Provider | Collate | Ctype |
Locale | ICU Rules | Access privileges

---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------

test_db | postgres | UTF8 | libc | en_US.utf8 | en_US.utf8 |
| |

(1 row)

strcoll_l tests:

root(at)715b19170a89:~# cat /etc/os-release

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

NAME="Debian GNU/Linux"

VERSION_ID="12"

VERSION="12 (bookworm)"

VERSION_CODENAME=bookworm

ID=debian

HOME_URL="https://www.debian.org/"

SUPPORT_URL="https://www.debian.org/support"

BUG_REPORT_URL="https://bugs.debian.org/"

root(at)715b19170a89:~# ldd --version

ldd (Debian GLIBC 2.36-9+deb12u8) 2.36

Copyright (C) 2022 Free Software Foundation, Inc.

This is free software; see the source for copying conditions. There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Written by Roland McGrath and Ulrich Drepper.

root(at)715b19170a89:~# cat test.c

#include <stdio.h>

#include <stdlib.h>

#include <locale.h>

#include <string.h>

int main() {

char str1[] = "<";

char *str2 = malloc(65536); // 65535 characters + 1 for null terminator

if (!str2) return 1;

memset(str2, '<', 65535);

str2[65535] = '\0';

locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL);

int result = strcoll_l(str1, str2, locale);

printf("Comparison result: %d\n", result);

freelocale(locale);

free(str2);

return 0;

}

root(at)715b19170a89:~# time ./test

Comparison result: -1

real 0m4.487s

user 0m4.483s

sys 0m0.003s

I'm considering switching to ICU collations, as they might handle this more
efficiently. However, as I know ICU isn’t the default collation provider in
PostgreSQL, and switching to it in a live environment isn’t a
straightforward process. The main concern is that glibc’s default collation
(en_US.UTF-8) is widely used, and this opens up the potential for a Denial
of Service (DoS) attack. For instance, if user input includes long strings
of repeated characters like <, it can severely degrade performance due to
the extended processing time for string comparisons, especially in
high-traffic scenarios.

On Sun, 6 Oct 2024 at 19:39, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Andrey Stikheev <andrey(dot)stikheev(at)gmail(dot)com> writes:
> > - Changing the collation to 'C' in the query significantly improves
> > performance.
>
> What collation are you using, pray tell? (And what database encoding?)

> > - Is this performance degradation expected due to collation handling
> of
> > certain special characters in PostgreSQL?
>
> It seems like a performance bug indeed, but not ours --- I'm thinking
> it must be in strcoll() or ICU.

> Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
> something similar but not quite as bad:
>
> u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
> ?column?
> ----------
> (0 rows)
>
> Time: 1383.033 ms (00:01.383)
>
> Poking into it with gdb says that the time is indeed spent inside
> strcoll:
>
> #0 get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900
> "",
> table=0x7fabb490e2b0, weights=<optimized out>,
> rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
> seq=<synthetic pointer>) at strcoll_l.c:111
> #1 __GI___strcoll_l (s1=0x1785878 "<",
> s2=0x178587a '<' <repeats 200 times>..., l=<optimized out>)
> at strcoll_l.c:338
> #2 0x00000000009527a6 in strncoll_libc (arg1=<optimized out>, len1=1,
> arg2=<optimized out>, len2=65536, locale=<optimized out>,
> locale=<optimized out>) at pg_locale.c:1964
> #3 0x00000000009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
> arg2=0x17958cc '<' <repeats 200 times>..., len2=65536,
> collid=<optimized out>) at varlena.c:1567
> #4 0x00000000009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at
> varlena.c:1820
> #5 0x00000000009d75fa in FunctionCall2Coll (
> flinfo=flinfo(at)entry=0x7ffddd3b10e8, collation=<optimized out>,
> arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1161
> #6 0x0000000000594948 in _bt_compare (rel=0x7fabcde7eed0,
> key=0x7ffddd3b10c0,
> page=<optimized out>, offnum=<optimized out>) at nbtsearch.c:762
> #7 0x0000000000594e32 in _bt_binsrch (rel=rel(at)entry=0x7fabcde7eed0,
> key=key(at)entry=0x7ffddd3b10c0, buf=<optimized out>) at nbtsearch.c:394
>
> It's not the fault of the index machinery, because a single comparison
> takes the same amount of time:
>
> u8=# select '<' <= repeat('<', 65536);
> ?column?
> ----------
> t
> (1 row)
>
> Time: 1391.550 ms (00:01.392)
>
> I didn't try it with ICU.
>
> regards, tom lane
>

--
Best regards,
Andrey Stikheev

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Joe Conway 2024-10-06 17:45:15 Re: Performance degradation in Index searches with special characters
Previous Message Tom Lane 2024-10-06 16:39:38 Re: Performance degradation in Index searches with special characters