From: Andrey Stikheev <andrey(dot)stikheev(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Performance degradation in Index searches with special characters
Date: 2024-10-06 17:28:29
Message-ID: CALM5VP-GPRXcxs5BEfyxmsYDsHb4K4y+VST00wZGzizhEVvUsA@mail.gmail.com
Lists: pgsql-performance
Hi, Tom!
Thanks for your feedback. After looking into it further, it seems the
performance issue is indeed related to the default collation settings,
particularly in how glibc's strcoll_l function handles certain special
characters such as '<'. This was confirmed during my testing on Debian 12
with glibc 2.36 (the OS and glibc used in the official PostgreSQL Docker
image: https://hub.docker.com/_/postgres).
My test database settings:
test_db=# \d+ test
Table "public.test"
Column | Type | Collation | Nullable | Default | Storage |
Compression | Stats target | Description
--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------
value | character varying(10) | | not null | | extended
| | |
Indexes:
"idx_test" btree (value)
Access method: heap
test_db=# \l test_db
                                                 List of databases
  Name   |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | Locale | ICU Rules | Access privileges
---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------
 test_db | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |        |           |
(1 row)
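For anyone trying to reproduce this, the schema boils down to roughly the
following (a sketch reconstructed from the \d+ output above, not the
original DDL):

CREATE TABLE test (
    value varchar(10) NOT NULL
);
CREATE INDEX idx_test ON test (value);  -- btree, database default collation (en_US.utf8)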
strcoll_l tests:
root(at)715b19170a89:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root(at)715b19170a89:~# ldd --version
ldd (Debian GLIBC 2.36-9+deb12u8) 2.36
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
root(at)715b19170a89:~# cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <string.h>
int main() {
    char str1[] = "<";
    char *str2 = malloc(65536); // 65535 characters + 1 for null terminator
    if (!str2) return 1;
    memset(str2, '<', 65535);
    str2[65535] = '\0';

    locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL);
    int result = strcoll_l(str1, str2, locale);
    printf("Comparison result: %d\n", result);

    freelocale(locale);
    free(str2);
    return 0;
}
root(at)715b19170a89:~# time ./test
Comparison result: -1
real 0m4.487s
user 0m4.483s
sys 0m0.003s
I'm considering switching to ICU collations, as they might handle this more
efficiently. However, as far as I know, ICU isn't the default collation
provider in PostgreSQL, and switching to it in a live environment isn't a
straightforward process. The main concern is that glibc's default collation
(en_US.UTF-8) is widely used, and this opens up the potential for a Denial
of Service (DoS) attack: if user input includes long strings of repeated
characters like '<', it can severely degrade performance due to the
extended processing time for string comparisons, especially in
high-traffic scenarios.
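
In the meantime, a possible stopgap on our side is to force a byte-wise
comparison for this particular lookup with an explicit COLLATE "C" (just a
sketch, not load-tested; the index name is made up):

CREATE INDEX idx_test_c ON test (value COLLATE "C");

-- The pathological probe from earlier in the thread now descends the
-- index with plain byte comparisons instead of calling strcoll_l:
SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";

For a plain equality lookup the matches are the same as under en_US.utf8
(only the sort order differs between the two collations), and combined with
a length check at the application boundary (the column is varchar(10)
anyway), a long run of '<' should no longer be able to tie up a backend.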
On Sun, 6 Oct 2024 at 19:39, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andrey Stikheev <andrey(dot)stikheev(at)gmail(dot)com> writes:
> > - Changing the collation to 'C' in the query significantly improves
> > performance.
>
> What collation are you using, pray tell? (And what database encoding?)
>
> > - Is this performance degradation expected due to collation handling of
> > certain special characters in PostgreSQL?
>
> It seems like a performance bug indeed, but not ours --- I'm thinking
> it must be in strcoll() or ICU.
> Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
> something similar but not quite as bad:
>
> u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
> ?column?
> ----------
> (0 rows)
>
> Time: 1383.033 ms (00:01.383)
>
> Poking into it with gdb says that the time is indeed spent inside
> strcoll:
>
> #0 get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900 "",
> table=0x7fabb490e2b0, weights=<optimized out>,
> rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
> seq=<synthetic pointer>) at strcoll_l.c:111
> #1 __GI___strcoll_l (s1=0x1785878 "<",
> s2=0x178587a '<' <repeats 200 times>..., l=<optimized out>)
> at strcoll_l.c:338
> #2 0x00000000009527a6 in strncoll_libc (arg1=<optimized out>, len1=1,
> arg2=<optimized out>, len2=65536, locale=<optimized out>,
> locale=<optimized out>) at pg_locale.c:1964
> #3 0x00000000009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
> arg2=0x17958cc '<' <repeats 200 times>..., len2=65536,
> collid=<optimized out>) at varlena.c:1567
> #4 0x00000000009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at varlena.c:1820
> #5 0x00000000009d75fa in FunctionCall2Coll (
> flinfo=flinfo(at)entry=0x7ffddd3b10e8, collation=<optimized out>,
> arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1161
> #6 0x0000000000594948 in _bt_compare (rel=0x7fabcde7eed0,
> key=0x7ffddd3b10c0,
> page=<optimized out>, offnum=<optimized out>) at nbtsearch.c:762
> #7 0x0000000000594e32 in _bt_binsrch (rel=rel(at)entry=0x7fabcde7eed0,
> key=key(at)entry=0x7ffddd3b10c0, buf=<optimized out>) at nbtsearch.c:394
>
> It's not the fault of the index machinery, because a single comparison
> takes the same amount of time:
>
> u8=# select '<' <= repeat('<', 65536);
> ?column?
> ----------
> t
> (1 row)
>
> Time: 1391.550 ms (00:01.392)
>
> I didn't try it with ICU.
>
> regards, tom lane
>
--
Best regards,
Andrey Stikheev