Inconsistent results with libc sorting on Windows

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Inconsistent results with libc sorting on Windows
Date: 2023-06-05 22:07:58
Message-ID: 1407a2c0-062b-4e4c-b728-438fdff5cb07@manitou-mail.org
Lists: pgsql-hackers

Hi,

While trying pg16beta1 libc collations on Windows, I noticed that UTF-8
text sometimes sorts differently across invocations with the same
locale, which is wrong since these collations are deterministic.

The OS is Windows 10 Home, version 10.0.19045 Build 19045,
self-built 16beta1 with VS Community 2022, without ICU, default
configuration in postgresql.conf.

It seems to occur more or less randomly with all libc locales except
C/POSIX, and differences seem much more likely when the table has more
rows and the data uses higher codepoints (for example, if all characters
are in [U+0001,U+0400] the sorts never differ with 40k rows, but they do
if there are many more rows or if the range is [U+0001,U+2000]).

Also, it does not occur at all if parallel scan is disabled.

I've come up with a self-contained script that generates random words,
then repeatedly sorts them and feeds the result to md5sum. It takes the
number of rows and the highest Unicode codepoint as arguments, and shows
when the checksums differ across consecutive invocations.
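
For readers without the attachment, the checksum comparison described
above could look roughly like the sketch below. This is not the attached
repro-coll-windows.sh: the function name checksum_loop is made up for
illustration, and the column name "w" and collation name "mycoll" in the
commented usage line are assumptions. Only the table name random_words
appears in the output of the report.

```shell
#!/bin/sh
# Sketch only, not the attached script.
# checksum_loop ITERATIONS COMMAND [ARGS...]
# Runs COMMAND repeatedly, hashes its stdout with md5sum, and reports
# every change of checksum between two consecutive runs.
checksum_loop() {
  iterations=$1; shift
  prev=""; i=1
  while [ "$i" -le "$iterations" ]; do
    # Keep only the hex digest, dropping the trailing "-" file marker.
    cur=$("$@" | md5sum | cut -d' ' -f1)
    printf '%d ' "$i"
    if [ -n "$prev" ] && [ "$cur" != "$prev" ]; then
      printf '\n%s -> %s\n(iter=%d)\n' "$prev" "$cur" "$i"
    fi
    prev=$cur
    i=$((i + 1))
  done
  printf '\n'
}

# Hypothetical usage (column "w" and collation "mycoll" are assumed names):
# checksum_loop 100 psql -XAtq -c 'SELECT w FROM random_words ORDER BY w COLLATE "mycoll"'
```

With a deterministic sort, such a loop should print only the iteration
numbers; any "old -> new" line means two consecutive runs returned the
rows in a different order.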

Here's a typical run showing how it goes wrong after the 14th sort:

$ bash repro-coll-windows.sh 40000 16383
NOTICE: relation "random_words" already exists, skipping
CREATE TABLE
TRUNCATE TABLE
CREATE FUNCTION
DROP COLLATION
CREATE COLLATION
INSERT 0 40000
ANALYZE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
35050d858f4c590788132627e74f62c8 -> e746b626fcc848cbbc67570a7dde03bb
(iter=15)
16
e746b626fcc848cbbc67570a7dde03bb -> 35050d858f4c590788132627e74f62c8
(iter=16)
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
35050d858f4c590788132627e74f62c8 -> 6bf38563d1267339122154bd7d4fbfce
(iter=38)
39
6bf38563d1267339122154bd7d4fbfce -> 35050d858f4c590788132627e74f62c8
(iter=39)
40 41 42 43 44 45 46 47 48 49 50 51
35050d858f4c590788132627e74f62c8 -> 3d2072698054d0bd57beefea0248b7e6
(iter=51)
52
3d2072698054d0bd57beefea0248b7e6 -> 35050d858f4c590788132627e74f62c8
(iter=52)
53 54 55 56 57 58 59 ^C

Would anyone be able to reproduce this? That might be a local problem
although there's nothing special installed AFAICS.
Initially I saw this with a larger dataset that I can't share, and the diffs
between outputs showed that only a few lines out of 2 million lines
were getting displaced across sorts.
It also happens on the same OS with Pg15.3 (EDB build) and the default
libc collation, so I would not immediately suspect new code in Pg16.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachment Content-Type Size
repro-coll-windows.sh application/octet-stream 1.6 KB
