From: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [POC] verifying UTF-8 using SIMD instructions |
Date: | 2021-02-20 21:10:58 |
Message-ID: | CAFBsxsFgKt3ktbnghM_5LyTXEov5+XNx5cJ+E6AbL+3Rh-XKcw@mail.gmail.com |
Lists: | pgsql-hackers |
I made some substantial improvements in v5, and I've taken care of all my
TODOs below. I split the non-UTF-8 ASCII fast path into a separate
patch, since it's kind of off-topic, and it's not yet clear it's always the
best thing to do.
> - It takes almost no recognizable code from simdjson, but it does take
> the magic constants lookup tables almost verbatim. The main body of the
> code has no intrinsics at all (I think). They're all hidden inside static
> inline helper functions. I reused some cryptic variable names from
> simdjson. It's a bit messy but not terrible.
In v5, the lookup tables and their comments are cleaned up and modified to
play nice with pgindent.
> - It diffs against the noError conversion patch and adds additional tests.
I wanted to get some cfbot testing, so I went ahead and prepended v4 of
Heikki's noError patch so it would apply against master.
> - There is no ascii fast-path yet. With this algorithm we have to be a
> bit more careful since a valid ascii chunk could be preceded by an
> incomplete sequence at the end of the previous chunk. Not too hard, just a
> bit more work.
v5 adds an ascii fast path.
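To illustrate the hazard described in the quoted item, here is a minimal scalar sketch. It is not code from the patch: the 16-byte chunk size, the function names, and the result enum are all assumptions made for the example. The idea is that an all-ASCII chunk can only be skipped if the previous chunk did not end in the middle of a multibyte sequence.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 16           /* assumed chunk width, for illustration only */

/* Hypothetical result codes for the sketch. */
typedef enum
{
    CHUNK_NEEDS_FULL_CHECK,     /* chunk has non-ASCII bytes, verify normally */
    CHUNK_ASCII_OK,             /* all ASCII and nothing pending, safe to skip */
    CHUNK_INVALID               /* ASCII follows an unfinished sequence */
} chunk_result;

/* True if no byte in the 16-byte chunk has its high bit set. */
static bool
chunk_is_ascii(const unsigned char *chunk)
{
    uint64_t    lo;
    uint64_t    hi;

    memcpy(&lo, chunk, 8);
    memcpy(&hi, chunk + 8, 8);
    return ((lo | hi) & UINT64_C(0x8080808080808080)) == 0;
}

/*
 * True if the chunk ends partway through a multibyte sequence, i.e. a lead
 * byte sits too close to the end for all of its continuation bytes to fit.
 */
static bool
chunk_ends_incomplete(const unsigned char *chunk)
{
    return chunk[CHUNK_SIZE - 1] >= 0xC0 ||    /* any lead byte in last position */
        chunk[CHUNK_SIZE - 2] >= 0xE0 ||       /* 3- or 4-byte lead, 2nd to last */
        chunk[CHUNK_SIZE - 3] >= 0xF0;         /* 4-byte lead, 3rd to last */
}

/*
 * Decide whether the current chunk can take the ASCII fast path, given
 * whether the previous chunk ended with an incomplete sequence.
 */
static chunk_result
try_ascii_fast_path(const unsigned char *chunk, bool prev_incomplete)
{
    if (!chunk_is_ascii(chunk))
        return CHUNK_NEEDS_FULL_CHECK;
    return prev_incomplete ? CHUNK_INVALID : CHUNK_ASCII_OK;
}
```

A caller would presumably record chunk_ends_incomplete() after each fully verified chunk and pass it in as prev_incomplete for the next one; that bookkeeping is the "bit more work".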
> - I had to add a large number of casts to get rid of warnings in the
> magic constants macros. That needs some polish.
This is much nicer now: only one cast is really necessary.
I'm pretty pleased with how it is now, but it could use some thorough
testing for correctness. I'll work on that a bit later.
On my laptop, Clang 10:
master:
 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366
v5:
 chinese | mixed | ascii
---------+-------+-------
     136 |    93 |    54
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v4-0001-Add-noError-argument-to-encoding-conversion-funct.patch | application/octet-stream | 230.6 KB |
v5-0002-Use-SSE-4-for-verifying-UTF-8-text.patch | application/octet-stream | 49.8 KB |
v5-0003-Add-an-ASCII-fast-path-to-non-UTF-8-encoding-veri.patch | application/octet-stream | 3.9 KB |