From: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
---|---|
To: | Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu> |
Subject: | Re: speed up verifying UTF-8 |
Date: | 2021-07-26 12:56:52 |
Message-ID: | CAFBsxsH8i1H2Us0StA6WUD9WqBrQhysrk=uxcZdTJSxnSS7T1g@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?
>
> >+ utf8_advance(s, state, len);
> >+
> >+ /*
> >+ * If we saw an error during the loop, let the caller handle it. We treat
> >+ * all other states as success.
> >+ */
> >+ if (state == ERR)
> >+ return 0;
>
> Did you mean state = utf8_advance(s, state, len); there? (reassign state variable)
Yep, that's a bug, thanks for catching!
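To illustrate why the missing assignment matters, here is a minimal sketch. It is not the patch's code: toy_advance() is a hypothetical, simplified stand-in for utf8_advance() that only counts continuation bytes (no overlong or surrogate checks), and the state values are assumptions made for the example. Because the state is passed by value, the caller never sees ERR unless it stores the return value.

```c
#include <stddef.h>

/* Hypothetical states: ERR, BGN, or a positive count of continuation bytes still expected. */
enum { ERR = -1, BGN = 0 };

/* Simplified stand-in for utf8_advance(): takes the state by value, returns the new state. */
static int
toy_advance(const unsigned char *s, int state, int len)
{
    for (int i = 0; i < len && state != ERR; i++)
    {
        unsigned char b = s[i];

        if (state == BGN)
        {
            if (b < 0x80)
                state = BGN;            /* ASCII */
            else if ((b & 0xE0) == 0xC0)
                state = 1;              /* 2-byte lead */
            else if ((b & 0xF0) == 0xE0)
                state = 2;              /* 3-byte lead */
            else if ((b & 0xF8) == 0xF0)
                state = 3;              /* 4-byte lead */
            else
                state = ERR;
        }
        else
            state = ((b & 0xC0) == 0x80) ? state - 1 : ERR;
    }
    return state;
}

/* Mirrors the bug: the returned state is discarded, so ERR is never observed. */
static int
buggy_check(const unsigned char *s, int len)
{
    int state = BGN;

    toy_advance(s, state, len);
    return state != ERR;
}

/* Mirrors the fix: reassign the state from the return value. */
static int
fixed_check(const unsigned char *s, int len)
{
    int state = BGN;

    state = toy_advance(s, state, len);
    return state != ERR;
}
```

With a lone 0xFF byte as input, buggy_check() still reports success while fixed_check() reports failure.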
> >I wanted to try different strides for the DFA
>
> Does that (and "len >= 32" condition) mean the patch does not improve validation of the shorter strings (the ones less than 32 bytes)?
Right. Also, the 32-byte threshold was only a temporary requirement for
testing a 32-byte stride; testing different thresholds wouldn't hurt. I'm
not terribly concerned about short strings, though, as long as we don't
regress. That said, Heikki had something in his v14 [1] that we could use:
+/*
+ * Subroutine of pg_utf8_verifystr() to check one char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)
It would be easy to replace pg_utf8_verifychar with this. It might even
speed up the SQL function length_in_encoding() -- that would be a better
reason to do it.
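For readers who don't have the v14 patch at hand, the following is a rough sketch of what a function with that contract could look like. It is not Heikki's implementation: the name utf8_verify_one_sketch is hypothetical, and the checks simply follow the standard UTF-8 well-formedness ranges. PostgreSQL-specific rules (for example, how a zero byte is treated) are deliberately not modeled here.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: return the byte length of the character at *s,
 * or 0 on invalid input or premature end of input.
 */
static int
utf8_verify_one_sketch(const unsigned char *s, int len)
{
    if (len < 1)
        return 0;

    if (s[0] < 0x80)
        return 1;                       /* ASCII */

    if (s[0] >= 0xC2 && s[0] <= 0xDF)   /* 2-byte; 0xC0 and 0xC1 are overlong */
    {
        if (len < 2 || (s[1] & 0xC0) != 0x80)
            return 0;
        return 2;
    }

    if ((s[0] & 0xF0) == 0xE0)          /* 3-byte */
    {
        if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return 0;
        if (s[0] == 0xE0 && s[1] < 0xA0)
            return 0;                   /* overlong encoding */
        if (s[0] == 0xED && s[1] > 0x9F)
            return 0;                   /* UTF-16 surrogate range */
        return 3;
    }

    if (s[0] >= 0xF0 && s[0] <= 0xF4)   /* 4-byte */
    {
        if (len < 4 || (s[1] & 0xC0) != 0x80 ||
            (s[2] & 0xC0) != 0x80 || (s[3] & 0xC0) != 0x80)
            return 0;
        if (s[0] == 0xF0 && s[1] < 0x90)
            return 0;                   /* overlong encoding */
        if (s[0] == 0xF4 && s[1] > 0x8F)
            return 0;                   /* beyond U+10FFFF */
        return 4;
    }

    return 0;                           /* stray continuation or invalid lead byte */
}
```

A caller handling short strings could loop on this, advancing by the returned length and stopping at the first 0.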
[1]
https://www.postgresql.org/message-id/2f95e70d-4623-87d4-9f24-ca534155f179%40iki.fi
--
John Naylor
EDB: http://www.enterprisedb.com