From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |
Cc: | a(dot)alekseev(at)postgrespro(dot)ru, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: SCRAM authentication, take three |
Date: | 2017-03-03 05:43:25 |
Message-ID: | CAB7nPqT6+4NU8aV+tpq=3OMkhgJkfLqdn5A=K0cmayjqu4qOaQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Mar 2, 2017 at 9:13 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I'm studying the normalization of Unicode so I apologize possible
> stupidity in advance.
>
> I looked into this and have some comments. Sorry for the random
> order.
Thanks, this needs a lot of background and I am glad that somebody is
taking the time to look at what I am doing here.
> ====
> Composition version should be written some where.
Sure.
> ====
> Perhaps one of the most important things is that the code exactly
> reflects the TR. pg_utf8_check_string returns true for ASCII
> strings but the TR says that
>
> | Text containing only ASCII characters (U+0000 to U+007F) is left
> | unaffected by all of the normalization forms. This is
> | particularly important for programming languages
>
> And running SASLprepare for apparent ASCII string (which is the
> most case) is a waste of CPU cycles.
Yeah, that's true. We could just for example check in
pg_utf8_check_string() if the length gathered matches strlen(source)
as only ASCII are 1-byte long.
> ====
> From the point of view of reflectivity (please someone teach me an
> appropreate wording for this..), basically the code had better to
> be a copy of the reference code as long as no performance
> degradation occurs. Hangul part of get_decomposed_size(and
> decompose_code, recompose_code) uses different naming from the
> TR. hindex should be sindex and t should be tindex. Magic numbers
> should have names in the TR.
>
> * A bit later, I noticed that these are copies of charlint. If so
> I want a description about that.)
Yeah, their stuff works quite nicely.
> ====
> The following comment is equivalent to "reordering in canonical
> order". But the definition of "decomposition" includes this
> step. (I'm not sure if it needs rewriting, since I acutually
> could understand that.)
>
>> /*
>> * Now that the decomposition is done, apply the combining class
>> * for each multibyte character.
>> */
I have reworked a bit this one:
/*
- * Now that the decomposition is done, apply the combining class
- * for each multibyte character.
+ * Now end the decomposition by applying the combining class for
+ * each multibyte character.
*/
> ====
>> * make the allocation of the recomposed string based on that assumption.
>> */
>> recomp_chars = (pg_wchar *) malloc(decomp_size * sizeof(int));
>> lastClass = -1; /* this eliminates a special check */
>
> utf_sasl_prepare uses variable names with two naming
> conventions. Is there any reason for the two?
OK, did some adjustments here.
> ====
>> starterCh = recomp_chars[0] = decomp_chars[0];
>
> starterCh reads as "starter channel" why not "starter_char"?
Starter character of the current set, which is a character with a
combining class of 0.
> ====
>> else if (start >= SBASE && start < (SBASE + SCOUNT) &&
>> ((start - SBASE) % TCOUNT) == 0 &&
>> code >= TBASE && code <= (TBASE + TCOUNT))
>
> "code <= (TBASE + TCOUNT)" somewhat smells. Then I found the
> original code for the current implementation in charlint and it
> seems correct to me. Some description about the difference
> between them is needed.
Right. I have updated all those things to use constants instead of
hardcoded values.
> ====
> In use_sasl_prepare, the recompose part is the copy of charlint
> but it seems a bit inefficient. It calls recompose_code
> unconditionally but it is required only for the case of
> "lastClass < chClass".
>
> Something like this. (This still calls recompose_code for the
> case that ch is the same position with starterChar so there still
> be room for more improvement.)
Agreed.
> ====
> If I read the TR correctly, "6 Composition Exclusion Table" says
> that there some characters not to be composed. But I don't find
> the corresponding code. (or comments)
Ah, right! There are 100 characters that enter in this category, and
all of them have a combining class of 0, so it is as simple as
removing them from the tables generated.
I am attaching 0009 and 0010 that address those problems (pushed on
github as well) that can be applied on top of the latest set.
--
Michael
Attachment | Content-Type | Size |
---|---|---|
0009-Set-of-fixes-for-SASLprep.patch | application/octet-stream | 7.1 KB |
0010-Consider-characters-excluded-from-composition-in-con.patch | application/octet-stream | 14.6 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Kuntal Ghosh | 2017-03-03 05:53:00 | Re: Performance degradation in TPC-H Q18 |
Previous Message | Ashutosh Bapat | 2017-03-03 04:26:56 | Re: Print correct startup cost for the group aggregate. |