From: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: make_greater_string() does not return a string in some cases |
Date: | 2011-07-08 09:21:16 |
Message-ID: | 20110708.182116.44187733.horiguchi.kyotaro@oss.ntt.co.jp |
Lists: | pgsql-bugs pgsql-hackers |
Hello. May I continue with this topic?
For those of us using CJK (Chinese, Japanese, and Korean)
characters in a database, this glitch is hard to ignore.
For Japanese under standard usage, roughly a hundred characters
out of seven thousand make make_greater_string() fail. That is
not frequent, but it is not rare enough to ignore either.
I think this glitch arises because the way to derive the `next
character' is fundamentally specific to each encoding, but
make_greater_string() currently applies a single method,
extrapolated from the 1-byte ASCII charset, to all encodings
alike. So I think it is reasonable for the encoding info table
(struct pg_wchar_tbl) to hold a function that does this.
How about this idea?
The points needed to realize this are:
- pg_wchar_tbl in pg_wchar.c gets a new member, `charinc', which
holds a function that increments a character of that encoding.
- By default, charinc points to a `generic' increment function
that does what make_greater_string() does in the current
implementation.
- make_greater_string() now calls the database encoding's charinc
to increment characters, instead of the code written directly
into it.
- UTF-8 gets a special increment function.
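To make the shape of the hook concrete, here is a minimal sketch
(not the attached patch itself; the typedef and table names are my
own illustration of the points above):

```c
#include <stdbool.h>

/*
 * Hypothetical signature for a per-encoding character incrementer:
 * bump the (possibly multi-byte) character at charptr, whose length
 * is len bytes.  Return true on success, false if no greater
 * character of the same length exists.
 */
typedef bool (*mbcharacter_incrementer)(unsigned char *charptr, int len);

/*
 * Generic increment, roughly what make_greater_string() does today:
 * bump the last byte; on overflow reset it to 0 and carry into the
 * previous byte.
 */
static bool
generic_charinc(unsigned char *charptr, int len)
{
    unsigned char *lastbyte = charptr + len - 1;

    while (len > 0)
    {
        if (*lastbyte < 0xFF)
        {
            (*lastbyte)++;
            return true;
        }
        *lastbyte = 0x00;       /* overflow: reset and carry */
        lastbyte--;
        len--;
    }
    return false;               /* every byte was 0xFF */
}

/* Simplified stand-in for pg_wchar_tbl, extended with charinc. */
typedef struct
{
    const char *name;
    mbcharacter_incrementer charinc;
} encoding_info;

static const encoding_info encodings[] = {
    {"SQL_ASCII", generic_charinc},
    /* {"UTF8", utf8_charinc},  <- encoding-specific version */
};
```

make_greater_string() would then just look up the database
encoding's charinc and call it, with no byte-level knowledge of
its own.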
As a consequence of this modification, make_greater_string()
becomes somewhat simpler, since the sequence that manipulates raw
bytes in the string disappears. And incrementing a character with
knowledge of the encoding can be straightforward, light, and
backtrack-free, with fewer glitches than the generic method.
# Disappointingly, though, the processing for BYTEAOID remains.
Some glitches still remain, but I think it would be overkill to do
a conversion that changes the byte length of the character. Only 5
code points out of 17 thousand (roughly all BMP characters, with
the current method) still fail, and none of them are Japanese
characters :-)
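The remaining failures are exactly the code points whose successor
needs more bytes. A UTF-8-specific incrementer along these lines
could work by code point, re-encoding only when the length stays
the same (again a sketch with illustrative names, not the patch
itself; the surrogate handling is my own assumption):

```c
#include <stdbool.h>

/* Decode one UTF-8 character of len bytes into a code point. */
static unsigned int
utf8_decode(const unsigned char *s, int len)
{
    unsigned int cp;
    int          i;

    switch (len)
    {
        case 1:  cp = s[0];        break;
        case 2:  cp = s[0] & 0x1F; break;
        case 3:  cp = s[0] & 0x0F; break;
        default: cp = s[0] & 0x07; break;
    }
    for (i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    return cp;
}

/* Re-encode cp into buf, but only if it still fits in len bytes. */
static bool
utf8_encode(unsigned char *buf, int len, unsigned int cp)
{
    switch (len)
    {
        case 1:
            if (cp > 0x7F)
                return false;
            buf[0] = (unsigned char) cp;
            return true;
        case 2:
            if (cp > 0x7FF)
                return false;
            buf[0] = 0xC0 | (cp >> 6);
            buf[1] = 0x80 | (cp & 0x3F);
            return true;
        case 3:
            if (cp > 0xFFFF)
                return false;
            buf[0] = 0xE0 | (cp >> 12);
            buf[1] = 0x80 | ((cp >> 6) & 0x3F);
            buf[2] = 0x80 | (cp & 0x3F);
            return true;
        default:
            if (cp > 0x10FFFF)
                return false;
            buf[0] = 0xF0 | (cp >> 18);
            buf[1] = 0x80 | ((cp >> 12) & 0x3F);
            buf[2] = 0x80 | ((cp >> 6) & 0x3F);
            buf[3] = 0x80 | (cp & 0x3F);
            return true;
    }
}

/*
 * UTF-8-aware increment: step to the next code point, skipping the
 * UTF-16 surrogate range, without changing the byte length.  Fails
 * only when the next code point would need more bytes.
 */
static bool
utf8_charinc(unsigned char *charptr, int len)
{
    unsigned int cp = utf8_decode(charptr, len) + 1;

    if (cp >= 0xD800 && cp <= 0xDFFF)   /* no surrogates in UTF-8 */
        cp = 0xE000;
    return utf8_encode(charptr, len, cp);
}
```

For example, U+3042 simply becomes U+3043 in place, with no
backtracking; only a boundary like U+007F (whose successor needs
two bytes) still fails.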
The attached patch is a sample implementation of this idea.
What do you think about it?
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment | Content-Type | Size |
---|---|---|
unknown_filename | text/plain | 16.2 KB |