| From: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> | 
|---|---|
| To: | pgsql-bugs(at)postgresql(dot)org | 
| Cc: | pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: make_greater_string() does not return a string in some cases | 
| Date: | 2011-07-08 09:21:16 | 
| Message-ID: | 20110708.182116.44187733.horiguchi.kyotaro@oss.ntt.co.jp | 
| Lists: | pgsql-bugs pgsql-hackers | 
Hello. May I continue with this topic?
For those of us using CJK (Chinese, Japanese, and Korean)
characters in a database, this glitch is hard to ignore.
In Japanese under standard usage, roughly a hundred characters
out of seven thousand make make_greater_string() fail. That is
not frequent, but it is not rare enough to ignore either.
I think this glitch arises because deriving the `next character'
is fundamentally encoding-specific, yet make_greater_string()
currently handles all encodings together with a method extended
from that of the single-byte ASCII charset.
So I think it is reasonable for the encoding info table (struct
pg_wchar_tbl) to hold a function that does this.
How about this idea?
The points needed to realize this are as follows:
- pg_wchar_tbl(at)pg_wchar(dot)c gains a new element `charinc'
  that holds a function to increment a character of that encoding.
- By default, charinc points to a `generic' increment function
  that does what make_greater_string() does in the current
  implementation.
- make_greater_string() now calls the database encoding's charinc
  to increment characters, instead of the code written directly
  in it.
- UTF-8 gets a special increment function.
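To make the first points concrete, here is a rough sketch in C of what the table hook and the generic fallback could look like. The struct layout and the carry-on-overflow behavior are my illustrative assumptions, not the patch itself; only the names pg_wchar_tbl and charinc come from the proposal above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical signature for the proposed hook: increment the
 * last character of a string in place, returning false when no
 * greater character exists. */
typedef bool (*mbcharacter_incrementer)(unsigned char *charptr, int len);

/* Illustrative shape of an encoding info table entry carrying
 * the new charinc element. */
typedef struct
{
	const char *name;
	mbcharacter_incrementer charinc;
} pg_wchar_tbl_entry;

/* Generic fallback, in the spirit of what make_greater_string()
 * does today: bump the last byte, carrying leftward on overflow.
 * (On total failure the bytes are left zeroed; a real
 * implementation would restore them.) */
static bool
pg_generic_charinc(unsigned char *charptr, int len)
{
	unsigned char *lastbyte = charptr + len - 1;

	while (len > 0)
	{
		if (*lastbyte < 0xFF)
		{
			(*lastbyte)++;
			return true;
		}
		/* byte overflowed: reset it and carry into the previous byte */
		*lastbyte = 0;
		lastbyte--;
		len--;
	}
	return false;				/* every byte was 0xFF */
}
```

make_greater_string() would then look up the entry for the database encoding and call entry->charinc instead of manipulating bytes inline.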
As a consequence of this modification, make_greater_string()
becomes somewhat simpler, since the sequence that handles bare
bytes in the string disappears.  And incrementing a character
with knowledge of the encoding can be straightforward, light,
and backtrack-free, with fewer glitches than the generic method.
# But the processing for BYTEAOID disappointingly remains.
Some glitches still remain, but I think it would be overkill to
do a conversion that changes the length of the character.  Only 5
code points out of 17 thousand (roughly all BMP characters, with
the current method) remain, and none of them are Japanese
characters :-)
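To illustrate why a UTF-8-aware increment avoids length-changing conversion, here is a sketch that works on the code point directly: decode, add one (skipping the surrogate range), and re-encode only if the result has the same byte length. The helper names and the decision to reject length changes at the single-character level are my assumptions for illustration; this is not the patch's actual function.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Minimal UTF-8 decode for 1- to 3-byte sequences (4-byte form
 * omitted for brevity).  Returns the sequence length, or 0. */
static int
utf8_decode(const unsigned char *s, int len, unsigned int *cp)
{
	if (len >= 1 && s[0] < 0x80)
	{
		*cp = s[0];
		return 1;
	}
	if (len >= 2 && (s[0] & 0xE0) == 0xC0)
	{
		*cp = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		return 2;
	}
	if (len >= 3 && (s[0] & 0xF0) == 0xE0)
	{
		*cp = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		return 3;
	}
	return 0;
}

/* Minimal UTF-8 encode; returns the byte length, or 0. */
static int
utf8_encode(unsigned int cp, unsigned char *s)
{
	if (cp < 0x80)
	{
		s[0] = cp;
		return 1;
	}
	if (cp < 0x800)
	{
		s[0] = 0xC0 | (cp >> 6);
		s[1] = 0x80 | (cp & 0x3F);
		return 2;
	}
	if (cp < 0x10000)
	{
		s[0] = 0xE0 | (cp >> 12);
		s[1] = 0x80 | ((cp >> 6) & 0x3F);
		s[2] = 0x80 | (cp & 0x3F);
		return 3;
	}
	return 0;
}

/* Sketch of a UTF-8-specific charinc: succeed only when the next
 * code point encodes to the same number of bytes. */
static bool
pg_utf8_charinc(unsigned char *charptr, int len)
{
	unsigned int cp;
	unsigned char buf[4];

	if (utf8_decode(charptr, len, &cp) != len)
		return false;
	cp++;
	if (cp >= 0xD800 && cp <= 0xDFFF)	/* skip the surrogate range */
		cp = 0xE000;
	if (utf8_encode(cp, buf) != len)	/* reject length changes */
		return false;
	memcpy(charptr, buf, len);
	return true;
}
```

The few remaining failure points are exactly the boundaries where the successor needs more bytes than the original character (for example, the last 2-byte code point U+07FF).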
The attached patch is a sample implementation of this idea.
What do you think about it?
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
| Attachment | Content-Type | Size | 
|---|---|---|
| unknown_filename | text/plain | 16.2 KB | 