From: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com |
Cc: | hlinnaka(at)iki(dot)fi, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Supporting SJIS as a database encoding |
Date: | 2016-09-08 06:35:46 |
Message-ID: | 20160908.153546.187438961.horiguchi.kyotaro@lab.ntt.co.jp |
Lists: | pgsql-hackers |
Hello,
At Wed, 07 Sep 2016 16:13:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20160907(dot)161304(dot)112519789(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> Implementing radix tree code, then redefining the format of mapping table
> to support radix tree, then modifying mapping generator script are needed.
>
> If no one opposes this, I'll do that.
So, I did that as a PoC. The radix tree takes a little less than
100k bytes (far smaller than expected :) and it is definitely
faster than binsearch.
The attached patch does the following things.
- Defines a struct for static radix tree
(utf_radix_tree). Currently it supports up to 3-byte encodings.
- Adds a map generator script UCS_to_SJIS_radix.pl, which
generates utf8_to_sjis_radix.map from utf8_to_sjis.map.
- Adds a new conversion function utf8_to_sjis_radix.
- Modifies UtfToLocal so as to allow map to be NULL.
- Modifies utf8_to_sjis to use the new conversion function
instead of ULmapSJIS.
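To illustrate the idea for readers who have not opened the patch, here is a minimal sketch of a byte-indexed radix tree for 3-byte UTF-8 sequences. This is not the utf_radix_tree struct from the attached patch: the demo_* names, the three-level layout, and the tiny node capacity are all assumptions made for the example, and the real table is generated statically by the script rather than built at run time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_FANOUT    64   /* continuation bytes 0x80..0xBF */
#define DEMO_MAX_NODES 8    /* tiny capacity, enough for a demo */

/* Hypothetical layout, NOT the struct defined in the patch. */
typedef struct demo_radix_tree
{
    uint8_t  l1[16];                            /* lead byte 0xE0..0xEF -> level-2 node */
    uint8_t  l2[DEMO_MAX_NODES][DEMO_FANOUT];   /* second byte -> leaf node */
    uint16_t leaf[DEMO_MAX_NODES][DEMO_FANOUT]; /* third byte -> 2-byte SJIS code */
    int      n_l2;                              /* next free level-2 node */
    int      n_leaf;                            /* next free leaf node */
} demo_radix_tree;

/* Node 0 at every level is reserved and stays all-zero, meaning "no mapping". */
static void
demo_radix_init(demo_radix_tree *t)
{
    memset(t, 0, sizeof(*t));
    t->n_l2 = 1;
    t->n_leaf = 1;
}

/* Register one mapping from a 3-byte UTF-8 sequence to an SJIS code. */
static void
demo_radix_insert(demo_radix_tree *t, const unsigned char *u, uint16_t sjis)
{
    assert(u[0] >= 0xE0 && u[0] <= 0xEF);
    assert((u[1] & 0xC0) == 0x80 && (u[2] & 0xC0) == 0x80);

    uint8_t *p1 = &t->l1[u[0] - 0xE0];
    if (*p1 == 0)
    {
        assert(t->n_l2 < DEMO_MAX_NODES);
        *p1 = (uint8_t) t->n_l2++;
    }

    uint8_t *p2 = &t->l2[*p1][u[1] - 0x80];
    if (*p2 == 0)
    {
        assert(t->n_leaf < DEMO_MAX_NODES);
        *p2 = (uint8_t) t->n_leaf++;
    }

    t->leaf[*p2][u[2] - 0x80] = sjis;
}

/*
 * Look up a 3-byte UTF-8 sequence: three array indexings, no comparisons
 * against other map entries.  Returns 0 when there is no mapping.
 */
static uint16_t
demo_radix_lookup(const demo_radix_tree *t, const unsigned char *u)
{
    if (u[0] < 0xE0 || u[0] > 0xEF)
        return 0;
    if ((u[1] & 0xC0) != 0x80 || (u[2] & 0xC0) != 0x80)
        return 0;

    uint8_t n2 = t->l1[u[0] - 0xE0];        /* 0 leads to the empty node */
    uint8_t nl = t->l2[n2][u[1] - 0x80];    /* 0 leads to the empty leaf */

    return t->leaf[nl][u[2] - 0x80];
}
```

With this layout, inserting E3 81 82 (U+3042) with code 0x82A0 and then looking up the same bytes returns 0x82A0, while an unmapped sequence falls through the reserved all-zero nodes and returns 0. The cost of a lookup depends only on the length of the input sequence, not on the number of entries in the map.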
The following remain to be done.
- utf8_to_sjis_radix could be more generic.
- SJIS->UTF8 is not implemented but it would be easily done since
there's no difference in using the radix tree mechanism.
(but the output character is currently assumed to be 2 bytes long)
- It doesn't support 4-byte codes so this is not applicable to
sjis_2004. Extending the radix tree to support 4-byte wouldn't
be hard.
The following is the result of a simple test.
=# create table t (a text); alter table t alter column a storage plain;
=# insert into t values ('... 7130 characters containing (I believe) all characters in SJIS encoding');
=# insert into t values ('... 7130 characters containing (I believe) all characters in SJIS encoding');
# Doing that twice is just my mistake.
$ export PGCLIENTENCODING=SJIS
$ time psql postgres -c '
$ psql -c '\encoding' postgres
SJIS
<Using radix tree>
$ time psql postgres -c 'select t.a from t, generate_series(0, 9999)' > /dev/null
real 0m22.696s
user 0m16.991s
sys 0m0.182s
Using binsearch, the result for the same operation was:
real 0m35.296s
user 0m17.166s
sys 0m0.216s
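For comparison, the binsearch side of this measurement can be pictured roughly as below. This is only a sketch under assumptions: the demo_* names and the three-entry map are made up for illustration and are not the contents of ULmapSJIS, though the idea of a sorted array of (UTF-8 value, local code) pairs searched by key is the same.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical map entry: UTF-8 bytes packed into an integer, and the SJIS code. */
typedef struct demo_utf_to_local
{
    uint32_t utf;
    uint32_t code;
} demo_utf_to_local;

/* Must be sorted by utf in ascending order for bsearch() to work. */
static const demo_utf_to_local demo_map[] = {
    {0xE38182, 0x82A0},     /* U+3042 */
    {0xE697A5, 0x93FA},     /* U+65E5 */
    {0xE69CAC, 0x967B},     /* U+672C */
};

/* Compare the search key against the utf field of one map entry. */
static int
demo_cmp(const void *key, const void *elem)
{
    uint32_t k = *(const uint32_t *) key;
    uint32_t e = ((const demo_utf_to_local *) elem)->utf;

    return (k > e) - (k < e);
}

/* Returns the mapped code, or 0 when the key is not in the map. */
static uint32_t
demo_bsearch_lookup(uint32_t utf)
{
    const demo_utf_to_local *hit =
        bsearch(&utf, demo_map,
                sizeof(demo_map) / sizeof(demo_map[0]),
                sizeof(demo_map[0]), demo_cmp);

    return hit ? hit->code : 0;
}
```

Each lookup here costs a number of comparisons that grows with the logarithm of the map size, on the order of a dozen for a map with several thousand entries, which is the per-character work that a byte-indexed tree avoids.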
Returning the result in UTF-8 bloats the result string by about 1.5
times, so comparing against it doesn't seem to make much sense. For
reference, though, it takes real = 47.35s.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment | Content-Type | Size |
---|---|---|
0001-Use-radix-tree-to-encoding-characters-of-Shift-JIS.patch.gz | application/octet-stream | 42.4 KB |