Re: Radix tree for character conversion

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: hlinnaka(at)iki(dot)fi, daniel(at)yesql(dot)se, peter(dot)eisentraut(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, ishii(at)sraoss(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Radix tree for character conversion
Date: 2017-01-30 06:37:38
Message-ID: 20170130.153738.139030994.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello, this is the revised version of character conversion using radix tree.

At Fri, 27 Jan 2017 17:33:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170127(dot)173357(dot)221584433(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> Hi, this is an intermediate report without a patch.
>
> At Thu, 26 Jan 2017 21:42:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170126(dot)214212(dot)111556326(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > > > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> > > >
> > > > Before adding radix tree stuff, applied pgperltidy and inserted
> > > > format-skipping pragma for the parts where perltidy seems to do
> > > > too much.
> > >
> > > Which version of perltidy did you use? Looking at the archives, the
> > > perl code is cleaned up with a specific version, v20090616. See
> > > https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
> > > for example on the matter. As perltidy changes over time, this may be
> > > a sensitive change if done this way.
>
> My perltidy -v said "v20121207'. Anyway, I gave up to apply
> perltidy by myself. So I'll just drop 0003 and new 0004 (name
> changed to 0003) is made immediately on 0002.

I'm not sure what to handle this so I just removed the perltidy
stuff from this patchset.

> > > > 0004-Use-radix-tree-for-character-conversion.patch
> > > >
> > > > Radix tree body.
> > >
> > > Well, here a lot of diffs could have been saved.
> > >
> > > > The unattached fifth patch is generated by the following steps.
> > > >
> > > > [$(TOP)]$ ./configure
> > > > [Unicode]$ make
> > > > [Unicode]$ make distclean
> > > > [Unicode]$ git add .
> > > > [Unicode]$ commit
> > > > === COMMITE MESSSAGE
> > > > Replace map files with radix tree files.
> > > >
> > > > These encodings no longer uses the former map files and uses new radix
> > > > tree files.
> > > > ===
> > >
> > > OK, I can see that working, with 200k of maps generated.. So going
> > > through the important bits of this jungle..
> >
> > Many thaks for the exploration.
> >
> > > +/*
> > > + * radix tree conversion function - this should be identical to the function in
> > > + * ../conv.c with the same name
> ..
> > > This is not nice. Having a duplication like that is a recipe to forget
> > > about it as this patch introduces a dependency with conv.c and the
> > > radix tree generation.
>
> In the attatched patch, mb/char_conveter.c which contains one
> inline function is created and it is includ'ed from mb/conv.c and
> mb/Unicode/map_checker.c.
>
> > > Having a .gitignore in Unicode/ would be nice, particularly to avoid
> > > committing map_checker.
>
> I missed this. I added .gitignore to ignore map_checker stuff
> and authority files and old-style map files.
>
> > > A README documenting things may be welcome, or at least comments at
> > > the top of map_checker.c. Why is map_checker essential? What does it
> > > do? There is no way to understand that easily, except that it includes
> > > a "radix tree conversion function", and that it performs sanity checks
> > > on the radix trees to be sure that they are on a good shape. But as
> > > this something that one would guess only after looking at your patch
> > > and the code (at least I will sleep less stupid tonight after reading
> > > this stuff).
> >
> > Okay, I'll do that.
>
> The patch has not been provided yet, I'm going to put the
> following comment just before the main() in map_checker.c.
>
> /*
> * The old-style plain map files were error-resistant due to its
> * straight-forward way for generation from authority files. In contrast the
> * radix tree maps are generated by a rather complex calculation and have a
> * complex, hard-to-confirm format.
> *
> * This program runs sanity check of the radix tree maps by confirming all
> * characters in the plain map files to be converted to the same code by the
> * corresponding radix tree map.
> *
> * All map files are included by map_checker.h that is generated by the script
> * make_mapchecker.pl as the variable mappairs.
> *
> */
>
>
> I'll do the following things later.

The following is the continuation.

> > --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> > +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> > # Drop these SJIS codes from the source for UTF8=>SJIS conversion
> > #<<< do not let perltidy touch this
> > -my @reject_sjis =(
> > +my @reject_sjis = (
> > 0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
> > - 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> > + 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> > 0x879a..0x879c
> > -);
> > + );
> > This is not generated, it would be nice to drop the noise from the patch.
>
> Mmm. I'm not sure how this is generated but I'll care for that.

I don't still understand what what the intermediate diff comes
from but copy-n-pasting from master silenced it...

> > Here is another one:
> > - $i->{code} = $jis | (
> > - $jis < 0x100
> > - ? 0x8e00
> > - : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > -
> > +#<<< do not let perltidy touch this
> > + $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
> > + ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > +#>>>
>
> Ok. Will revert this.

The "previous" code (prefixed with a minus sign) is "my"
perltidy's work. Preltidy step is just removed from the patchset.

> > if (l == 2)
> > {
> > - iutf = *utf++ << 8;
> > - iutf |= *utf++;
> > + b3 = *utf++;
> > + b4 = *utf++;
> > }
> > Ah, OK. This conversion is important so as it performs a minimum of
> > bitwise operations. Yes let's keep that. That's pretty cool to get a
> > faster operation.
>
> It is Heikki's work:p
>
> I'll address them and repost the next version sooner.

Finally, the patchset had the following changes from the previous
shape.

- Avoid syntaxes perl 5.8 complains about

- The perltidy step has been removed.

- pg_mb_radix_conv is now a separate .c file (but included from
other c files)

- Added Unicode/.gitignore. The line for [~#] might be needless.

- The patchset are made with --irreversible-delete. This is just
what I wanted (but counldn't find by myself..)

- Semicolon-fix patch(0001) gets several additional fixes.

This patchset consists of four patches. The first two are bug
fixes back-patchable to older versions. The third one is the
patch that adds radix tree feature. The forth one is not attached
to this mail but generatable.

0001-Add-missing-semicolon.patch

Adds missing semicolon found in three files.

0002-Correct-reference-resolution-syntax.patch

Changes reference syntax to more preferable style.

0003-Use-radix-tree-for-character-conversion.patch

Radix tree conversion patch. The size has been reduced from
1.6MB to 91KB.

0004: Replace map files

This is not attached but generatable. This shouldn't fail even
on the environment with perl 5.8.

[$(TOP)]$ ./configure
[$(TOP)]$ cd src/backend/utils/mb/Unicode
[Unicode]$ make distclean maintainer-clean all
[Unicode]$ make mapcheck
...
All radix trees are perfect!
[Unicode]$ make distclean
[Unicode]$ git add .
[Unicode]$ git commit
...

The size of the forth patch was about 7.6MB using --irreversible-delete.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Add-missing-semicolon.patch text/x-patch 2.5 KB
0002-Correct-reference-resolution-syntax.patch text/x-patch 3.7 KB
0003-Use-radix-tree-for-character-conversion.patch text/x-patch 90.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2017-01-30 06:42:27 Re: Create a separate test file for exercising system views
Previous Message Rushabh Lathia 2017-01-30 06:36:29 Re: Query fails when SRFs are part of FROM clause (Commit id: 69f4b9c85f)