Re: POSIX regex performance bug in 7.3 Vs. 7.2

From:	Sean Chittenden <sean(at)chittenden(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jon Jensen <jon(at)endpoint(dot)com>, Neil Conway <neilc(at)samurai(dot)com>, wade <wade(at)wavefire(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: POSIX regex performance bug in 7.3 Vs. 7.2
Date:	2003-02-04 19:02:25
Message-ID:	20030204190225.GD15936@perrin.int.nxad.com
Lists:	pgsql-hackers

> > It would be a delight to be able to use more advanced (IMHO) Perl-
> > compatible regexes in PostgreSQL.
>
> After some further research, pcre does seem like an interesting
> alternative. Both pcre and Spencer's new code have essentially
> Berkeley-style licenses, so there's no problem there. Some relevant
> comparisons:
>
> 1. pcre tries to be exactly compatible with Perl, so details of its
> regex flavor will be familiar to many more people than the Tcl flavor
> (by and large the features are similar, but there are differences).

pcre is LGPL, iirc. Ruby went off and wrote an explicitly BSD-licensed
regexp engine to replace its GPL'ed Perl/pcre-based bits.

> 2. pcre is already distributed as a nice tidy library; we need not
> extract code from the Tcl distribution.

http://www.ruby-lang.org/cgi-bin/cvsweb.cgi/oniguruma/

> 3. pcre is actively maintained (although tracking a new release every
> couple months may not be something we really want to do, anyway).
> AFAICT Henry's not doing anything much with his code, so it'd be
> pretty much take-once-and-maintain-for-ourselves.

Oniguruma is pretty well maintained given that it's Ruby's regexp
engine, and has the perk of being maintained outside of Ruby as a
standalone module that gets periodically imported.

> 4. pcre looks like it's probably *not* as well suited to a multibyte
> environment. In particular, I doubt that its UTF8 compile option
> was even turned on for the performance comparison Neil cited --- and
> the man page only promises "experimental, incomplete support for
> UTF-8 encoded strings". The Tcl code by contrast is used only in a
> multibyte environment, so that's the supported, optimized path. It
> doesn't even assume null-terminated strings (yay).

Oniguruma only supports ASCII, UTF-8, EUC-JP, and Shift_JIS, but
boasts being 10-20% faster than PCRE for ASCII (no clue about
multi-byte character sets). In terms of development/API, it supports
the GNU regex, POSIX, and Oniguruma APIs (the last is what Ruby uses
to hook in).
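
For anyone who hasn't used it, the POSIX interface mentioned above is
the regcomp()/regexec() pair from <regex.h>. Below is a minimal sketch
against the system libc implementation, not Oniguruma itself;
posix_match() is a hypothetical helper name, and the assumption is
that a POSIX-compatible wrapper would be driven in roughly the same
way:

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any library): compile an extended
 * POSIX regex and test it against a null-terminated subject string.
 * Returns 1 on match, 0 on no match, -1 on a compile or runtime error.
 */
int posix_match(const char *pattern, const char *subject)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: we only want a yes/no answer, no submatch offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    if (rc == REG_NOMATCH)
        return 0;
    return -1;
}
```

Note that this interface takes null-terminated C strings for both the
pattern and the subject, which is relevant to point 4 above.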

Just another option to add to the table. I don't know if it fully fits
our requirements, but it is actively being developed by resources
outside of this project, and support for 16-bit and 32-bit encodings
(UCS-2, UCS-4, UTF-16) is on its TODO list, so it might be nice to
keep this in mind and let Ruby maintain it instead of PostgreSQL.

-sc

--
Sean Chittenden

In response to

Re: POSIX regex performance bug in 7.3 Vs. 7.2 at 2003-02-04 18:21:31 from Tom Lane
