From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
Cc: | Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com |
Subject: | Re: WIP: index support for regexp search |
Date: | 2013-01-23 02:08:58 |
Message-ID: | 2459.1358906938@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> I finally got around to look at this. I like this new version, without
> the path matrix, much better.
I looked through this version too. I have some notes/issues:
The biggest problem is that I really don't care for the idea of
contrib/pg_trgm being this cozy with the innards of regex_t. Sooner
or later we are going to want to split the regex code off again as a
standalone library, and we *must* have a cleaner division of labor if
that is ever to happen. Not sure what a suitable API would look like
though.
I think the assumption that all MB characters fit in 4 bytes is
unacceptable; someday we'll want to support wider Unicode characters
than we do now, and this code seems utterly unable to handle it. It's
especially bad that the code isn't even bothering to defend itself
against the possibility of wider characters.
Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version
and an upgrade script.
Comments and documentation still need a lot of copy-editing, also I
think a lot of the comment blocks will not survive pg_indent. It'd be
a good idea to run trgm_regexp.c through pg_indent as soon as you have
that fixed.
New file trgm_regexp.c lacks a copyright notice
Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
enough; need to pass through the actual collation for the regex
operator.
How deep can the recursion in activateState() go? Is this a practical
production approach at all? Do we need a stack depth check there?
addKeys is also recursive, same questions. (But on the other hand, the
check_stack_depth in scanColorMap seems useless, since its recursion
depth is fixed.)
Not too happy with convertPgWchar: aside from hard-wired, unchecked
assumption about maximum length of pg_wchar2mb_with_len result, why is
it that this is doing a lowercase conversion? Surely the regex stuff
dealt with that already?
I'm suspicious of the code in addKeys() that is special-casing bos[1]
and eos[1] but not the other cases (BOL/EOL). My recollection from
working on the fixed-prefix stuff is that it's not always obvious which
of those states gets used.
strnlen() used in fillTrgm() is probably not portable, and is definitely
not used anywhere else in Postgres.
This isn't a complete review, just some stuff I happened to notice in
one quick pass over the code.
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Stephen Frost | 2013-01-23 02:45:37 | Re: lazy_vacuum_heap()'s removal of HEAPTUPLE_DEAD tuples |
Previous Message | Craig Ringer | 2013-01-23 01:28:54 | Re: Patch: clean up addRangeTableEntryForFunction |