From: | David Johnston <polobo(at)yahoo(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Patch: regexp_matches variant returning an array of matching positions |
Date: | 2014-01-29 04:16:27 |
Message-ID: | 1390968987243-5789414.post@n5.nabble.com |
Lists: | pgsql-hackers |
Alvaro Herrera-9 wrote
> Björn Harrtell wrote:
>> I've written a variant of regexp_matches called regexp_matches_positions
>> which instead of returning matching substrings will return matching
>> positions. I found use of this when processing OCR scanned text and wanted
>> to prioritize matches based on their position.
>
> Interesting. I didn't read the patch but I wonder if it would be of
> more general applicability to return more info in one fell swoop: a
> function returning a set of (position, length, text of match) rows, rather
> than an array. So instead of first calling one function to get the matches
> and then another to get their positions, do it all in one pass.
>
> (See pg_event_trigger_dropped_objects for a simple example of a function
> that returns in that fashion. There are several others but AFAIR that's
> the simplest one.)
I'm confused as to your thinking. Like regexp_matches, this returns "SETOF
type[]": integer[] in this case, where regexp_matches returns text[]. I could
see adding a generic function that returns a SETOF named composite (match
varchar[], position int[], length int[]) along with the corresponding type.
I can't imagine a situation where you'd want the position but not the text,
so having to evaluate the regexp twice seems wasteful. The length is probably
unnecessary, though, since it can readily be derived from the text and is
less often needed. But if it's pre-calculated anyway...
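To make the composite idea concrete, here is a rough sketch in Python rather than SQL. It uses Python's re module, not PostgreSQL's regex engine, and the names MatchRow and matches_with_positions are made up for illustration; none of this is taken from the patch. It only shows the row shape (match[], position[], length[]) being produced in a single pass over the subject, with 1-based positions assumed:

```python
import re
from typing import Iterator, NamedTuple


class MatchRow(NamedTuple):
    """Hypothetical row shape: one parallel array per attribute."""
    match: list      # text of each parenthesized group
    position: list   # 1-based start of each group within the subject
    length: list     # length of each group's text


def matches_with_positions(subject: str, pattern: str) -> Iterator[MatchRow]:
    """Yield one row per match, evaluating the regexp only once."""
    for m in re.finditer(pattern, subject):
        groups = range(1, m.re.groups + 1)
        yield MatchRow(
            match=[m.group(g) for g in groups],
            # m.start(g) is -1 when a group did not participate in the match
            position=[m.start(g) + 1 if m.start(g) >= 0 else None for g in groups],
            length=[len(m.group(g)) if m.group(g) is not None else None for g in groups],
        )
```

Note that a pattern with no parentheses yields empty arrays in this sketch, whereas regexp_matches returns the whole match in that case.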
My question is: what position is returned in a multiple-match situation? The
supplied test only covers the simple, non-global case. It needs to
exercise empty sub-matches and global searches. One theory is that the
first array slot should hold the global position of match zero (i.e., the
entire pattern) within the larger document, while sub-matches would be
relative offsets within that single match. This conflicts, though, with the
fact that regexp_matches only returns array elements for () items and never
for the full match, the goal of this function being parallel un-nesting. But
as nesting is allowed, it is still possible for a group to cover the full match.
How does this resolve in the patch?
SELECT regexp_matches('abcabc','((a)(b)(c))','g');
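For comparison, the two candidate conventions for that input can be sketched in Python. This uses Python's re module as a stand-in, so it says nothing about what the patch actually returns; group_offsets is a hypothetical helper and 1-based positions are an assumption:

```python
import re


def group_offsets(subject: str, pattern: str) -> list:
    """Return (absolute, relative) 1-based group starts for every match."""
    rows = []
    for m in re.finditer(pattern, subject):
        groups = range(1, m.re.groups + 1)
        # Offsets counted from the start of the whole subject string
        absolute = [m.start(g) + 1 for g in groups]
        # Offsets counted from the start of this particular match
        relative = [m.start(g) - m.start(0) + 1 for g in groups]
        rows.append((absolute, relative))
    return rows


for absolute, relative in group_offsets("abcabc", r"((a)(b)(c))"):
    print(absolute, relative)
```

With absolute offsets the two matches give [1, 1, 2, 3] and [4, 4, 5, 6]; with relative offsets both give [1, 1, 2, 3], so the second match cannot be told apart from the first.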
David J.