From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | "David Fetter" <david(at)fetter(dot)org> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: regexp_positions() |
Date: | 2021-02-28 03:58:05 |
Message-ID: | 4d4a4d83-efae-414b-878d-46e1b9d64738@www.fastmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On Sun, Feb 28, 2021, at 03:13, David Fetter wrote:
> Maybe an int4multirange, which would fit unless I'm misunderstanding
> g's meaning with respect to non-overlapping patterns, but that might
> be a little too cute and not easy ever to extend.
>
> Come to that, would a row structure that looked like
>
> (match, start, end)
>
> be useful?
Nice, didn't know about the new multirange.
Its data structure seems like a perfect fit for this.
Hmm, I cannot find any catalog function to extract the ranges from the data structure though?
As a caller, I might need the exact start/end values,
not just wanting to know if a certain values overlaps any of the ranges.
Is there such a function?
Here is a PoC that just returns the start_pos and end_pos for all the matches.
It would be simple to modify it to instead return multirange.
CREATE OR REPLACE FUNCTION regexp_ranges(string text, pattern text, OUT start_pos integer, OUT end_pos integer)
RETURNS SETOF RECORD
LANGUAGE plpgsql
AS
$$
DECLARE
match text;
remainder text := string;
len integer;
BEGIN
end_pos := 0;
--
-- Ignore possible capture groups, instead just wrap the entire regex
-- in an enclosing capture group, which is then extracted as the first array element.
--
FOR match IN SELECT (regexp_matches(string,format('(%s)',pattern),'g'))[1] LOOP
len := length(match);
start_pos := position(match in remainder) + end_pos;
end_pos := start_pos + len - 1;
RETURN NEXT;
remainder := right(remainder, -len);
END LOOP;
RETURN;
END
$$;
This works fine for small strings:
SELECT * FROM regexp_ranges('aaaa aa aaa','a+');
start_pos | end_pos
-----------+---------
1 | 4
6 | 7
10 | 12
(3 rows)
Time: 0.209 ms
But quickly gets slow for longer strings:
SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',10000),'a+');
20001
Time: 98.663 ms
SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',20000),'a+');
40001
Time: 348.027 ms
SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',30000),'a+');
60001
Time: 817.412 ms
SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',40000),'a+');
80001
Time: 1478.438 ms (00:01.478)
Compared to the much nicer observed O-notation for regexp_matches():
SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',10000),'(a+)','g');
20001
Time: 12.602 ms
SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',20000),'(a+)','g');
40001
Time: 25.161 ms
SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',30000),'(a+)','g');
60001
Time: 44.795 ms
SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',40000),'(a+)','g');
80001
Time: 57.292 ms
/Joel
From | Date | Subject | |
---|---|---|---|
Next Message | Justin Pryzby | 2021-02-28 04:18:21 | Re: [HACKERS] Custom compression methods |
Previous Message | David Fetter | 2021-02-28 02:13:48 | Re: regexp_positions() |