From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "Mark Dilger" <mark(dot)dilger(at)enterprisedb(dot)com>, "Postgres hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Andreas Karlsson" <andreas(at)proxel(dot)se>, "David Fetter" <david(at)fetter(dot)org>, "Gilles Darold" <gilles(at)darold(dot)net>, "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com> |
Subject: | Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] |
Date: | 2021-03-09 19:30:21 |
Message-ID: | bf2222d5-909d-408b-8531-95b32f18d4ab@www.fastmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Mar 9, 2021, at 17:42, Tom Lane wrote:
> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
> > Tom - can you please give details on your unpleasant experiences with parallel arrays?
>
> The problems I can recall running into were basically down to not having
> an easy way to iterate through parallel arrays. There are ways to do
> that in SQL, certainly, but they all constrain how you write the query,
> and usually force ugly stuff like splitting it into sub-selects.
I see now what you mean, many thanks for explaining.
>
> As an example, presuming that regexp_positions is defined along the
> lines of
>
> regexp_positions(str text, pat text, out starts int[], out lengths int[])
> returns setof record
+1
I think this is the most feasible best option so far.
Attached is a patch implementing it this way.
I changed the start to begin at 1, since this is how position ( substring text IN string text ) → integer works.
SELECT * FROM regexp_positions('foobarbequebaz', '^', 'g');
starts | lengths
--------+---------
{1} | {0}
(1 row)
SELECT * FROM regexp_positions('foobarbequebaz', 'ba.', 'g');
starts | lengths
--------+---------
{4} | {3}
{12} | {3}
(2 rows)
Mark's examples:
SELECT * FROM regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
starts | lengths
--------+---------
{7} | {0}
(1 row)
SELECT * FROM regexp_positions('foobarbequebaz', '(?<=z)', 'g');
starts | lengths
--------+---------
{15} | {0}
(1 row)
I've also tested your template queries:
>
> then to actually get the identified substrings you'd have to do something
> like
>
> select
> substring([input string] from starts[i] for lengths[i])
> from
> regexp_positions([input string], [pattern]) r,
> lateral
> generate_series(1, array_length(starts, 1)) i;
select
substring('foobarbequebaz' from starts[i] for lengths[i])
from
regexp_positions('foobarbequebaz', 'ba.', 'g') r,
lateral
generate_series(1, array_length(starts, 1)) i;
substring
-----------
bar
baz
(2 rows)
> I think the last time I confronted this, we didn't have multi-array
> UNNEST. Now that we do, we can get rid of the generate_series(),
> but it's still not beautiful:
>
> select
> substring([input string] from s for l)
> from
> regexp_positions([input string], [pattern]) r,
> lateral
> unnest(starts, lengths) u(s,l);
select
substring('foobarbequebaz' from s for l)
from
regexp_positions('foobarbequebaz', 'ba.', 'g') r,
lateral
unnest(starts, lengths) u(s,l);
substring
-----------
bar
baz
(2 rows)
> Having said that, the other alternative with a 2-D array:
>
> regexp_positions(str text, pat text) returns setof int[]
>
> seems to still need UNNEST, though now it's not the magic multi-array
> UNNEST but this slicing version:
>
> select
> substring([input string] from u[1] for u[2])
> from
> regexp_positions([input string], [pattern]) r,
> lateral
> unnest_slice(r, 1) u;
Unable to test this one since there is no unnest_slice() (yet)
>
> Anyway, I'd counsel trying to write out SQL implementations
> of regexp_matches() and other useful things based on any
> particular regexp_positions() API you might be thinking about.
> Can we do anything useful without a LATERAL UNNEST thingie?
> Are some of them more legible than others?
Hmm, I cannot think of a way.
/Joel
Attachment | Content-Type | Size |
---|---|---|
0004-regexp-positions.patch | application/octet-stream | 9.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Joe Conway | 2021-03-09 19:33:47 | Re: Procedures versus the "fastpath" API |
Previous Message | Matthias van de Meent | 2021-03-09 19:28:19 | Re: Lowering the ever-growing heap->pd_lower |