From: | Renan Alves Fonseca <renanfonseca(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Remove restrictions in recursive query |
Date: | 2025-03-27 21:21:32 |
Message-ID: | CAN_p2QjTZ4M+o=5hkpTR1UYv-ga+2+9EuyHY=ho6ib9ys3O_YQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Mar 27, 2025 at 7:32 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> > I'll assume that the silence about allowing GROUP BY means it is not a
> > great idea...
>
> I don't think there's really anything to keep you from doing this --
> just put the grouping operation where you refer to the recursive CTE,
> instead of inside the recursive CTE itself. I think allowing it to
> appear inside the recursive CTE would be rather confusing -- it's
> probably best if the mandatory UNION operation is at the top level.
>
I don't think I was clear enough about the proposed improvements. I
hope this example will clarify.
Consider a table representing the edges of a directed acyclic graph.
The following lines will create an example of such a table.
-- create DAG example
\set output_degree 10
\set maxN 20000
create temp table dag(source,target) as
(with recursive t1(source,target) as
(select 1,1
union
select target, target * i from t1, generate_series(2,:output_degree +1) i
where target*i< :maxN
) select * from t1 where source!=target );
Then, suppose we want to count the number of distinct paths from the
root (1) to all sink nodes. We use the following auxiliary table for
the sink nodes:
create temp table sink_nodes(node) as
(select target from dag except select source from dag);
The solution supported in PostgreSQL is the following:
with recursive t1(node,path) as
(select 1,array[1]
union
select dag.target, t1.path||dag.target
from t1 join dag on t1.node=dag.source
) select count(*) from t1 join sink_nodes using (node) ;
This query enumerates every path and probably has something like
exponential complexity on the number of edges. It gives these results
in my laptop:
count
---------
1114163
(1 row)
Time: 8121.044 ms (00:08.121)
The solution using GROUP BY in the recursive query is the following:
with recursive t1(node,nb_path) as
(select 1,1::numeric
union all
(select dag.target, sum(nb_path)
from t1 join dag on t1.node=dag.source
group by 1)
) select sum(nb_path) from t1 join sink_nodes using (node) ;
At every iteration step, we sum the number of paths that arrived at a
given node. That is another class of algorithm complexity. The first
query cannot scale to larger inputs, while this one can scale. Here is
the result:
sum
---------
1114163
(1 row)
Time: 27.123 ms
I hope this example made clear that allowing the GROUP BY in the
recursive clause significantly increases the capabilities of doing
analytics on graph data, and it is not only syntax sugar. Please let
me know if there is another way of handling this problem efficiently
without the proposed change.
Thanks again for your attention,
Renan
From | Date | Subject | |
---|---|---|---|
Next Message | Andrei Lepikhov | 2025-03-27 21:34:31 | Re: making EXPLAIN extensible |
Previous Message | Tom Lane | 2025-03-27 21:17:16 | Re: Remove restrictions in recursive query |