Re: Do we want a hashset type?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Ants Aasma" <ants(at)cybertec(dot)at>
Cc: "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-05 09:27:53
Message-ID: 1c59f2c3-a70f-4727-b7e1-b7aea62d818a@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 2, 2023, at 10:01, Ants Aasma wrote:
> Have you looked at roaring bitmaps? There is a pg_roaringbitmap
> extension [1] already available that offers very fast unions,
> intersections and membership tests over integer sets. I used it to get
> some pretty impressive performance results for faceting search on
> large document sets. [2]

Many thanks for the tip!

New benchmark:

We already had since before:
> wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
> gunzip soc-pokec-relationships.txt.gz
> CREATE TABLE edges (from_node INT, to_node INT);
> \copy edges from soc-pokec-relationships.txt;
> ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node);

I've created a new `users` table from the `edges` table,
with a new `friends` roaringbitmap column:

CREATE TABLE users AS
SELECT from_node AS id, rb_build_agg(to_node) AS friends FROM edges GROUP BY 1;
ALTER TABLE users ADD PRIMARY KEY (id);

Old query from before:

CREATE OR REPLACE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
SELECT
ARRAY[5867::bigint] AS current,
0 AS depth
UNION ALL
SELECT
new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
array_agg(DISTINCT edges.to_node) AS new_current
FROM
edges
WHERE
from_node = ANY(friends_of_friends.current)
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
coalesce(array_length(current, 1), 0) AS count_friends_at_depth_3
FROM
friends_of_friends
WHERE
depth = 3;
;

New roaringbitmap-based query using users instead:

CREATE OR REPLACE VIEW friends_of_friends_roaringbitmap AS
WITH RECURSIVE friends_of_friends AS
(
SELECT
friends,
1 AS depth
FROM users WHERE id = 5867
UNION ALL
SELECT
new_friends,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
rb_or_agg(users.friends) AS new_friends
FROM
users
WHERE
users.id = ANY(rb_to_array(friends_of_friends.friends))
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
rb_cardinality(friends) AS count_friends_at_depth_3
FROM
friends_of_friends
WHERE
depth = 3
;

Note, depth is 1 at first level since we already have user 5867's friends in the users column.

Maybe there is a better way to make use of the btree index on users.id,
than to convert the roaringbitmap to an array in order to use = ANY(...),
that is, this part: `users.id = ANY(rb_to_array(friends_of_friends.friends))`?

Or maybe there is some entirely different but equivalent way of writing the query
in a more efficient way?

SELECT * FROM friends_of_friends;
count_friends_at_depth_3
--------------------------
1035293
(1 row)

SELECT * FROM friends_of_friends_roaringbitmap;
count_friends_at_depth_3
--------------------------
1035293
(1 row)

EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on friends_of_friends (cost=5722.03..5722.73 rows=1 width=4) (actual time=2232.896..2233.289 rows=1 loops=1)
Filter: (depth = 3)
Rows Removed by Filter: 3
CTE friends_of_friends
-> Recursive Union (cost=0.00..5722.03 rows=31 width=36) (actual time=0.003..2228.707 rows=4 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=36) (actual time=0.001..0.001 rows=1 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=190.59..572.17 rows=3 width=36) (actual time=556.806..556.837 rows=1 loops=4)
-> Nested Loop (cost=190.59..572.06 rows=3 width=36) (actual time=553.748..553.748 rows=1 loops=4)
-> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=36) (actual time=0.000..0.001 rows=1 loops=4)
Filter: (depth < 3)
Rows Removed by Filter: 0
-> Aggregate (cost=190.59..190.60 rows=1 width=32) (actual time=737.427..737.427 rows=1 loops=3)
-> Sort (cost=179.45..185.02 rows=2227 width=4) (actual time=577.192..649.812 rows=3486910 loops=3)
Sort Key: edges.to_node
Sort Method: quicksort Memory: 393217kB
-> Index Only Scan using edges_pkey on edges (cost=0.56..55.62 rows=2227 width=4) (actual time=0.027..225.609 rows=3486910 loops=3)
Index Cond: (from_node = ANY (friends_of_friends_1.current))
Heap Fetches: 0
Planning Time: 0.294 ms
Execution Time: 2240.446 ms

EXPLAIN ANALYZE SELECT * FROM friends_of_friends_roaringbitmap;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on friends_of_friends (cost=799.30..800.00 rows=1 width=8) (actual time=492.925..492.930 rows=1 loops=1)
Filter: (depth = 3)
Rows Removed by Filter: 2
CTE friends_of_friends
-> Recursive Union (cost=0.43..799.30 rows=31 width=118) (actual time=0.061..492.842 rows=3 loops=1)
-> Index Scan using users_pkey on users (cost=0.43..2.65 rows=1 width=118) (actual time=0.060..0.062 rows=1 loops=1)
Index Cond: (id = 5867)
-> Nested Loop (cost=26.45..79.63 rows=3 width=36) (actual time=164.244..164.244 rows=1 loops=3)
-> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=36) (actual time=0.000..0.001 rows=1 loops=3)
Filter: (depth < 3)
Rows Removed by Filter: 0
-> Aggregate (cost=26.45..26.46 rows=1 width=32) (actual time=246.359..246.359 rows=1 loops=2)
-> Index Scan using users_pkey on users users_1 (cost=0.43..26.42 rows=10 width=114) (actual time=0.074..132.318 rows=116336 loops=2)
Index Cond: (id = ANY (rb_to_array(friends_of_friends_1.friends)))
Planning Time: 0.257 ms
Execution Time: 493.134 ms

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-06-05 09:30:01 Re: Support logical replication of DDLs
Previous Message vignesh C 2023-06-05 09:18:04 Re: [PoC] pg_upgrade: allow to upgrade publisher node