Fwd: Array of integer indexed nested-loop semi join

From: Mickael van der Beek <mickael(dot)van(dot)der(dot)beek(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Fwd: Array of integer indexed nested-loop semi join
Date: 2022-04-27 12:18:38
Message-ID: CACM-Oyeu7sHLXNfVmX-_BQboJWTeCPEafpwjiKMAmzh-c9o5mQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello everyone,

*1) Context*

I'm working with large tables containing arrays of integers, indexed with "
*gin__int_ops*" GIN indexes offered by the "*intarray*" extension.
The goal I'm trying to achieve is to do a "nested loop semi join" using the
array inclusion operation (@>) as join condition but in an indexed way.
(Basically an INNER JOIN without the duplicate rows and without needing to
use columns from the joined table.)

*2) Configuration*

The queries are run on a PostgreSQL v14 server with 32GB RAM and 8 vCPUs on
a 64 bit ARM Neoverse architecture (m6g.2xlarge AWS RDS instance).
PostgreSQL's configuration uses the following key values:

- work_mem = 8GB (only set for this query)
- shared_buffers = 8GB
- effective_cache_size = 22GB
- max_worker_processes = 8
- max_parallel_workers_per_gather = 4
- jit = on

*3) Tables*

The "light_pages_attributes" contains about 2 million rows, each with an
"attributes" column containing on average 20 integers.

CREATE TABLE
> light_pages_attributes
> (
> id INTEGER NOT NULL,
> "attributes" INTEGER[] NOT NULL
> )
> ;
> CREATE INDEX
> light_pages_attributes_attributes
> ON
> light_pages_attributes
> USING
> gin
> (
> attributes gin__int_ops
> )
> ;

The "light_pages_views" contains about 25 million rows, each with a
"page_ids" column containing on average 20 integers as well.

CREATE TABLE
> light_pages_views
> (
> vector_id BIGINT NOT NULL,
> page_ids INTEGER[] NOT NULL
> )
> ;
> CREATE INDEX
> light_pages_views_page_ids
> ON
> light_pages_views
> USING
> gin
> (
> page_ids gin__int_ops
> )
> ;

*4) Query*

The query I'm trying to optimise is the following:

BEGIN;

SET LOCAL work_mem = '8GB';

CREATE TEMPORARY VIEW
> urls
> AS
> (
> SELECT ARRAY[lpa.id]
> AS page_id
> FROM
> light_pages_attributes
> AS lpa
> WHERE
> lpa."attributes" @> ARRAY[189376]
> );
> EXPLAIN (
> ANALYZE,
> VERBOSE,
> COSTS,
> BUFFERS,
> TIMING
> )
> SELECT
> COUNT(*)
> FROM
> light_pages_views
> AS lpv
> WHERE
> EXISTS (
> SELECT
> 1
> FROM
> urls
> AS u
> WHERE
> lpv.page_ids @> u.page_id
> )
> ;

COMMIT;

The last query does not finish after waiting for more than 15 minutes.
(The temporary view creation is very fast and required due to the same
query in a CTE greatly reducing performance (by more than 5 min.) due to
the optimisation barrier I'm guessing.)
This alternative query, which should be far slower due to the fact that it
generates duplicate lines through the INNER JOIN, is in fact much faster, 1
min. and 39 s.:

EXPLAIN (
> ANALYZE,
> VERBOSE,
> COSTS,
> BUFFERS,
> TIMING
> )
> SELECT
> COUNT(*)
> FROM
> (
> SELECT
> 1
> FROM
> light_pages_views
> AS lpv
> INNER JOIN
> urls
> AS u
> ON lpv.page_ids @> u.page_id
> GROUP BY
> lpv.vector_id
> )
> AS t
> ;

Visual query plan: https://explain.dalibo.com/plan/bc3#plan
Raw query plan: https://explain.dalibo.com/plan/bc3#raw

Other strategies I've tried as well:

- lpv.page_ids @> ANY(SELECT u.page_id FROM urls AS u)
- FULL OUTER JOIN, not possible due to the condition not being
merge-joinable

The end-goal would be to update all matching "light_pages_views" rows by
appending an integer to their array of integer.
So possibly millions of tows to be updated.

Thank you a lot in advance for your help!

Mickael

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Jeff Janes 2022-04-27 14:28:16 Re: Array of integer indexed nested-loop semi join
Previous Message David Rowley 2022-04-27 08:22:12 Re: Performance differential when 0 values present vs when 1 values present. Planner return 52k rows when 0 expected.