Re: Array of integer indexed nested-loop semi join

From: Mickael van der Beek <mickael(dot)van(dot)der(dot)beek(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Pgsql Performance <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Re: Array of integer indexed nested-loop semi join
Date: 2022-05-23 07:57:25
Message-ID: CACM-Oyd=z=qeDnFmp5PBW4+tHecggJie4KcW=L9uS68PCFFu4g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello Jeff,

Sadly, the query you suggested won't work because you are only returning
the first row of the matching inner query rows.
Example:

SELECT
> u.idx,
> u.page_idxs
> FROM
> (
> VALUES
> (1, ARRAY[11, 21, 31]),
> (2, ARRAY[12, 21, 32]),
> (3, ARRAY[13, 23, 31])
> )
> AS u(idx, page_idxs)
> WHERE
> u.page_idxs && ARRAY[(
> SELECT
> p.idx
> FROM
> (
> VALUES
> (11, ARRAY[101, 201, 301]),
> (21, ARRAY[102, 201, 302]),
> (13, ARRAY[103, 203, 301])
> )
> AS p(idx, attribute_idxs)
> WHERE
> p.attribute_idxs && ARRAY[201]
> FETCH FIRST 1 ROWS ONLY
> )]
> ;

This query only returns one row while it should actually return two:

1 {11,21,31}

The INNER JOIN version of the query will return all matching rows but also
include duplicates:

SELECT
> u.idx,
> u.page_idxs
> FROM
> (
> VALUES
> (1, ARRAY[11, 21, 31]),
> (2, ARRAY[12, 21, 32]),
> (3, ARRAY[13, 23, 31])
> )
> AS u(idx, page_idxs)
> INNER JOIN
> (
> SELECT
> p.idx
> FROM
> (
> VALUES
> (11, ARRAY[101, 201, 301]),
> (21, ARRAY[102, 201, 302]),
> (13, ARRAY[103, 203, 301])
> )
> AS p(idx, attribute_idxs)
> WHERE
> p.attribute_idxs && ARRAY[201]
> )
> AS p2
> ON u.page_idxs && ARRAY[p2.idx]
> ;

Results:

1 {11,21,31}
> 1 {11,21,31}
> 2 {12,21,32}

As far as I know, the the IN + sub-expression query can't work since the
left side of the operation is an array of integers and the right side a set
of rows with a single integer column.
The reason I'm using integer arrays is because it is the only way I have
found in PostgreSQL to get fast inclusion / exclusion checks on large
datasets (hundreds of millions of values).
Did I misunderstand your response?
Thank you for the ongoing help,

Mickael

On Mon, May 23, 2022 at 4:45 AM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

>
>
> On Fri, May 20, 2022 at 6:42 AM Mickael van der Beek <
> mickael(dot)van(dot)der(dot)beek(at)gmail(dot)com> wrote:
>
>>
>> Query:
>>
>> EXPLAIN (
>>> ANALYZE,
>>> VERBOSE,
>>> COSTS,
>>> BUFFERS,
>>> TIMING
>>> )
>>> SELECT
>>> fu.w2_page_idxs
>>> FROM
>>> fact_users
>>> AS fu
>>> WHERE
>>> EXISTS (
>>> SELECT
>>> FROM
>>> (
>>> SELECT
>>> ARRAY[idx] AS page_idx
>>> FROM
>>> fact_pages
>>> WHERE
>>> attribute_idxs && ARRAY[300000160]
>>> FETCH FIRST 1 ROWS ONLY
>>> )
>>> AS fp
>>> WHERE
>>> fu.w2_page_idxs && fp.page_idx
>>> )
>>> ;
>>
>>
>> Without any surprises, the planner is using a sequential scan on the
>> "fact_users" table which is very large instead of using the GIN index set
>> on the "w2_page_idxs" column.
>>
>
> For me, using the subquery in and expression, instead of the EXISTS, does
> get it to use the gin index. And I think it must give the same results.
>
> SELECT
> fu.w2_page_idxs
> FROM fact_users AS fu
> WHERE
> fu.w2_page_idxs && ARRAY[(select idx from fact_pages where
> attribute_idxs && ARRAY[3003] FETCH FIRST 1 ROWS ONLY)];
>
> But why are you using intarray? That is unnecessary here, and by creating
> ambiguity about the array operators it might be harmful.
>
> Cheers,
>
> Jeff
>
>>

--
Mickael van der BeekWeb developer & Security analyst

mickael(dot)van(dot)der(dot)beek(at)gmail(dot)com

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Jeff Janes 2022-05-23 14:10:48 Re: Array of integer indexed nested-loop semi join
Previous Message Jeff Janes 2022-05-23 02:45:14 Re: Array of integer indexed nested-loop semi join