Re: Array of integer indexed nested-loop semi join

From: Mickael van der Beek <mickael(dot)van(dot)der(dot)beek(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Pgsql Performance <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Re: Array of integer indexed nested-loop semi join
Date: 2022-05-23 14:46:41
Message-ID: CACM-OyccRwrO0WFfs7JRMqSqzLZK1bGRK9GwptMR1c02OzmdKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello Jeff,

Thank you again for your advice.

I did indeed think of the ARRAY_AGG() version of the query.
Although this method is very fast (and does use indexes) for smallish array
sizes, this is sadly not practical in my case because the arrays of
matching rows can reach multiple hundreds of thousands of rows.
I thought of maybe "batching" the ARRAY_AGG() in batches of max n rows in a
subquery and then calculating intersection on that but it doesn't seem
practical or faster in the end.

> You could just add a DISTINCT to get rid of the duplicates. Of course
that will also take some time on a large returned data set, but probably
less time than scanning a giant table. I think this is probably cleaner
than the alternatives.

Yes, and a GROUP BY will do the trick as well.
The fact that the current solution is a "nested loop" instead of a "nested
loop semi join" means that the query is much slower due to needing to GROUP
BY the rows.
This is why I tried various version using EXISTS, ANY, ARRAY_AGG(), etc
with no avail.
Would you have an idea on why PostgreSQL doesn't use the existing indexes
for this type of subqueries ?

> I don't know if you misunderstood. I meant specifically the intarray
extension. You can use integer arrays with built-in GIN indexes without
help from the intarray extension. Maybe you know that already and are just
saying that the extension is even faster than the built-in indexed
operators are and you need that extra speed.

Are there specific advantages to not using the intarray extension and it's
indexes in this case?
I was under the impression that it supported more operation types and was
generally faster for this niche use case.

Thank you again for your help,

Mickael

On Mon, May 23, 2022 at 4:11 PM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Mon, May 23, 2022 at 3:57 AM Mickael van der Beek <
> mickael(dot)van(dot)der(dot)beek(at)gmail(dot)com> wrote:
>
>> Hello Jeff,
>>
>> Sadly, the query you suggested won't work because you are only returning
>> the first row of the matching inner query rows.
>>
>
> Sure, but the query I replaced did the same thing. (I thought that was
> what you wanted, but I guess that was just to get it to run fast enough to
> ever finish--in that case it is probably better to use EXPLAIN without the
> ANALYZE so that we can see the plan of the correct query). To get around
> that one-row limit you have to write it somewhat differently, getting rid
> of the ARRAY and adding an array_agg():
>
> SELECT fu.*
> FROM
> fact_users AS fu
> WHERE
> fu.w2_page_idxs && (select array_agg(idx) from fact_pages where
> attribute_idxs && ARRAY[201]);
>
> This way of writing it is better, as it still works with the LIMIT 1 but
> also works without it. This still uses the indexes for me, at least when
> enable_seqscan is off.
>
>
>> The INNER JOIN version of the query will return all matching rows but
>> also include duplicates:
>>
>
> You could just add a DISTINCT to get rid of the duplicates. Of course
> that will also take some time on a large returned data set, but probably
> less time than scanning a giant table. I think this is probably cleaner
> than the alternatives.
>
>
>>
>> The reason I'm using integer arrays is because it is the only way I have
>> found in PostgreSQL to get fast inclusion / exclusion checks on large
>> datasets (hundreds of millions of values).
>> Did I misunderstand your response?
>>
>
> I don't know if you misunderstood. I meant specifically the intarray
> extension. You can use integer arrays with built-in GIN indexes without
> help from the intarray extension. Maybe you know that already and are just
> saying that the extension is even faster than the built-in indexed
> operators are and you need that extra speed.
>
> Cheers,
>
> Jeff
>
>>

--
Mickael van der BeekWeb developer & Security analyst

mickael(dot)van(dot)der(dot)beek(at)gmail(dot)com

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message 徐志宇徐 2022-05-24 16:25:28 How to monitor Postgres real memory usage
Previous Message Jeff Janes 2022-05-23 14:10:48 Re: Array of integer indexed nested-loop semi join