Re: Getting an index scan to be a parallel index scan

From: Alex Kaiser <alextkaiser(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pryzby(at)telsasoft(dot)com
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Getting an index scan to be a parallel index scan
Date: 2023-02-02 00:54:17
Message-ID: CAN4ko3Cj9caaEVqdc3WM_LZu=Ud=G2Ko-tZCgi2nUwoLVqSveg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Justin,

I did try changing min_parallel_index_scan_size /
min_parallel_table_scan_size and didn't see any change (the below is with
force_parallel_mode = off):

postgres=# set min_parallel_index_scan_size = 0;
SET
postgres=# set min_parallel_table_scan_size = 0;
SET
postgres=# explain select * from testing where id =
ANY(array[1608377,5449811, ... <removed for brevity> ...
,4654284,3558460]::integer[]);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Scan using testing_pkey on testing (cost=0.43..6138.81 rows=1000
width=74)
Index Cond: (id = ANY ('{1608377,5449811, ... < removed for brevity >
... 4654284,3558460}'::integer[]))
(2 rows)

As for 'force_parallel_mode', while this isn't "debugging PG", it isn't
something that I would actually turn on production, just something I was
playing with to see the cost of parallel queries when the planner might not
think they are the most efficient.

Thomas,

Thanks for the explanation. Yes, that is the query plan I was imagining. I
do see how chopping it up could result in an unfair distribution. But my
counter to that would be that wouldn't chopping it up still be better than
not. If things do happen to work out to be fair, now it's X times as fast,
if things are very unfair, then you haven't really lost much (besides the
parallel overhead) compared to the non-parallel query. Or maybe it should
be possible to do the parallel query if there were some statistics (either
normal ones or extended ones) that told the planner that the result would
probably be fair?

Though I do agree that the "work stealing" option would be the most
efficient, but would be a lot more complicated to code up.

I tried out inserting into a separate table, and as you guessed that
worked. For my production scenario that isn't really feasible, but still
cool to see it work.

postgres=# create table ids(
probe_id int PRIMARY KEY
);

insert into ids(probe_id) values (774494);
insert into ids(probe_id) values (9141914);
...

postgres=# select count(*) from ids;
count
-------
1000
(1 row)

postgres=# explain select * from testing where id in (select * from ids);
QUERY PLAN
-----------------------------------------------------------------------------------------
Gather (cost=0.43..3504.67 rows=1000 width=74)
Workers Planned: 2
-> Nested Loop (cost=0.43..3504.67 rows=417 width=74)
-> Parallel Seq Scan on ids (cost=0.00..9.17 rows=417 width=4)
-> Index Scan using testing_pkey on testing (cost=0.43..8.37
rows=1 width=74)
Index Cond: (id = ids.probe_id)
(6 rows)

Thanks,
Alex Kaiser

On Wed, Feb 1, 2023 at 1:52 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Wed, Feb 1, 2023 at 6:39 PM Alex Kaiser <alextkaiser(at)gmail(dot)com> wrote:
> > select * from testing where id in (1608377,5449811, ... <1000 random
> ids> ,4654284,3558460);
> >
> > Essentially I have a list of 1000 ids and I would like the rows for all
> of those ids.
> >
> > This seems like it would be pretty easy to parallelize, if you have X
> threads then you would split the list of IDs into 1000/X sub lists and give
> one to each thread to go find the rows for ids in the given list. Even
> when I use the following configs I don't get a query plan that actually
> uses any parallelism:
>
> It sounds like the plan you are imagining is something like:
>
> Gather
> Nested Loop Join
> Outer side: <partial scan of your set of constant values>
> Inner side: Index scan of your big table
>
> Such a plan would only give the right answer if each process has a
> non-overlapping subset of the constant values to probe the index with,
> and together they have the whole set. Hypothetically, a planner could
> chop that set up beforehand and and give a different subset to each
> process (just as you could do that yourself using N connections and
> separate queries), but that might be unfair: one process might find
> lots of matches, and the others might find none, because of the
> distribution of data. So you'd ideally want some kind of "work
> stealing" scheme, where each worker can take more values to probe from
> whenever it needs more, so that they all keep working until the values
> run out. We don't have a thing that can do that. You might imagine
> that a CTE could do it, so WITH keys_to_look_up AS (VALUES (1), (2),
> ...) SELECT ... JOIN ON ..., but that also doesn't work because we
> don't have a way to do "partial" scans of CTEs either (though someone
> could invent that). Likewise for temporary tables: they are invisible
> to parallel workers, so they can't help us. I have contemplated
> "partial function scans" for set-returning functions, where a function
> could be given a bit of shared memory and various other infrastructure
> to be able to be "parallel aware" (= able to coordinate across
> processes so that each process gets a subset of the data), and one
> could imagine that that would allow various solutions to the problem,
> but that's vapourware.
>
> But you can get a plan like that if you insert all those values into a
> regular table, depending on various settings, stats and
> min_parallel_table_scan_size (try 0, I guess that'll definitely do
> it). Which probably isn't the answer you wanted to hear.
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Thomas Munro 2023-02-02 01:48:14 Re: Getting an index scan to be a parallel index scan
Previous Message Thomas Munro 2023-02-01 21:51:38 Re: Getting an index scan to be a parallel index scan