When you really want to force a certain join type?

From: Gunther Schadow <raj(at)gusw(dot)net>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: When you really want to force a certain join type?
Date: 2022-12-28 15:39:14
Message-ID: 70b2ee7b-8e8b-d884-8ca2-590a242308b8@gusw.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have a complex query which essentially runs a finite state automaton
through a with recursive union, adding the next state based on the
previous.  This is run at 100,000 or a million start states at the same
time, picking a new record (token), matching it to the FSA (a three-way
join:

token inner join next token * state-transition-table -> next state

I know this doesn't really tell you much. The following might give you a
glimpse:

with recursive Token as (
select * from steps left outer join token using(event)
limit 100000
), StartStates as (
select pathId, start, end, m.new_state as state, m.goalId
from Token w inner join FSA m
on(m.token = w.token and m.old_state = w.state)
), Phrase as (
select pathId, start, end, state, goalId
from StartStates
union all
select p.pathId, p.start, n.end, n.new_state as state, n.goalId
from Phrase p
inner join (
select pathId, start, end, old_state as state, new_state, f.goalId
from Token inner join FSA f using(token)
) n using(pathId, end, state)

There are 100s of thousands of states. This join has a HUGE fan out if
it is not immediately limited by the chaining criterion on the
old_state. So any attempt to use merge join or hash join which will
compute the whole big thing and only then apply the chaining criterion,
will just create massive amounts of sort load and/or humongous hash
tables only to throw the vast majority away every time. But when it runs
through nested loops, the indexes help to make it really quick.

I cannot show you the exact data, but I can show you the plan that works
amazingly fast:

Insert on good_paths (cost=912224.51..912228.71 rows=240 width=302)
CTE token
-> Limit (cost=46728.24..81127.75 rows=100000 width=519)
-> Hash Left Join (cost=46728.24..115752.23 rows=200654 width=519)
... this is creating the start states
CTE path
-> Recursive Union (cost=23293.75..831082.45 rows=241 width=278)
-> Merge Join (cost=23293.75..289809.83 rows=171 width=278)
Merge Cond: ((m.old_state = w_1.state) AND (m.token = w_1.token))
-> Index Scan using fsa_old_state_token_idx on fsa m (cost=0.43..245365.63 rows=4029834 width=28)
-> Materialize (cost=23293.32..23793.32 rows=100000 width=278)
-> Sort (cost=23293.32..23543.32 rows=100000 width=278)
Sort Key: w_1.state, w_1.token
-> CTE Scan on token w_1 (cost=0.00..2000.00 rows=100000 width=278)
-> Nested Loop (cost=18295.78..54126.78 rows=7 width=278)
-> Merge Join (cost=18295.35..19120.16 rows=4275 width=340)
Merge Cond: ((token.pathid = p.pathid) AND (token.start = p.end))
-> Sort (cost=18169.32..18419.32 rows=100000 width=160)
Sort Key: token.pathid, token.start
-> CTE Scan on token (cost=0.00..2000.00 rows=100000 width=160)
-> Sort (cost=126.03..130.30 rows=1710 width=212)
Sort Key: p.pathid, p.end
-> WorkTable Scan on path p (cost=0.00..34.20 rows=1710 width=212)
-> Index Scan using fsa_old_state_token_idx on fsa f (cost=0.43..8.18 rows=1 width=28)

Now, when that initial token list (of start states) grows beyond this
limit of 100,000, the execution plan flips:

Insert on good_paths (cost=2041595.63..2041606.66 rows=630 width=302)
CTE token
-> Limit (cost=46728.24..115752.23 rows=200654 width=519)
-> Hash Left Join (cost=46728.24..115752.23 rows=200654 width=519)
... this is creating the start states
CTE path
-> Recursive Union (cost=47749.96..1925801.45 rows=633 width=278)
-> Merge Join (cost=47749.96..315274.30 rows=343 width=278)
Merge Cond: ((m.old_state = w_1.state) AND (m.token = w_1.token))
-> Index Scan using fsa_old_state_token_idx on fsa m (cost=0.43..245365.63 rows=4029834 width=28)
-> Materialize (cost=47749.53..48752.80 rows=200654 width=278)
-> Sort (cost=47749.53..48251.16 rows=200654 width=278)
Sort Key: w_1.state, w_1.token
-> CTE Scan on token w_1 (cost=0.00..4013.08 rows=200654 width=278)
-> Merge Join (cost=158013.87..161051.45 rows=29 width=278)
Merge Cond: ((token.token = f.token) AND (token.pathid = p.pathid) AND (token.start = p.end))
-> Sort (cost=37459.53..37961.16 rows=200654 width=160)
Sort Key: token.token, token.pathid, token.start
-> CTE Scan on token (cost=0.00..4013.08 rows=200654 width=160)
-> Materialize (cost=120554.35..120966.44 rows=82419 width=228)
-> Sort (cost=120554.35..120760.39 rows=82419 width=228)
Sort Key: f.token, p.pathid, p.end
-> Nested Loop (cost=0.43..104808.55 rows=82419 width=228)
-> WorkTable Scan on path p (cost=0.00..68.60 rows=3430 width=212)
-> Index Scan using fsa_old_state_token_idx on fsa f (cost=0.43..30.30 rows=24 width=28)

Once this merge join kicks in, the query essentially stalls (I mean,
each of the limited components runs in seconds, and I can iteratively
run them so that my initial set of tokens never grows past 100,000, and
then I can complete everything in about linear time, each iteration
takes about linear time proportional with the amount of tokens. But with
the merge join it doesn't complete before several times that amount of time.

I doubt that I can find any trick to give to the planner better data
which it can then use to figure out that the merge join is a bad
proposition.

I wish I could just force it. I probably had this discussion here some
years ago. I think that while the PostgreSQL optimizer is pretty good,
there are situations such as this where its predictions do not work.

Note, for my immediate relief I have forced it by simply set
enable_mergejoin=off. This works fine, except, it converts both into a
nested loop, but the upper merge join was not a problem, and sometimes
(most often) nested loop is a bad choice for bulk data. It's only for
this recursive query it sometimes makes sense.

regards,
-Gunther

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Justin Pryzby 2022-12-28 15:48:37 Re: When you really want to force a certain join type?
Previous Message Frits Jalvingh 2022-12-23 08:12:08 Re: Fwd: temp_file_limit?