Bad plan chosen for union all

From: Alex Reece <awreece(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Bad plan chosen for union all
Date: 2017-11-28 10:13:38
Message-ID: CANywC6DvyXn1bMuo7+UAKvMKaCk9vvD7ez5WPS-t+bAc3U7btQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I'm on PostgreSQL 9.6.5 and getting an awkwardly bad plan chosen for my
query.

I want to do:

select investments.id, cim.yield
FROM contributions
JOIN investments ON contributions.investment_id = investments.id
JOIN contribution_investment_metrics_view cim ON cim.investment_id =
investments.id
WHERE contributions.id IN ('\x58c9c0d3ee944c48b32f814d', '\x11')
Where contribution_investment_metrics_view is morally

select investment_id, first(val) from (select * from contribution_metrics
UNION ALL select * from investment_metrics) group by id

Typically, querying this view is very fast since I have indexes in both
component queries, leading to a very tight plan:

Sort Key: "*SELECT* 1".metric
-> Subquery Scan on "*SELECT* 1" (cost=14.68..14.68 rows=1 width=26)
(actual time=0.043..0.044 rows=2 loops=1)
-> Sort (cost=14.68..14.68 rows=1 width=42) (actual
time=0.042..0.043 rows=2 loops=1)
Sort Key: cm.metric, cm.last_update_on DESC
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=0.14..14.68 rows=1 width=42) (actual
time=0.032..0.034 rows=2 loops=1)
-> Index Scan using contributions_investment_id_idx on
contributions (cost=0.08..4.77 rows=2 width=26) (actual time=0.026..0.027
rows=1 loops=1)
Index Cond: (investment_id = $1)
-> Index Only Scan using
contribution_metrics_contribution_id_metric_last_update_on_idx on
contribution_metrics cm (cost=0.06..4.95 rows=2 width=34) (actual
time=0.005..0.006 r
Index Cond: (contribution_id = contributions.id)
Heap Fetches: 2
-> Subquery Scan on "*SELECT* 2" (cost=0.08..5.86 rows=3 width=26)
(actual time=0.008..0.008 rows=3 loops=1)
-> Index Only Scan using
investment_metrics_investment_id_metric_last_updated_on_idx on
investment_metrics im (cost=0.08..5.85 rows=3 width=42) (actual
time=0.008..0.008 rows=3 loops=1)
Index Cond: (investment_id = $1)
Heap Fetches: 3

Unfortunately, when I try to query this view in the larger query above, I
get a *much* worse plan for this view, leading to >1000x degradation in
performance:

-> Append (cost=10329.18..26290.92 rows=482027 width=26) (actual
time=90.157..324.544 rows=482027 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=10329.18..10349.44 rows=5788
width=26) (actual time=90.157..91.207 rows=5788 loops=1)
-> Sort (cost=10329.18..10332.08 rows=5788 width=42) (actual
time=90.156..90.567 rows=5788 loops=1)
Sort Key: contributions_1.investment_id, cm.metric,
cm.last_update_on DESC
Sort Method: quicksort Memory: 645kB
-> Hash Join (cost=105.62..10256.84 rows=5788 width=42)
(actual time=1.924..85.913 rows=5788 loops=1)
Hash Cond: (contributions_1.id = cm.contribution_id)
-> Seq Scan on contributions contributions_1
(cost=0.00..9694.49 rows=351495 width=26) (actual time=0.003..38.794
rows=351495 loops=1)
-> Hash (cost=85.36..85.36 rows=5788 width=34)
(actual time=1.907..1.907 rows=5788 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 453kB
-> Seq Scan on contribution_metrics cm
(cost=0.00..85.36 rows=5788 width=34) (actual time=0.003..0.936 rows=5788
loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=0.08..15941.48 rows=476239
width=26) (actual time=0.017..203.006 rows=476239 loops=1)
-> Index Only Scan using
investment_metrics_investment_id_metric_last_updated_on_idx1 on
investment_metrics im (cost=0.08..14512.76 rows=476239 width=42) (actual
time=0.016..160.410 rows=476239 l
Heap Fetches: 476239

I've played around with a number of solutions (including lateral joins) and
the closest I can come is:

select investment_id
from contribution_investment_metrics
where investment_id = (
select investments.id
from investments
join contributions on investments.id = contributions.investment_id
where contributions.id = '\x58c9c0d3ee944c48b32f814d'
)

This doesn't really work for my purposes, since I want to project columns
from contributions and investments and I want to run this query on "up to a
handful" contributions at once (maybe more than one, never more than 100).

I'm on PostgreSQL 9.6.5.
Schema and full explain analyzes:
https://gist.github.com/awreece/28c359c6d834717ab299665022b19fd6
I don't think it's relevant, but since
https://wiki.postgresql.org/wiki/SlowQueryQuestions asks -- I'm running in
Heroku.

What are my options here? Currently, I'm planning to avoid these bad plans
by using a less straightforward query for the view:

SELECT
coalesce(contrib.id, cm.contribution_id) AS contribution_id,
coalesce(cm.yield, im.yield) AS yield,
coalesce(cm.term, im.term) AS term
FROM contributions contrib
JOIN investment_metrics_view im ON im.investment_id = contrib.investment_id
FULL OUTER JOIN contribution_metrics_view cm ON cm.contribution_id =
contrib.id

Best,
~Alex Reece

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message bima p 2017-11-28 18:56:24 Invalid mem alloc request on function
Previous Message Andres Freund 2017-11-27 23:57:24 Re: Setting effective_io_concurrency in VM?