Re: Slow join over three tables

From: Tim van der Linden <tim(at)shisaa(dot)jp>
To: Alban Hertroys <haramrae(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Subject: Re: Slow join over three tables
Date: 2016-04-27 13:21:52
Message-ID: 20160427222152.1cf02fc73d7af13e01468704@shisaa.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, 27 Apr 2016 13:48:06 +0200
Alban Hertroys <haramrae(at)gmail(dot)com> wrote:

Hi Alban

Thanks for chiming in!

> Since you're not using age and gender in this (particular) query until the rows are combined into a result set already, it doesn't make a whole lot of sense to add them to the index. Moreover, since your'e ordering by created, I'd at least put that directly after id:
>
> create index on reports (id, created);

The ORDER statement did not attribute much to the total query time, but it makes sense to index this.

> In this case, you're using the values in adverse to filter relevant rid's for the FK join, so you might be better off with the inverse of above index:
> create index on report_adverses (adverse, rid);
> create index on report_drugs (drug, rid);

Hold the phone ... you just cracked the code.

I added these two proposed indexes (reversing their order). After purging the OS disk cache, the query runs in (*drum roll*) 120 ms. You managed to speed this up ~40 times.

The query plan:

Sort (cost=12108.99..12109.83 rows=337 width=41) (actual time=119.517..119.531 rows=448 loops=1)
Sort Key: r.created
Sort Method: quicksort Memory: 60kB
-> Nested Loop (cost=10940.25..12094.84 rows=337 width=41) (actual time=26.473..119.392 rows=448 loops=1)
Join Filter: (d.rid = r.id)
-> Merge Join (cost=10939.69..11414.84 rows=337 width=29) (actual time=25.624..33.650 rows=448 loops=1)
Merge Cond: (d.rid = a.rid)
-> Sort (cost=1417.78..1454.02 rows=14496 width=8) (actual time=4.208..4.976 rows=14074 loops=1)
Sort Key: d.rid
Sort Method: quicksort Memory: 1050kB
-> Index Only Scan using report_drugs_drug_rid_idx on report_drugs d (cost=0.44..415.86 rows=14496 width=8) (actual time=0.648..2.236 rows=14200 loops=1)
Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
Heap Fetches: 0
-> Sort (cost=9521.91..9721.56 rows=79860 width=21) (actual time=21.410..24.174 rows=76974 loops=1)
Sort Key: a.rid
Sort Method: quicksort Memory: 6682kB
-> Index Only Scan using report_adverses_adverse_rid_idx on report_adverses a (cost=0.56..3019.24 rows=79860 width=21) (actual time=0.916..10.689 rows=76972 loops=1)
Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
Heap Fetches: 0
-> Index Only Scan using reports_id_age_gender_created_idx on reports r (cost=0.56..2.01 rows=1 width=20) (actual time=0.190..0.191 rows=1 loops=448)
Index Cond: (id = a.rid)
Heap Fetches: 0
Planning time: 15.466 ms
Execution time: 119.871 ms

This is amazing, and the slow down was purely a fact of missing indexes on the sweet spot.
You can now clearly see that the above two indexes are used (report_drugs_drug_rid_idx) and (report_adverses_adverse_rid_idx) and that the nested loop is much faster.

> Do these tables have a primary key and can that be used here?

Only the "reports" table has a PK, the other two don't (only a foreign key to "reports").

> I hope I'm not being superfluous here, but don't forget to vacuum analyze after creating those indexes. Eventually autovacuum will do that for you, but you could otherwise be running tests to verify the impact of adding those indexes before autovacuum gets around to it.

I'm becoming the king of manual VACUUM-ing the past few days ... no need to point that out ;)

> Finally, perhaps it's more efficient to weed out all unrelated drugs and adverses before relating them to reports, something like:
>
> SELECT r.id, r.age, r.gender, r.created, x.adverse, x.drug
> FROM reports r
> JOIN (
> SELECT d.rid, a.adverse, d.drug
> FROM report_drugs d
> JOIN report_adverses a ON a.rid = d.rid
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
> ) x ON x.rid = r.id
> ORDER BY r.created;

With the above indexes created, disk cache flushed, this query ran at the exact same speed, the plan:

Sort (cost=12108.14..12108.99 rows=337 width=41) (actual time=119.102..119.111 rows=448 loops=1)
Sort Key: r.created
Sort Method: quicksort Memory: 60kB
-> Nested Loop (cost=10940.25..12094.00 rows=337 width=41) (actual time=26.797..118.969 rows=448 loops=1)
-> Merge Join (cost=10939.69..11414.84 rows=337 width=29) (actual time=25.899..33.203 rows=448 loops=1)
Merge Cond: (d.rid = a.rid)
-> Sort (cost=1417.78..1454.02 rows=14496 width=8) (actual time=4.319..4.981 rows=14074 loops=1)
Sort Key: d.rid
Sort Method: quicksort Memory: 1050kB
-> Index Only Scan using report_drugs_drug_rid_idx on report_drugs d (cost=0.44..415.86 rows=14496 width=8) (actual time=0.748..2.369 rows=14200 loops=1)
Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
Heap Fetches: 0
-> Sort (cost=9521.91..9721.56 rows=79860 width=21) (actual time=21.573..23.940 rows=76974 loops=1)
Sort Key: a.rid
Sort Method: quicksort Memory: 6682kB
-> Index Only Scan using report_adverses_adverse_rid_idx on report_adverses a (cost=0.56..3019.24 rows=79860 width=21) (actual time=0.922..10.637 rows=76972 loops=1)
Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
Heap Fetches: 0
-> Index Only Scan using reports_id_age_gender_created_idx on reports r (cost=0.56..2.01 rows=1 width=20) (actual time=0.190..0.191 rows=1 loops=448)
Index Cond: (id = a.rid)
Heap Fetches: 0
Planning time: 15.238 ms
Execution time: 119.431 ms

So your hunch was right, not much difference there.

But man ... this query is now flying ... Hat's off to you, sir!

> Alban Hertroys

Cheers,
Tim

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Stefan Weiss 2016-04-27 13:23:21 Filtering the results of UNION ALL vs filtering the separate queries
Previous Message Alban Hertroys 2016-04-27 11:48:06 Re: Slow join over three tables