Re: Slow join over three tables

From: Sameer Kumar <sameer(dot)kumar(at)ashnik(dot)com>
To: Tim van der Linden <tim(at)shisaa(dot)jp>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow join over three tables
Date: 2016-04-27 01:45:55
Message-ID: CADp-Sm52ZSObOR0pHOMeWu+-qAEGFgSURkPRC5NryWKdnkmMww@mail.gmail.com
Lists: pgsql-general

On Wed, 27 Apr 2016 07:28 Tim van der Linden, <tim(at)shisaa(dot)jp> wrote:

> Hi all
>
> I have asked this question in a somewhat different form on the DBA
> Stackexchange site, but without much luck (
> https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables).
> So I apologize for possible double posting, but I hope this might get a
> better response on the mailing list.
>
> I'm joining three fairly large tables together, and it is slow. The tables
> are:
>
> - "reports": 6 million rows
> - "report_drugs": 20 million rows
> - "report_adverses": 20 million rows
>
> The table "reports" holds main report data and has a primary key column
> "id". The other two tables have a foreign key to that table with "rid".
> Those are the columns I use to join them together.
>
> All tables have indexes on the "id"/"rid" columns and on the
> "drug"/"adverse" columns.
>
> The query:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back
> pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
>
> The plan:
>
> Sort (cost=105773.63..105774.46 rows=333 width=76) (actual
> time=5143.162..5143.185 rows=448 loops=1)
> Sort Key: r.created
> Sort Method: quicksort Memory: 60kB
> -> Nested Loop (cost=1.31..105759.68 rows=333 width=76) (actual
> time=54.784..5142.872 rows=448 loops=1)
> Join Filter: (d.rid = a.rid)
> -> Nested Loop (cost=0.87..94657.59 rows=14005 width=72) (actual
> time=0.822..2038.952 rows=14199 loops=1)
>

Quite clearly the nested loop joins are the most costly operations here.

> -> Index Scan using report_drugs_drug_idx on report_drugs d
> (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900
> rows=14199 loops=1)
> Index Cond: (drug = ANY
> ('{359,360,361,362,363}'::integer[]))
> -> Index Scan using reports_id_key on reports r
> (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1
> loops=14199)
> Index Cond: (id = d.rid)
> -> Index Scan using report_adverses_rid_idx on report_adverses a
> (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0
> loops=14199)
> Index Cond: (rid = r.id)
> Filter: (adverse = ANY ('{"back pain -
> nonspecific","nonspecific back pain","back pain"}'::text[]))
> Rows Removed by Filter: 5
>

I suppose so. It might help if the filters were applied before the join. I am
not an expert on the optimizer, but changing the join order and duplicating
the conditions might help:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM report_drugs d
JOIN report_adverses a ON a.rid = d.rid
JOIN reports r ON d.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back
pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;

OR, since you are using an INNER JOIN (as far as I understand the semantics
of joins), it won't hurt the result set if the WHERE clauses are pushed into
the join criteria:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM report_drugs d
JOIN report_adverses a ON a.rid = d.rid
  AND a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain',
  'back pain'])
  AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
JOIN reports r ON d.rid = r.id
ORDER BY r.created;
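Another thing worth trying (an assumption on my part — I don't know your full
index layout): composite indexes that cover both the filter column and the
join column, so a single index scan can satisfy the filter and feed the join.
A sketch, with hypothetical index names:

```sql
-- Hypothetical composite indexes: filter column first, join key second,
-- so the ANY(...) filter and the rid join can be served by one index.
CREATE INDEX report_adverses_adverse_rid_idx
    ON report_adverses (adverse, rid);
CREATE INDEX report_drugs_drug_rid_idx
    ON report_drugs (drug, rid);
```

After creating them, run ANALYZE on the tables so the planner has fresh
statistics before you re-test.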

> Planning time: 13.994 ms
> Execution time: 5143.235 ms
>
> This takes well over 5 seconds, which to me, feels much too slow.
>

Well, given your hardware (which seems pretty decent), I am not sure that is
an unfair expectation.

> If I query each table directly with the same conditions, thus:
>
> SELECT reason
> FROM report_drugs
> WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);
>
> I get:
>
> Index Scan using report_drugs_drug_idx on report_drugs (cost=0.44..500.28
> rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
> Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> Planning time: 6.939 ms
> Execution time: 4.759 ms
>
> Under 5 ms. The same goes for querying the "adverse" column in the
> "report_adverses" table: under 20 ms.
>

> This indicates to me that indeed the join itself causes a major
> performance bottleneck.
>
> I'm running the cluster from an SSD drive, as a traditional HDD could not
> even manage the query in under 5 minutes. The system has a total memory of
> 24 GB, runs on Debian and uses a 4 GHz 8-core i7-4790 processor.
>
> Some important postgresql.conf readouts:
>
> - shared_buffers = 4GB
> - work_mem = 64MB
> - maintenance_work_mem = 1GB
> - checkpoint_segments = 50
> - checkpoint_completion_target = 0.9
> - autovacuum = on
>
> Is there something I am missing here? Any help on getting this join faster
> is much appreciated.
>
> Cheers,
> Tim
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
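One more cheap experiment before touching anything structural (just a guess —
the outcome depends on your data): raise work_mem for the session and re-run
the query under EXPLAIN (ANALYZE, BUFFERS), to see whether the planner
switches from the nested loops to a hash or merge join, and how much of the
data is served from cache:

```sql
-- Session-local change, no restart required.
SET work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain',
  'back pain'])
  AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
ORDER BY r.created;
```

If the plan changes and gets faster, work_mem was the limit; if it sticks
with the nested loops, the row estimates are probably the thing to chase.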
--
Best Regards
Sameer Kumar | DB Solution Architect
*ASHNIK PTE. LTD.*

101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533

T: +65 6438 3504 | M: +65 8110 0350 | www.ashnik.com
