Hash join seq scan slow

From: Aldo Sarmiento <aldo(at)bigpurpledot(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Hash join seq scan slow
Date: 2016-04-19 08:07:48
Message-ID: CAHX=r6yJzUgkLbG7ZBKdPRt2kZEHiePL7rpLDwg5bTrOZUH6eg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,

I'm assuming this topic has probably been bludgeoned to a pulp, but my
google-fu can't seem to find a solution.

I have two relatively largish tables that I'm trying to join that result in
a slow query.

Hardware:

2014 iMac w/ SSD & i5 processor

Tables:
contacts: 1.14 million rows
permissions: 2.49 million rows

contacts have many permissions

Goal: get first page of contacts (limit 40) a user has access to, sorted by
creation date.

For simplicity's sake, I've taken out some of the complexity of how I
retrieve the permissions & just got all permissions with id less than
2,100,000 to simulate access of ~151k contacts.

stage4=# EXPLAIN ANALYZE WITH perms AS (
stage4(# SELECT DISTINCT(contact_id) from permissions where id < 2100000
stage4(# ) SELECT
stage4-# contacts.id,
stage4-# contacts.first_name
stage4-# FROM contacts
stage4-# INNER JOIN perms ON perms.contact_id = contacts.id
stage4-# ORDER BY contacts.updated_at desc NULLS LAST LIMIT 40 OFFSET 0;
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=226777.45..226777.55 rows=40 width=20) (actual
time=1556.133..1556.143 rows=40 loops=1)
CTE perms
-> HashAggregate (cost=34891.61..35404.25 rows=51264 width=4)
(actual time=107.107..151.455 rows=151920 loops=1)
Group Key: permissions.contact_id
-> Bitmap Heap Scan on permissions (cost=3561.50..34416.43
rows=190074 width=4) (actual time=14.839..50.445 rows=192372 loops=1)
Recheck Cond: (id < 2100000)
Heap Blocks: exact=2435
-> Bitmap Index Scan on permissions_pkey
(cost=0.00..3513.98 rows=190074 width=0) (actual time=14.496..14.496
rows=192372 loops=1)
Index Cond: (id < 2100000)
-> Sort (cost=191373.19..191501.35 rows=51264 width=20) (actual
time=1556.132..1556.137 rows=40 loops=1)
Sort Key: contacts.updated_at DESC NULLS LAST
Sort Method: top-N heapsort Memory: 27kB
-> Hash Join (cost=180911.60..189752.76 rows=51264 width=20)
(actual time=1124.969..1532.269 rows=124152 loops=1)
Hash Cond: (perms.contact_id = contacts.id)
-> CTE Scan on perms (cost=0.00..1025.28 rows=51264
width=4) (actual time=107.110..203.330 rows=151920 loops=1)
-> Hash (cost=159891.71..159891.71 rows=1144871 width=20)
(actual time=1017.354..1017.354 rows=1145174 loops=1)
Buckets: 65536 Batches: 32 Memory Usage: 2521kB
-> Seq Scan on contacts (cost=0.00..159891.71
rows=1144871 width=20) (actual time=0.035..684.361 rows=1145174 loops=1)
Planning time: 0.222 ms
Execution time: 1561.693 ms
(20 rows)

It is to my understanding that the query requires the entire 150k matched
contacts to be joined in order for the Sort to run its operation. I don't
see a way around this, however, I can't also see how this can be a unique
case where it's acceptable to have a 1.5 second query time? There has to be
lots of other companies out there that manage much more data that needs to
be sorted for the presentation layer.

Thanks in advance,

-Aldo

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Jeff Janes 2016-04-20 00:04:40 Re: Hash join seq scan slow
Previous Message Evgeniy Shishkin 2016-04-15 09:31:59 Re: Slow update on column that is part of exclusion constraint