Merge joins on index scans

From: James Parks <james(dot)parks(at)meraki(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Merge joins on index scans
Date: 2016-02-26 22:07:57
Message-ID: CAJ3Xv+hGjhFk_GSZzh2YqJFpSGStwmsNeM_9yLd=T_UHWdguSA@mail.gmail.com
Lists: pgsql-performance

Dear pgsql-performance,

I'm having issues with a certain query, and I was hoping you could help me
out.

The schema:

(Start with a new database cluster, with either SQL_ASCII or en_US.UTF-8
encoding, using the default server configuration shipped in the pgdg
Jessie packages.)

CREATE TABLE a (id bigint primary key, nonce bigint);
CREATE TABLE b (id bigint primary key, a_id bigint not null);
CREATE INDEX a_idx ON b (a_id);

The query:

SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = ? ORDER BY b.id
ASC;

(skip down to [1] and [2] to see the query performance)
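If the raw CSVs aren't needed, here is a rough synthetic approximation of the data (an assumption on my part -- the real data set is not exactly this, but the row counts and the heavy duplication of one a_id match the statistics quoted below):

```sql
-- Synthetic data with a similar shape (approximation, not the real set):
-- ~490k rows in (a), ~198 distinct nonce values, ~83k rows in (b),
-- with one a_id (19084) duplicated ~83k times, as in the stats below.
INSERT INTO a
  SELECT i, (random() * 197)::bigint
  FROM generate_series(1, 490166) AS i;
INSERT INTO b
  SELECT i,
         CASE WHEN i <= 82977 THEN 19084
              ELSE (random() * 490165)::bigint + 1 END
  FROM generate_series(1, 83427) AS i;
ANALYZE a;
ANALYZE b;
```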

What I know:

If you force the query planner to use a merge join on the above query, it
takes 10+ minutes to complete using the data described below. If you force
the query planner to use a hash join on the same data, it takes ~200
milliseconds. This behavior is the same on both PostgreSQL 9.2.15 and
PostgreSQL 9.4.6 (at least as provided by the Debian Jessie repo hosted by
postgresql.org), and happens on both i386 and x86_64 platforms. Note: the
data for these queries is the same as described below. Let me know if you
want me to provide the raw CSVs or similar.
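For reference, the standard way to force one plan or the other for a single session is via the planner's enable_* settings; something like the following (the particular flags shown are my assumption about the minimal set needed):

```sql
-- Force the hash join for this session:
SET enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT b.* FROM b JOIN a ON b.a_id = a.id
WHERE a.nonce = 64 ORDER BY b.id ASC;
RESET enable_mergejoin;

-- Or force the merge join instead:
SET enable_hashjoin = off;
SET enable_nestloop = off;
```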

Creating a new Postgresql 9.4 cluster (64-bit), creating the tables (a) and
(b), importing the table data into the tables (a) and (b), and then running
the above query using EXPLAIN results in a merge join query plan, as in [1].

Creating a new Postgresql 9.2 cluster (32-bit or 64-bit), creating the
tables (a) and (b), importing the table data into (a) and (b), and then
running the above query results in a hash join query plan, as in [2].

When running query [1], the postgres process on the machine consumes 100%
CPU for a long time (it seems CPU-bound).

What I expected:

I expected both the hash join and merge join implementations of this query
to have comparable run times, perhaps within an order of magnitude of each
other. I expected this mostly because the cost estimates for the two plans
were very similar. Instead, the supposedly "optimal" query plan takes more
than 1000x longer.

I also expected that the "Rows Removed by Filter:" count for the index scan
on (a) would not be so high, since table (a) contains only ~500,000 rows,
vastly fewer than the reported count (2,201,063,696).

What I want to know:

- Is this expected behavior? Can you describe how the merge join algorithm
achieves these results?
- Can I avoid this issue by disabling merge joins in the server
configuration?
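The workaround I have in mind for the second question would just be the cluster-wide setting (assuming no other queries regress when merge joins are disabled):

```sql
-- postgresql.conf: enable_mergejoin = off
-- or, per session:
SET enable_mergejoin = off;
```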

Configuration:

The configuration of the database is the sample configuration shipped with
the Debian Jessie packages of PostgreSQL available at
http://apt.postgresql.org/pub/repos/apt/ with the exception that the data
directory was explicitly specified.

Information about the data:

Here are some queries that help describe the data I'm working with:

postgres=# select distinct a_id, count(*) from b group by a_id;
a_id | count
--------+-------
49872 | 320
47994 | 5
19084 | 82977
53251 | 100
109804 | 10
51738 | 5
49077 | 10

postgres=# select count(*) from b;
count
-------
83427

postgres=# select count(distinct nonce) from a;
count
-------
198

postgres=# select count(*) from a;
count
--------
490166

postgres=# select count(*) from a where nonce = 64;
count
-------
395

Hardware:

2015-era Intel Xeon processors
more than 300 GB of RAM (greater than the size of the database by a large margin)
database on a hardware RAID 1 array of 2 SSDs

Commentary:

This is my first bug report to a major open source project, so I apologize
in advance if I messed up this report. Let me know if I have left out key
details -- I'm happy to provide them.

Given that there are roughly 500k rows in table (a), the EXPLAIN output's
claim that the filter (nonce = 64) caused 2 billion rows to be skipped
(refer to [1]) suggests that each row in table (b) is being compared
against a non-negligible fraction of the rows in (a).
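A quick back-of-the-envelope check on the numbers from [1] (my arithmetic, not anything measured):

```sql
-- 2,201,063,696 rows removed by the filter, across 83,427 outer rows:
SELECT 2201063696::bigint / 83427 AS a_rows_visited_per_b_row;
-- integer division gives 26383, i.e. each (b) row appears to walk
-- roughly 26k rows of (a) on average
```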

I can probably make this data available as a pg_dump file. Let me know if
you think that's necessary, and where I should upload it.

Regards,
James

[1]
postgres=# explain (analyze,buffers) select b.* from b join a on b.a_id =
a.id where a.nonce = 64 order by b.id asc;
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=7855.25..7855.61 rows=143 width=16) (actual
time=752058.415..752080.731 rows=83427 loops=1)
Sort Key: b.id
Sort Method: external merge Disk: 2096kB
Buffers: shared hit=2151025721 read=479, temp read=267 written=267
I/O Timings: read=2.384
-> Merge Join (cost=869.07..7850.13 rows=143 width=16) (actual
time=5.718..751760.637 rows=83427 loops=1)
Merge Cond: (b.a_id = a.id)
Buffers: shared hit=2151025721 read=479
I/O Timings: read=2.384
-> Index Scan using a_idx on b (cost=0.00..2953.35 rows=83427
width=16) (actual time=0.007..68.165 rows=83427 loops=1)
Buffers: shared hit=1303 read=139
I/O Timings: read=1.369
-> Index Scan using a_pkey on a (cost=0.00..26163.20 rows=843
width=8) (actual time=5.706..751385.306 rows=83658 loops=1)
Filter: (nonce = 64)
Rows Removed by Filter: 2201063696
Buffers: shared hit=2151024418 read=340
I/O Timings: read=1.015
Total runtime: 752092.206 ms

[2]
postgres=# explain (analyze,buffers) select b.* from b join a on b.a_id =
a.id where a.nonce = 64 order by b.id asc;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------
Sort (cost=10392.28..10392.64 rows=143 width=16) (actual
time=164.415..186.297 rows=83427 loops=1)
Sort Key: b.id
Sort Method: external merge Disk: 2112kB
Buffers: shared hit=2514 read=587, temp read=267 written=267
I/O Timings: read=1.199
-> Hash Join (cost=8787.61..10387.16 rows=143 width=16) (actual
time=61.836..113.434 rows=83427 loops=1)
Hash Cond: (b.a_id = a.id)
Buffers: shared hit=2514 read=587
I/O Timings: read=1.199
-> Seq Scan on b (cost=0.00..1285.27 rows=83427 width=16)
(actual time=0.011..15.826 rows=83427 loops=1)
Buffers: shared hit=449 read=2
I/O Timings: read=0.009
-> Hash (cost=8777.08..8777.08 rows=843 width=8) (actual
time=61.812..61.812 rows=395 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 16kB
Buffers: shared hit=2065 read=585
I/O Timings: read=1.190
-> Seq Scan on a (cost=0.00..8777.08 rows=843 width=8)
(actual time=0.143..61.609 rows=395 loops=1)
Filter: (nonce = 64)
Rows Removed by Filter: 489771
Buffers: shared hit=2065 read=585
I/O Timings: read=1.190
Total runtime: 198.014 ms

The above numbers were acquired using this version of Postgresql:
PostgreSQL 9.2.15 on x86_64-unknown-linux-gnu, compiled by gcc (Debian
4.9.2-10) 4.9.2, 64-bit
