Apparent missed query optimization with self-join and inner grouping

From: Zack Weinberg <zackw(at)panix(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Apparent missed query optimization with self-join and inner grouping
Date: 2020-07-31 15:07:39
Message-ID: CAKCAbMh5V51XS1UQukk9t49tz5PfdTd56e61Hjt6u5wRVQf93g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I have a table recording the results of a web crawl. (Table
definition at the end of this message.) The relevant part of the data
stored in it looks like this:

id | url_id | full_url_id | experiment_id | redirect_num
------+--------+-------------+---------------+--------------
2617 | 1312 | 1312 | 16 | 0
2618 | 1312 | 2311 | 16 | 1
2619 | 1312 | 2312 | 16 | 2
2620 | 1312 | 2313 | 16 | 3
2631 | 1320 | 1320 | 43 | 0
2633 | 1320 | 2312 | 43 | 2
2632 | 1320 | 2317 | 43 | 1
2634 | 1320 | 2318 | 43 | 3

For each (experiment_id, url_id) pair for some small subset of the
experiment_ids, I need to query the full_url_id corresponding to the
*largest* value of redirect_num. The query planner does something
reasonable with this SELECT:

=> explain (analyze, verbose)
select b.experiment_id, b.url_id, b.full_url_id
from blockpage b,
(select experiment_id, url_id, max(redirect_num) as redirect_num
from blockpage group by experiment_id, url_id) bm
where b.experiment_id = bm.experiment_id
and b.url_id = bm.url_id
and b.redirect_num = bm.redirect_num
and bm.experiment_id in (16, 43);

Nested Loop (cost=1.14..88505.96 rows=20 width=12) (actual
time=0.041..1.723 rows=501 loops=1)
Output: b.experiment_id, b.url_id, b.full_url_id
-> GroupAggregate (cost=0.57..15442.73 rows=8543 width=12)
(actual time=0.033..0.501 rows=501 loops=1)
Output: blockpage.experiment_id, blockpage.url_id,
max(blockpage.redirect_num)
Group Key: blockpage.experiment_id, blockpage.url_id
-> Index Only Scan using
blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on
iclab.blockpage (cost=0.57..15293.19 rows=8547 width=12) (actual
time=0.026..0.283 rows=803 loops=1)
Output: blockpage.experiment_id, blockpage.url_id,
blockpage.full_url_id, blockpage.redirect_num, blockpage.html_tag_id
Index Cond: (blockpage.experiment_id = ANY
('{16,43}'::integer[]))
Heap Fetches: 803
-> Index Only Scan using
blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on
iclab.blockpage b (cost=0.57..8.53 rows=1 width=16) (actual
time=0.002..0.002 rows=1 loops=501)
Output: b.experiment_id, b.url_id, b.full_url_id,
b.redirect_num, b.html_tag_id
Index Cond: ((b.experiment_id = blockpage.experiment_id) AND
(b.url_id = blockpage.url_id) AND (b.redirect_num =
(max(blockpage.redirect_num))))
Heap Fetches: 501
Planning Time: 0.331 ms
Execution Time: 1.784 ms

But if I change the final part of the WHERE to reference
b.experiment_id instead of bm.experiment_id, I get this much more
expensive query plan:

=> explain (analyze, verbose)
select b.experiment_id, b.url_id, b.full_url_id
from blockpage b,
(select experiment_id, url_id, max(redirect_num) as redirect_num
from blockpage group by experiment_id, url_id) bm
where b.experiment_id = bm.experiment_id
and b.url_id = bm.url_id
and b.redirect_num = bm.redirect_num
and b.experiment_id in (16, 43);

Hash Join (cost=2749504.19..2764864.13 rows=2 width=12) (actual
time=144028.343..144029.545 rows=501 loops=1)
Output: b.experiment_id, b.url_id, b.full_url_id
Inner Unique: true
Hash Cond: ((b.experiment_id = blockpage.experiment_id) AND
(b.url_id = blockpage.url_id) AND (b.redirect_num =
(max(blockpage.redirect_num))))
-> Index Only Scan using
blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on
iclab.blockpage b (cost=0.57..15293.19 rows=8547 width=16) (actual
time=0.039..0.387 rows=803 loops=1)
Output: b.experiment_id, b.url_id, b.full_url_id,
b.redirect_num, b.html_tag_id
Index Cond: (b.experiment_id = ANY ('{16,43}'::integer[]))
Heap Fetches: 803
-> Hash (cost=2595219.62..2595219.62 rows=8816229 width=12)
(actual time=143941.931..143941.931 rows=57061228 loops=1)
Output: blockpage.experiment_id, blockpage.url_id,
(max(blockpage.redirect_num))
Buckets: 67108864 (originally 16777216) Batches: 1
(originally 1) Memory Usage: 2976138kB
-> HashAggregate (cost=2418895.04..2507057.33 rows=8816229
width=12) (actual time=90020.851..122656.924 rows=57061228 loops=1)
Output: blockpage.experiment_id, blockpage.url_id,
max(blockpage.redirect_num)
Group Key: blockpage.experiment_id, blockpage.url_id
-> Seq Scan on iclab.blockpage (cost=0.00..1757677.88
rows=88162288 width=12) (actual time=0.020..32910.709 rows=88164599
loops=1)
Output: blockpage.id, blockpage.url_id,
blockpage.full_url_id, blockpage.experiment_id,
blockpage.blockpage_reason_id, blockpage.html_tag_id,
blockpage.body_len, blockpage.blockpage_diff,
blockpage.has_blockpage_regex, blockpage.http_status,
blockpage.redirect_num, blockpage.response_failure
Planning Time: 0.409 ms
JIT:
Functions: 17
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 2.098 ms, Inlining 27.226 ms, Optimization
140.416 ms, Emission 79.840 ms, Total 249.580 ms
Execution Time: 145108.471 ms

Since we have `b.experiment_id = bm.experiment_id` as a join
condition, it seems to me that the query planner ought to have
recognized that the `experiment_id in (16, 43)` condition could be
pushed down into the sub-select in both cases.

What is the best way to report this to the developers? Should I file
a bug report? I'm using Postgres 12.2.

Thanks,
zw

CREATE TABLE iclab.blockpage (
id BIGSERIAL PRIMARY KEY,
url_id INTEGER NOT NULL,
full_url_id INTEGER NOT NULL,
experiment_id INTEGER NOT NULL,
blockpage_reason_id INTEGER,
html_tag_id INTEGER NOT NULL,
body_len INTEGER,
blockpage_diff REAL,
has_blockpage_regex BOOLEAN,
http_status INTEGER,
redirect_num INTEGER NOT NULL,
response_failure INTEGER,
FOREIGN KEY (experiment_id) REFERENCES iclab.experiment_results(id),
FOREIGN KEY(url_id) REFERENCES iclab.URL(id),
FOREIGN KEY(url_id) REFERENCES iclab.URL(id),
FOREIGN KEY(blockpage_reason_id) REFERENCES iclab.blockpage_reason(id),
FOREIGN KEY(html_tag_id) REFERENCES iclab.html_tag(id)
);

create unique index
blockpage__experiment_id_url_id_redirect_num_blockpage_reason
on iclab.blockpage(experiment_id , url_id, full_url_id,
redirect_num, html_tag_id);

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ben Chobot 2020-08-01 14:55:27 12.3 replicas falling over during WAL redo
Previous Message Allan Kamau 2020-07-31 10:50:27 Re: Querying PostgreSQL / PostGIS Databases in Python