From: | Thomas Mayer <thomas(dot)mayer(at)student(dot)kit(dot)edu> |
---|---|
To: | pgsql-performance <pgsql-performance(at)postgresql(dot)org> |
Subject: | window function induces full table scan |
Date: | 2014-01-02 21:32:01 |
Message-ID: | 52C5DAD1.20109@student.kit.edu |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
When querying a view with a WHERE condition, postgresql normally is able
to perform an index scan which reduces time for evaluation dramatically.
However, if a window function is evaluated in the view, postgresql is
evaluating the window function before the WHERE condition is applied.
This induces a full table scan.
These are the results of EXPLAIN:
-- without window function (non-equivalent)
explain select * from without_window_function where user_id = 43;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Index Scan using idx_checkin_node_user_id on checkin_node
(cost=0.43..26.06 rows=2 width=20)
Index Cond: (user_id = 43)
Filter: (((id % 1000) + 1) = 1)
-- with window function
explain select * from last_position where user_id = 43;
QUERY PLAN
------------------------------------------------------------------------------------------
Subquery Scan on tmp_last_position (cost=973803.66..1151820.09 rows=2
width=20)
Filter: ((tmp_last_position.datepos = 1) AND
(tmp_last_position.user_id = 43))
-> WindowAgg (cost=973803.66..1080613.52 rows=4747105 width=32)
-> Sort (cost=973803.66..985671.42 rows=4747105 width=32)
Sort Key: checkin_node.user_id, checkin_node.date,
checkin_node.id
-> Seq Scan on checkin_node (cost=0.00..106647.05
rows=4747105 width=32)
To work around this, I avoid using a view for that (equivalent):
EXPLAIN SELECT user_id, latitude, longitude
FROM (
SELECT
user_id,
latitude,
longitude,
rank() OVER (PARTITION BY user_id ORDER BY date DESC, id
DESC) AS datepos
FROM checkin_node
WHERE user_id = 43
) AS tmp_last_position
WHERE datepos = 1; -- takes 2 ms
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Subquery Scan on tmp_last_position (cost=39.70..52.22 rows=2 width=20)
Filter: (tmp_last_position.datepos = 1)
-> WindowAgg (cost=39.70..47.40 rows=385 width=32)
-> Sort (cost=39.70..40.67 rows=385 width=32)
Sort Key: checkin_node.date, checkin_node.id
-> Index Scan using idx_checkin_node_user_id on
checkin_node (cost=0.43..23.17 rows=385 width=32)
Index Cond: (user_id = 43)
I would expect postgresql to apply this query plan also for the view
last_position. It's 6621ms vs. 2ms, so the speedup is 3310!
Is it a bug in the optimizer?
How to reproduce:
=================
OS: ubuntu 12.04
Postgresql v9.3.2
get some sample data:
wget -qO-
http://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz|gunzip
-c|dos2unix|awk '{ if (length($0) > 20) print }'>test.csv
execute psql script:
\timing on
BEGIN;
DROP TABLE IF EXISTS checkin_node CASCADE;
CREATE TABLE checkin_node (
id SERIAL NOT NULL PRIMARY KEY,
user_id INTEGER NOT NULL,
date TIMESTAMP NOT NULL,
latitude DOUBLE PRECISION NOT NULL,
longitude DOUBLE PRECISION NOT NULL,
original_id VARCHAR NOT NULL
);
\COPY checkin_node (user_id, date, latitude, longitude, original_id)
FROM 'test.csv' WITH DELIMITER E'\t';
ALTER TABLE checkin_node DROP COLUMN original_id;
CREATE INDEX idx_checkin_node_user_id ON checkin_node(user_id);
CREATE INDEX idx_checkin_node_date ON checkin_node(date);
COMMIT;
VACUUM ANALYZE checkin_node;
-- doing window function in a view
DROP VIEW IF EXISTS last_position CASCADE;
CREATE VIEW last_position (user_id, latitude, longitude) AS (
SELECT user_id, latitude, longitude
FROM (
SELECT
user_id,
latitude,
longitude,
rank() OVER (PARTITION BY user_id ORDER BY date DESC, id
DESC) AS datepos
FROM checkin_node
) AS tmp_last_position
WHERE datepos = 1
);
select * from last_position where user_id = 43; -- takes 6621ms
-- similar view but without window function (non-equivalent)
DROP VIEW IF EXISTS without_window_function CASCADE;
CREATE VIEW without_window_function (user_id, latitude, longitude) AS (
SELECT user_id, latitude, longitude
FROM (
SELECT
user_id,
latitude,
longitude,
(id % 1000)+1 AS datepos --to not use a constant here
FROM checkin_node
) AS tmp_last_position
WHERE datepos = 1
);
select * from without_window_function where user_id = 43; -- takes 10ms
-- workaround: avoid using views (equivalent)
SELECT user_id, latitude, longitude
FROM (
SELECT
user_id,
latitude,
longitude,
rank() OVER (PARTITION BY user_id ORDER BY date DESC, id DESC)
AS datepos
FROM checkin_node
WHERE user_id = 43
) AS tmp_last_position
WHERE datepos = 1; -- takes 2 ms
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2014-01-02 21:52:50 | Re: window function induces full table scan |
Previous Message | Dave Johansen | 2014-01-02 19:47:42 | Re: DATE_TRUNC() and GROUP BY? |