window function induces full table scan

From: Thomas Mayer <thomas(dot)mayer(at)student(dot)kit(dot)edu>
To: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: window function induces full table scan
Date: 2014-01-02 21:32:01
Message-ID: 52C5DAD1.20109@student.kit.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

When querying a view with a WHERE condition, postgresql normally is able
to perform an index scan which reduces time for evaluation dramatically.

However, if a window function is evaluated in the view, postgresql is
evaluating the window function before the WHERE condition is applied.
This induces a full table scan.

These are the results of EXPLAIN:
-- without window function (non-equivalent)
explain select * from without_window_function where user_id = 43;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Index Scan using idx_checkin_node_user_id on checkin_node
(cost=0.43..26.06 rows=2 width=20)
Index Cond: (user_id = 43)
Filter: (((id % 1000) + 1) = 1)

-- with window function
explain select * from last_position where user_id = 43;
QUERY PLAN
------------------------------------------------------------------------------------------
Subquery Scan on tmp_last_position (cost=973803.66..1151820.09 rows=2
width=20)
Filter: ((tmp_last_position.datepos = 1) AND
(tmp_last_position.user_id = 43))
-> WindowAgg (cost=973803.66..1080613.52 rows=4747105 width=32)
-> Sort (cost=973803.66..985671.42 rows=4747105 width=32)
Sort Key: checkin_node.user_id, checkin_node.date,
checkin_node.id
-> Seq Scan on checkin_node (cost=0.00..106647.05
rows=4747105 width=32)

To work around this, I avoid using a view for that (equivalent):
EXPLAIN SELECT user_id, latitude, longitude
FROM (
SELECT
user_id,
latitude,
longitude,
rank() OVER (PARTITION BY user_id ORDER BY date DESC, id
DESC) AS datepos
FROM checkin_node
WHERE user_id = 43
) AS tmp_last_position
WHERE datepos = 1; -- takes 2 ms
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Subquery Scan on tmp_last_position (cost=39.70..52.22 rows=2 width=20)
Filter: (tmp_last_position.datepos = 1)
-> WindowAgg (cost=39.70..47.40 rows=385 width=32)
-> Sort (cost=39.70..40.67 rows=385 width=32)
Sort Key: checkin_node.date, checkin_node.id
-> Index Scan using idx_checkin_node_user_id on
checkin_node (cost=0.43..23.17 rows=385 width=32)
Index Cond: (user_id = 43)

I would expect postgresql to apply this query plan also for the view
last_position. It's 6621ms vs. 2ms, so the speedup is 3310!

Is it a bug in the optimizer?

How to reproduce:
=================
OS: ubuntu 12.04
Postgresql v9.3.2

get some sample data:
wget -qO-
http://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz|gunzip
-c|dos2unix|awk '{ if (length($0) > 20) print }'>test.csv

execute psql script:

\timing on
BEGIN;
DROP TABLE IF EXISTS checkin_node CASCADE;
CREATE TABLE checkin_node (
id SERIAL NOT NULL PRIMARY KEY,
user_id INTEGER NOT NULL,
date TIMESTAMP NOT NULL,
latitude DOUBLE PRECISION NOT NULL,
longitude DOUBLE PRECISION NOT NULL,
original_id VARCHAR NOT NULL
);
\COPY checkin_node (user_id, date, latitude, longitude, original_id)
FROM 'test.csv' WITH DELIMITER E'\t';

ALTER TABLE checkin_node DROP COLUMN original_id;
CREATE INDEX idx_checkin_node_user_id ON checkin_node(user_id);
CREATE INDEX idx_checkin_node_date ON checkin_node(date);

COMMIT;

VACUUM ANALYZE checkin_node;

-- doing window function in a view

DROP VIEW IF EXISTS last_position CASCADE;
CREATE VIEW last_position (user_id, latitude, longitude) AS (
SELECT user_id, latitude, longitude
FROM (
SELECT
user_id,
latitude,
longitude,
rank() OVER (PARTITION BY user_id ORDER BY date DESC, id
DESC) AS datepos
FROM checkin_node
) AS tmp_last_position
WHERE datepos = 1
);

select * from last_position where user_id = 43; -- takes 6621ms

-- similar view but without window function (non-equivalent)

DROP VIEW IF EXISTS without_window_function CASCADE;
CREATE VIEW without_window_function (user_id, latitude, longitude) AS (
SELECT user_id, latitude, longitude
FROM (
SELECT
user_id,
latitude,
longitude,
(id % 1000)+1 AS datepos --to not use a constant here
FROM checkin_node
) AS tmp_last_position
WHERE datepos = 1
);
select * from without_window_function where user_id = 43; -- takes 10ms

-- workaround: avoid using views (equivalent)

SELECT user_id, latitude, longitude
FROM (
SELECT
user_id,
latitude,
longitude,
rank() OVER (PARTITION BY user_id ORDER BY date DESC, id DESC)
AS datepos
FROM checkin_node
WHERE user_id = 43
) AS tmp_last_position
WHERE datepos = 1; -- takes 2 ms

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2014-01-02 21:52:50 Re: window function induces full table scan
Previous Message Dave Johansen 2014-01-02 19:47:42 Re: DATE_TRUNC() and GROUP BY?