extract(year from date) doesn't use index but maybe could?

From: Jon Dufresne <jon(dot)dufresne(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: extract(year from date) doesn't use index but maybe could?
Date: 2015-04-19 17:16:29
Message-ID: CADhq2b51G0QrjNz_vH6e_SLyu=tR0A0G0zHhFsP7rqRr+6FA3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Given the table:

CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)

With an *index* on field d. The following two queries are functionally
equivalent:

1. SELECT * FROM dates WHERE d >= '1900-01-01'
2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900'

By functionally equivalent, they will return the same result set.

Query 2 does not use the index, adding a performance cost. It seems
there is an opportunity for optimization to handle these two queries
equivalently to take advantage of the index.

Some database abstraction layers have attempted to workaround this
limitation by rewriting EXTRACT(year ...) queries into a query more
like query 1. For example: Django's ORM does exctly this. Rather than
all abstraction layers trying to optimize this case, maybe it could be
pushed to the database layer.

I have written a test script that demonstrates that these functionally
equivalent queries have different performance characteristics. The
script and results are provide below:

RESULTS:

----
EXPLAIN SELECT * FROM dates WHERE d >= '1900-01-01'
QUERY PLAN
----------------------------------------------------------------------------
Bitmap Heap Scan on dates (cost=9819.23..26390.15 rows=524233 width=40)
Recheck Cond: (d >= '1900-01-01'::date)
-> Bitmap Index Scan on d_idx (cost=0.00..9688.17 rows=524233 width=0)
Index Cond: (d >= '1900-01-01'::date)
(4 rows)

EXPLAIN SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on dates (cost=0.00..37540.25 rows=524233 width=40)
Filter: (date_part('year'::text, (d)::timestamp without time zone)
>= 1900::double precision)
(2 rows)

Timing
select_without_extract: 284.233350s
select_with_extract: 323.106491s
----

SCRIPT:

----
#!/usr/bin/python3

import datetime
import subprocess
import random
import timeit
import sys

subprocess.check_call(['psql', 'postgres', '-c', 'DROP DATABASE IF
EXISTS datetest'], stdout=subprocess.DEVNULL)
subprocess.check_call(['psql', 'postgres', '-c', 'CREATE DATABASE
datetest'], stdout=subprocess.DEVNULL)
subprocess.check_call(['psql', 'datetest', '-c', 'CREATE TABLE dates
(id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)'],
stdout=subprocess.DEVNULL)

def chunks(n, l):
i = 0
while i < len(l):
yield l[i:i+n]
i += n

d = datetime.date(1800, 1, 1)
today = datetime.date.today()
values = []
while d < today:
values.extend('(\'%s\', \'%s\')' % (d, d) for i in range(20))
d += datetime.timedelta(days=1)
random.shuffle(values)
for chunk in chunks(1000, values):
s = ','.join(chunk)
subprocess.check_call(['psql', 'datetest', '-c', 'INSERT INTO
dates (d, t) VALUES %s' % s], stdout=subprocess.DEVNULL)

subprocess.check_call(['psql', 'datetest', '-c', 'CREATE INDEX d_idx
ON dates (d)'], stdout=subprocess.DEVNULL)
print('EXPLAIN SELECT * FROM dates WHERE d >= \'1900-01-01\'')
sys.stdout.flush()
subprocess.check_call(['psql', 'datetest', '-c', 'EXPLAIN SELECT *
FROM dates WHERE d >= \'1900-01-01\''])
print('EXPLAIN SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900')
sys.stdout.flush()
subprocess.check_call(['psql', 'datetest', '-c', 'EXPLAIN SELECT *
FROM dates WHERE EXTRACT(year from d) >= 1900'])

def select_without_extract():
subprocess.check_call(['psql', 'datetest', '-c', 'SELECT * FROM
dates WHERE d >= \'1900-01-01\''], stdout=subprocess.DEVNULL)

def select_with_extract():
subprocess.check_call(['psql', 'datetest', '-c', 'SELECT * FROM
dates WHERE EXTRACT(year from d) >= 1900'], stdout=subprocess.DEVNULL)

print('Timing')
sys.stdout.flush()

v = timeit.timeit('select_without_extract()', setup='from __main__
import select_without_extract', number=100)
print('select_without_extract: %fs' % v)
sys.stdout.flush()

v = timeit.timeit('select_with_extract()', setup='from __main__ import
select_with_extract', number=100)
print('select_with_extract: %fs' % v)
sys.stdout.flush()
---

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tomas Vondra 2015-04-19 17:42:12 Re: extract(year from date) doesn't use index but maybe could?
Previous Message Christian Gough 2015-04-17 18:25:15 Postgresql Host Swapping Hard With Abundant Free Memory