Combining count() and row_number() as window functions

From: Thomas Kellerer <spam_eater(at)gmx(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Combining count() and row_number() as window functions
Date: 2017-01-19 12:59:05
Message-ID: o5qd6k$2ni$2@blaine.gmane.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I was playing around with a query that essentially looked something like this:

select row_number() over (order by foo_date) as rn,
count(*) over () as total_count,
f.*
from foo f;

(The actual query limits the output based on the row_number() for pagination purposes, but for this question this turned out to be irrelevant)

I assumed that the count() wouldn't increase the runtime of the query as the result of the row_number() can be used to calculate that.

However it turns out that using a scalar subquery to calculate the count() is faster despite the duplicated Seq Scan on the table:

select row_number() over (order by foo_date) as rn,
(select count(*) from foo) as total_count,
f.*
from foo f

I used the following test setup:

create table foo
as
select i,
date '2000-01-01' + (random() * 17 * 365)::int as foo_date,
'Some Text '||i as data
from generate_series(1,500000) g(i);

explain (analyze, verbose, buffers)
select row_number() over (order by foo_date) as rn,
count(*) over () as total_count,
f.*
from foo f;

explain (analyze, verbose, buffers)
select row_number() over (order by foo_date) as rn,
(select count(*) from foo) as total_count,
f.*
from foo f;

This is the plan for the first query

WindowAgg (cost=44191.13..49344.89 rows=429480 width=56) (actual time=1295.152..1491.642 rows=500000 loops=1)
Output: (row_number() OVER (?)), count(*) OVER (?), i, foo_date, data
Buffers: shared hit=2048 read=1531 dirtied=1531, temp read=5781 written=5078
I/O Timings: read=26.022
-> WindowAgg (cost=44191.13..47841.71 rows=429480 width=48) (actual time=611.887..987.440 rows=500000 loops=1)
Output: foo_date, i, data, row_number() OVER (?)
Buffers: shared hit=2048 read=1531 dirtied=1531, temp read=2123 written=2123
I/O Timings: read=26.022
-> Sort (cost=44191.13..45264.83 rows=429480 width=40) (actual time=611.872..723.216 rows=500000 loops=1)
Output: foo_date, i, data
Sort Key: f.foo_date
Sort Method: external merge Disk: 16976kB
Buffers: shared hit=2048 read=1531 dirtied=1531, temp read=2123 written=2123
I/O Timings: read=26.022
-> Seq Scan on stuff.foo f (cost=0.00..4008.48 rows=429480 width=40) (actual time=0.064..158.561 rows=500000 loops=1)
Output: foo_date, i, data
Buffers: shared hit=2048 read=1531 dirtied=1531
I/O Timings: read=26.022
Planning time: 0.711 ms
Execution time: 1523.306 ms

and this is the plan for the second query:

WindowAgg (cost=49273.31..52923.89 rows=429480 width=56) (actual time=660.543..1036.534 rows=500000 loops=1)
Output: row_number() OVER (?), $0, f.i, f.foo_date, f.data
Buffers: shared hit=7158, temp read=2123 written=2123
InitPlan 1 (returns $0)
-> Aggregate (cost=5082.18..5082.18 rows=1 width=8) (actual time=105.307..105.307 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=3579
-> Seq Scan on stuff.foo (cost=0.00..4008.48 rows=429480 width=0) (actual time=0.041..54.075 rows=500000 loops=1)
Output: foo.i, foo.foo_date, foo.data
Buffers: shared hit=3579
-> Sort (cost=44191.13..45264.83 rows=429480 width=40) (actual time=555.216..663.021 rows=500000 loops=1)
Output: f.foo_date, f.i, f.data
Sort Key: f.foo_date
Sort Method: external merge Disk: 16976kB
Buffers: shared hit=3579, temp read=2123 written=2123
-> Seq Scan on stuff.foo f (cost=0.00..4008.48 rows=429480 width=40) (actual time=0.030..107.520 rows=500000 loops=1)
Output: f.foo_date, f.i, f.data
Buffers: shared hit=3579
Planning time: 0.134 ms
Execution time: 1065.572 ms

I uploaded both plans in case formatting breaks the above:

First query: https://explain.depesz.com/s/BT8y
Second query: https://explain.depesz.com/s/cbTm

The major contributor to the runtime is obviously the order by which is to be expected
But I am surprised that adding the count(*) in the first query adds additional work as from my perspective the count() could be "derived" from the row_count().

Is this a case of "just not implemented yet" or a case of "to expensive to optimize"?

This is on 9.6.1

Regards
Thomas

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Kenneth Marshall 2017-01-19 13:56:26 Re: migrate Sql Server database to PostgreSql
Previous Message PAWAN SHARMA 2017-01-19 11:58:44 Moving from 9.5 to 9.6