From: | Mike Toews <mwtoews(at)sfu(dot)ca> |
---|---|
To: | Joel Nothman <jnothman(at)student(dot)usyd(dot)edu(dot)au> |
Cc: | David Wilson <david(dot)t(dot)wilson(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org |
Subject: | Re: Select ranges based on sequential breaks |
Date: | 2009-06-22 18:41:44 |
Message-ID: | 4A3FD068.30900@sfu.ca |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hi Joel,
Window functions appear to be the best solution for this style of
problem, and I'm looking forward to their applications. However, I'm
sticking with 8.3 for at least a year, so I'm not able to explore this
solution yet. For now, I can only post-process the output in a non-SQL
environment. I also need to do other fun stuff, like cumulative sums,
which is also challenging with SQL, but much easier and intuitive with R.
An excellent book that I recently stumbled on is /Joe Celko's SQL for
Smarties/ (recommended by someone on this list), which is a heavy read,
but has an amazing depth to ANSI SQL. Of particular interest is "Chapter
24: Regions, Runs, Gaps, Sequences, and Series". I'm slowly digesting this.
-Mike
Joel Nothman wrote:
> Hi Mike,
>
> I happened upon your query, which is related to some stuff I've been
> playing with.
>
> Firstly, David's solution below doesn't work. I haven't yet tried to work
> out why.
>
> Secondly, I was hoping to be able to solve your problem nicely with
> Postgres 8.4's window functions [1,2], which can provide access to
> data in sequentially-related rows.
>
> Given the following setup:
> CREATE TABLE foo (dt date, bin varchar(4));
> INSERT INTO foo VALUES ('2009-01-01', 'red'), ('2009-01-02', 'red'),
> ('2009-01-03', 'blue'), ('2009-01-04', 'blue'), ('2009-01-05', 'blue'),
> ('2009-01-06', 'red'), ('2009-01-07', 'blue'), ('2009-01-08', 'blue'),
> ('2009-01-09', 'red'), ('2009-01-10', 'red');
>
> I had hoped the following would suffice:
> SELECT first_value(dt) OVER w, last_value(dt) OVER w, bin FROM foo WINDOW
> w AS (ORDER BY dt PARTITION BY bin);
>
> Apparently, this is bad syntax. ORDER BY cannot precede PARTITION BY in a
> WINDOW declaration, and yet I wanted a grouping of date-consecutive bins,
> which (PARTITION BY bin ORDER BY dt) would not give me.
>
> I was able to produce the required result with:
>
> SELECT MIN(dt) AS first, MAX(dt) AS last, MAX(bin) AS bin
> FROM (SELECT dt, bin,
> CASE WHEN newbin IS NULL
> THEN 0
> ELSE SUM(newbin) OVER (ORDER BY dt)
> END AS binno
> FROM (SELECT *,
> (bin !=3D lag(bin, 1) OVER (ORDER BY dt))::int A=
> S
> newbin
> FROM foo
> ) AS newbins
> ) AS binnos
> GROUP BY binno
> ORDER BY first;
>
> This relies on a middle step in which I create an enumeration of the bins
> in sequence:
> SELECT dt, bin, CASE WHEN newbin IS NULL THEN 0 ELSE sum(newbin) OVER
> (ORDER BY dt) END AS binno FROM (SELECT *, (bin !=3D lag(bin, 1) OVER (ORDE=
> R
> BY dt))::int AS newbin FROM foo) AS newbins;
> dt | bin | binno
> ------------+------+-------
> 2009-01-01 | red | 0
> 2009-01-02 | red | 0
> 2009-01-03 | blue | 1
> 2009-01-04 | blue | 1
> 2009-01-05 | blue | 1
> 2009-01-06 | red | 2
> 2009-01-07 | blue | 3
> 2009-01-08 | blue | 3
> 2009-01-09 | red | 4
> 2009-01-10 | red | 4
>
> I would hope there is a neater way to do this with window functions.
>
> The best way to solve your problem may be with PL/SQL, which is also good
> at dealing with sequences (it's not as complicated as it looks!):
>
> CREATE TYPE bindates AS (first date, last date, bin varchar(5));
> CREATE OR REPLACE FUNCTION bindates()
> RETURNS SETOF bindates
> AS $$
> DECLARE
> row record;
> res bindates;
> BEGIN
> FOR row IN SELECT * FROM foo ORDER BY dt
> LOOP
> IF res.bin IS NULL OR res.bin !=3D row.bin THEN
> IF res.bin IS NOT NULL THEN
> RETURN NEXT res;
> END IF;
> res.bin :=3D row.bin;
> res.first :=3D row.dt;
> END IF;
> res.last :=3D row.dt;
> END LOOP;
> IF res.bin IS NOT NULL THEN
> RETURN NEXT res;
> END IF;
> END;
> $$ LANGUAGE plpgsql;
>
>
> Finally, I'll try to sort out David's solution. Once we correct his typo
> (t1.order -> t1.date) and add an 'ORDER BY first' to the end, we get:
> first | last | bin
> ------------+------------+------
> 2009-01-03 | 2009-01-05 | blue
> 2009-01-06 | 2009-01-06 | red
> 2009-01-07 | 2009-01-08 | blue
> 2009-01-09 | | red
>
> This includes correct output, but it fails on both edge cases. The
> non-appearance of the first row is due to the WHERE clause on the main
> SELECT statement:
> WHERE (SELECT bin
> FROM foo t2
> WHERE t2.dt < t1.dt
> ORDER BY dt DESC LIMIT 1) <> t1.bin
>
> If we drop this WHERE clause, we get:
> first | last | bin
> ------------+------------+------
> 2009-01-01 | 2009-01-02 | red
> 2009-01-02 | 2009-01-02 | red
> 2009-01-03 | 2009-01-05 | blue
> 2009-01-04 | 2009-01-05 | blue
> 2009-01-05 | 2009-01-05 | blue
> 2009-01-06 | 2009-01-06 | red
> 2009-01-07 | 2009-01-08 | blue
> 2009-01-08 | 2009-01-08 | blue
> 2009-01-09 | | red
> 2009-01-10 | | red
>
> We can therefore get the result including the first row by selecting from
> this table with 'GROUP BY last, bin'.
>
> And we can hack in a value for those final NULLs as a special case. The
> following statement works:
>
> SELECT MIN(first),
> CASE WHEN last IS NULL THEN (SELECT MAX(dt) FROM foo) ELSE last
> END,
> bin
> FROM (
> SELECT dt AS first,
> (SELECT dt
> FROM foo t3
> WHERE t3.dt < (
> SELECT dt
> FROM foo t5
> WHERE t5.dt > t1.dt
> AND t5.bin <> t1.bin
> ORDER BY dt ASC
> LIMIT 1)
> ORDER BY dt DESC
> LIMIT 1) AS last,
> bin
> FROM foo t1
> ) t0
> GROUP BY last, bin
> ORDER BY last;
>
>
> Finally, what's efficient? With 1,000,000 random rows, we get:
> Enumeration: 13s
> PL/SQL: 12s
> Modified David: minutes.
>
> [I used the following to create test data:
> CREATE OR REPLACE FUNCTION make_random(n int) RETURNS SETOF foo AS $$
> import random
> for i in xrange(n):
> m =3D random.randint(1,12)
> d =3D random.randint(1,28)
> b =3D random.choice(('red', 'blue'))
> yield '2009-%d-%d' % (m, d), b
> $$ LANGUAGE plpythonu;
> DELETE FROM foo; INSERT INTO foo (SELECT * FROM make_random(1000000));]
>
> I hope that helps you in considering various approaches to the problem.
>
> - Joel
>
>
> [1] http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html
> [2] http://developer.postgresql.org/pgdocs/postgres/functions-window.html
>
>
> On Tue, 16 Jun 2009 06:06:16 +1000, David Wilson
> <david(dot)t(dot)wilson(at)gmail(dot)com> wrote:
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Scott Marlowe | 2009-06-22 18:56:58 | Re: Select ranges based on sequential breaks |
Previous Message | John DeSoi | 2009-06-22 18:40:38 | Re: Information about columns |