Re: finding gaps in temporal data

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: finding gaps in temporal data
Date: 2011-06-15 20:52:27
Message-ID: BANLkTi=vD=MgfDjCaN7TrThbE3=1-KHXEw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On Wed, Jun 15, 2011 at 10:23 AM, Samuel Gendler
<sgendler(at)ideasculptor(dot)com>wrote:

> I have a fact table that looks like this:
>
> dim1_fk bigint,
> time_fk bigint,
> tstamp timestamp without timezone
> value numeric(16,2)
>
> The tstamp column is completely redundant to a colume in the time_utc
> table, but I like to leave it there for convenience when writing ad-hoc
> queries in psql - it allows me to skip the join to a time dimension table.
> The fact table is actually partitioned into 1-month child tables, with
> check constraints on both time_fk and the tstamp column, since there is a
> 1:1 relationship between those columns.
>
> dim1_fk refers to a dim1 table which has two varchar columns we care about:
>
> dim1_pk bigint,
> label1 varchar,
> label2 varchar
>
> the time_utc table contains the usual time dimension columns, and I've
> structured the primary key to be an integer in the form YYYYMMDDHH24mm - so
> 2011-06-15 15:35 has a primary key of 201106151535 and there is a row every
> 5 minutes. All data in the fact table is assigned to a given 5 minute
> window. There is a row in the time_utc table for every possible time value,
> regardless of whether there is data in a fact table for that 5 minute
> interval. For our purposes, we only need care about 2 columns
>
> time_pk bigint,
> tstamp timstamp without time zone
>
> I'm looking to run a report which will show me any gaps in the data for any
> label1/label2 pair that appears in the dim1 table - there are 0 or more rows
> for each label1/label2 combination in each 5 minute window and I don't
> actually care about the duplicates (all my queries aggregate multiple rows
> for a given timestamp via the avg() function).
>

OK, I figured this one out on my own. It looks like this (plus a union all
to a query to show entries from dim1 for which there is no data at all):

SELECT d2.label1,
d2.label2,
tstamp - gap as start_time,
tstamp - '5 minute'::interval as end_time
FROM (
SELECT d.dim1_pk,
t.tstamp,
t.tstamp - lag(t.tstamp,1, '2011-05-15 00:00:00') OVER w AS gap
FROM
dim1 d
JOIN facts.fact_tbl f
ON f.dim1_fk = d.dim1_pk
JOIN time_utc t
ON t.time_pk = f.time_fk
WHERE f.tstamp between '2011-05-15 00:00:00' and '2011-06-03 23:55:00'
GROUP BY 1,2
WINDOW w AS (PARTITION BY d.dim1_pk
ORDER BY d.dim1_pk)
ORDER BY 1,2
) AS q JOIN dim1 d2 ON d2.dim1_pk = q.dim1_pk
WHERE q.gap > '5 minute'::interval
;

That subtracts the tstamp of the previous row from the tstamp of the current
row, within a window defined on individual dim1_pk values. The outer query
then selects only rows where the gap is greater than 5 minutes, since
sequential values will show a 5 minute interval. It also joins to dim1
again in order to pull out the label1 and label2 values, since including
those in the window instead of dim1_pk resulted in a much slower query,
presumably because I don't have indexes on the label columns.

This avoids all of the cross join and left join craziness I was doing, which
is useful when attempting to plug gaps with default values, but a pain in
the neck when just attempting to determine where the gaps are and how large
they are.

window functions are a seriously useful tool!

--sam

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Clem Dickey 2011-06-15 21:50:09 "<query expression body> ::= <joined table>" grammar rule not accepted by Postgres
Previous Message Jasen Betts 2011-06-15 20:02:52 Re: Storage of Indian Language text in postgresql through PHP