finding gaps in temporal data

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: finding gaps in temporal data
Date: 2011-06-15 17:23:51
Message-ID: BANLkTim=aT7_R+cwUUz2eAan+dytcU3jbg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

I have a fact table that looks like this:

dim1_fk bigint,
time_fk bigint,
tstamp timestamp without timezone
value numeric(16,2)

The tstamp column is completely redundant to a colume in the time_utc table,
but I like to leave it there for convenience when writing ad-hoc queries in
psql - it allows me to skip the join to a time dimension table. The fact
table is actually partitioned into 1-month child tables, with check
constraints on both time_fk and the tstamp column, since there is a 1:1
relationship between those columns.

dim1_fk refers to a dim1 table which has two varchar columns we care about:

dim1_pk bigint,
label1 varchar,
label2 varchar

the time_utc table contains the usual time dimension columns, and I've
structured the primary key to be an integer in the form YYYYMMDDHH24mm - so
2011-06-15 15:35 has a primary key of 201106151535 and there is a row every
5 minutes. All data in the fact table is assigned to a given 5 minute
window. There is a row in the time_utc table for every possible time value,
regardless of whether there is data in a fact table for that 5 minute
interval. For our purposes, we only need care about 2 columns

time_pk bigint,
tstamp timstamp without time zone

I'm looking to run a report which will show me any gaps in the data for any
label1/label2 pair that appears in the dim1 table - there are 0 or more rows
for each label1/label2 combination in each 5 minute window and I don't
actually care about the duplicates (all my queries aggregate multiple rows
for a given timestamp via the avg() function). I can find every missing row
in the fact table data by just cross joining dim1 and time_utc for a given
date range and then left joining between that and the fact table. Since
every possible time value appears in the time table and every possible
value1/value2 combination appears in the dim1 table, I will see NULLs in the
fact table columns for any row that is missing in the data set.

However, it would be far nicer to get a report which just lists gaps, with a
start and end, rather than one row for each missing 5 minute interval, even
if they are sequential. Is there any way to accomplish this? I'm running
8.4.x at the moment, but would be willing to upgrade to 9.0.x, if necessary.
If I can do it without the left outer join from time_utc to the fact table,
then so much the better.

A query to grab just the missing rows looks something like this:

SELECT t.tstamp,
d.label1,
d.label2
FROM dimensions.dim1 d
INNER JOIN dimensions.time_utc t
ON t.tstamp between '2011-06-01' and '2011-06-05'
LEFT OUTER JOIN facts.parent_fact f
ON f.time_fk = t.time_pk
AND f.dim1_fk = d.dim1_pk
GROUP BY 3,2,1
HAVING avg(value) IS NULL
ORDER BY 3,2,1

so all I really need to do is figure out how to combine sequential rows into
a single row with a start and end time. Rows are sequential if row1.tstamp
- row2.tstamp = 5 minutes

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Jasen Betts 2011-06-15 20:02:52 Re: Storage of Indian Language text in postgresql through PHP
Previous Message Oliveiros d'Azevedo Cristina 2011-06-15 16:49:53 Re: Merge overlapping time-periods