Quick Links

Approach to Data Summary and Analysis

From:	Robert DiFalco <robert(dot)difalco(at)gmail(dot)com>
To:	"pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	Approach to Data Summary and Analysis
Date:	2014-04-14 16:27:29
Message-ID:	CAAXGW-zzACjHfREYEo4FBmJXLHD7d=nMH-i4WeSi_DTWBLzezQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general pgsql-performance

I have several related tables that represent a call state. Let's think of
these as phone calls to simplify things. Sometimes I need to determine the
last time a user was called, the last time a user answered a call, or the
last time a user completed a call.

The basic schema is something like this:

CREATE TABLE calls (
id BIGINT NOT NULL, // sequence generator
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

CREATE TABLE calls_answered (
id BIGINT NOT NULL,
answered TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

PRIMARY KEY (id),
FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE
);

And so on for calls_connected, calls_completed, call_errors, etc.

Occasionally I will want to know things like "When was the last time a user
answered a call" or "How many times has a user been called".

I can do these queries using a combination of MAX or COUNT. But I'm
concerned about the performance.

SELECT MAX(a.id)
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user_id = ?;

Or the number of answered calls:

SELECT MAX(a.id)
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user_id = ?;

Sometimes I might want to get this data for a whole bunch of users. For
example, "give me all users whose have not answered a call in the last 5
days." Or even "what percentage of users called actually answered a call."
This approach could become a performance issue. So the other option is to
create a call_summary table that is updated with triggers.

The summary table would need fields like "user_id", "last_call_id",
"call_count", "last_answered_id", "answered_count", "last_completed_id",
"last_completed_count", etc.

My only issue with a summary table is that I don't want a bunch of null
fields. For example, if the user was *called* but they have never *answered* at
call then the last_call_id and call_count fields on the summary table would
be non-NULL but the last_answer_id and answer_count fields WOULD be NULL.
But over time all fields would eventually become non-NULL.

So that leads me to a summary table for EACH call state. Each summary table
would have a user id, a ref_id, and a count -- one summary table for each
state e.g. call_summary, call_answered_summary, etc.

This approach has the down side that it creates a lot of tables and
triggers. It has the upside of being pretty efficient without having to
deal with NULL values. It's also pretty easy to reason about.

So for my question -- is the choice between these a personal preference
sort of thing or is there a right or wrong approach? Am I missing another
approach that would be better? I'm okay with SQL but I'm not expert so I'm
not sure if there is an accepted DESIGN PATTERN for this that I am missing.

Thanks!

Responses

Re: Approach to Data Summary and Analysis at 2014-04-14 21:35:53 from Vincent Veyron
Re: [GENERAL] Approach to Data Summary and Analysis at 2014-04-15 14:56:00 from Chris Curvey
Re: Approach to Data Summary and Analysis at 2014-04-15 22:26:28 from Roxanne Reid-Bennett

Browse pgsql-general by date

	From	Date	Subject
Next Message	Andreas Joseph Krogh	2014-04-14 19:21:27	Querying all documents for a company and its projects etc
Previous Message	chiru r	2014-04-14 14:56:36	Re: lpgport issue while installing pg_bulkload utility on fedora 14 OS

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Borodin Vladimir	2014-04-14 16:42:08	Re: Checkpoint distribution
Previous Message	Kevin Grittner	2014-04-14 15:54:03	Re: SSI slows down over time