Re: Approach to Data Summary and Analysis

From: Chris Curvey <chris(at)chriscurvey(dot)com>
To: Robert DiFalco <robert(dot)difalco(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Approach to Data Summary and Analysis
Date: 2014-04-15 15:12:55
Message-ID: CADfwSsD1w-p3SH=woCPrcbrzkJ0mzJmSzfT61JSCaTk9y-jL0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-performance

On Tue, Apr 15, 2014 at 10:56 AM, Chris Curvey <chris(at)chriscurvey(dot)com>wrote:

> On Mon, Apr 14, 2014 at 12:27 PM, Robert DiFalco <robert(dot)difalco(at)gmail(dot)com
> > wrote:
>
>> I have several related tables that represent a call state. Let's think of
>> these as phone calls to simplify things. Sometimes I need to determine the
>> last time a user was called, the last time a user answered a call, or the
>> last time a user completed a call.
>>
>> The basic schema is something like this:
>>
>> CREATE TABLE calls (
>> id BIGINT NOT NULL, // sequence generator
>> user_id BIGINT NOT NULL,
>> called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
>>
>> PRIMARY KEY (id),
>> FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
>> );
>>
>> CREATE TABLE calls_answered (
>> id BIGINT NOT NULL,
>> answered TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
>>
>> PRIMARY KEY (id),
>> FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE
>> );
>>
>>
>> And so on for calls_connected, calls_completed, call_errors, etc.
>>
>> Occasionally I will want to know things like "When was the last time a
>> user answered a call" or "How many times has a user been called".
>>
>> I can do these queries using a combination of MAX or COUNT. But I'm
>> concerned about the performance.
>>
>> SELECT MAX(a.id)
>> FROM calls_answered a JOIN calls c ON c.id = a.id
>> WHERE c.user_id = ?;
>>
>>
>> Or the number of answered calls:
>>
>> SELECT MAX(a.id)
>> FROM calls_answered a JOIN calls c ON c.id = a.id
>> WHERE c.user_id = ?;
>>
>>
>> Sometimes I might want to get this data for a whole bunch of users. For
>> example, "give me all users whose have not answered a call in the last 5
>> days." Or even "what percentage of users called actually answered a call."
>> This approach could become a performance issue. So the other option is to
>> create a call_summary table that is updated with triggers.
>>
>> The summary table would need fields like "user_id", "last_call_id",
>> "call_count", "last_answered_id", "answered_count", "last_completed_id",
>> "last_completed_count", etc.
>>
>> My only issue with a summary table is that I don't want a bunch of null
>> fields. For example, if the user was *called* but they have never
>> *answered* at call then the last_call_id and call_count fields on the
>> summary table would be non-NULL but the last_answer_id and answer_count
>> fields WOULD be NULL. But over time all fields would eventually become
>> non-NULL.
>>
>> So that leads me to a summary table for EACH call state. Each summary
>> table would have a user id, a ref_id, and a count -- one summary table for
>> each state e.g. call_summary, call_answered_summary, etc.
>>
>> This approach has the down side that it creates a lot of tables and
>> triggers. It has the upside of being pretty efficient without having to
>> deal with NULL values. It's also pretty easy to reason about.
>>
>> So for my question -- is the choice between these a personal preference
>> sort of thing or is there a right or wrong approach? Am I missing another
>> approach that would be better? I'm okay with SQL but I'm not expert so I'm
>> not sure if there is an accepted DESIGN PATTERN for this that I am missing.
>>
>> Thanks!
>>
>>
>>
> (Sorry, fat-fingered and hit "send too early"...)

CREATE TABLE calls (
id BIGINT NOT NULL, // sequence generator
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
answered TIMESTAMPTZ NULL,
connected TIMESTAMPTZ NULL,
completed TIMESTAMPTZ NULL,

PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

Then your queries end up looking like this:

--last time john answered
SELECT MAX(a.id)
FROM calls
where answered is not null
and user_id = ?

-- users that have not answered a call in the last five days (I can think
of a few ways to interpret that phrase)
select myusers.*
from myusers
where not exists
( select *
from calls
where calls.user_id = myusers.user_id
and answered >= <five days ago>)

-- average ring time
select avg(extract ('seconds' from called - answered))
where answered is not null

--
I asked the Internet how to train my cat, and the Internet told me to get a
dog.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Vincent Veyron 2014-04-15 15:41:11 Re: Approach to Data Summary and Analysis
Previous Message Chris Curvey 2014-04-15 14:56:00 Re: [GENERAL] Approach to Data Summary and Analysis

Browse pgsql-performance by date

  From Date Subject
Next Message Robert DiFalco 2014-04-15 15:53:04 Re: [GENERAL] Approach to Data Summary and Analysis
Previous Message Chris Curvey 2014-04-15 14:56:00 Re: [GENERAL] Approach to Data Summary and Analysis