From: Roxanne Reid-Bennett <rox(at)tara-lu(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Approach to Data Summary and Analysis
Date: 2014-04-15 22:26:28
Message-ID: 534DB214.9030003@tara-lu.com
Lists: pgsql-general pgsql-performance

On 4/14/2014 12:27 PM, Robert DiFalco wrote:
> And so on for calls_connected, calls_completed, call_errors, etc.
>
> Occasionally I will want to know things like "When was the last time a
> user answered a call" or "How many times has a user been called".
> ...
> Sometimes I might want to get this data for a whole bunch of users.
> ...
> So the other option is to create a call_summary table that is updated
> with triggers.
> ...
> My only issue with a summary table is that I don't want a bunch of
> null fields.
> ...
> But over time all fields would eventually become non-NULL.
>
> So that leads me to a summary table for EACH call state. This approach
> has the down side that it creates a lot of tables and triggers. It has
> the upside of being pretty efficient without having to deal with NULL
> values. It's also pretty easy to reason about.
> ...
> So for my question -- is the choice between these a personal
> preference sort of thing or is there a right or wrong approach? Am I
> missing another approach that would be better? I'm okay with SQL but
> I'm not expert so I'm not sure if there is an accepted DESIGN PATTERN
> for this that I am missing.
>
There is no right or wrong - there is better, worse, best, and worst for
any specific scenario. In my experience, most people have the time/money
during design to reach a design that is 80% "better" than the
alternatives, and it then gets refined over time. And yes, personal
experience does play a part in how people interpret better/worse [aka
religion] ;)

I didn't see anybody ask these questions - and to identify "better",
they have to be asked.
1. How much data are you feeding into your system, and how fast?
        This directly affects your choices on distribution, parallel
processing... writes vs updates vs triggers for copying vs all reads
[and if on bare metal - potentially where you place your logs,
indexes, core lookup tables, etc.]
2. How much data are you reading out of your system, and how fast?
        You have given "simple" use cases (how many calls completed
within a time frame, or to a number).
        You have given very slightly more complex use cases (when was
the last time John answered a call).
        You have given a slightly bulkier processing question (how
many times have these users been called).
So...
        a) How many users executing read queries do you have?
        b) What is the expected load for simple queries (per
week/day/hour/minute - depending upon your resolution on speed)?
        c) What is the expected load for your mid-line complex queries?
        d) What is the "maximum" volume you expect a bulk query to go
after (like all users in the last 18 years, or this city's users in the
last day), and how frequently will that kind of query be executed? How
much tolerance for delay do your users have?
        e) Do you have any known, really complex queries that might bog
down the system?
        f) How much lag time can you afford between capture and reporting?

Answers to the above define your performance requirements - which
define the style of schema you need. Queries can be written to pull
data from any schema design - but how fast they perform, and how easily
they can be written, depend heavily on that design.
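
As a sketch of that trade-off - the table and column names here are my
invention, not your actual schema, so adjust as needed:

    -- "When did user 42 last answer a call?"

    -- Against a normalized event log: the answer has to be derived by
    -- aggregating (or index-scanning) every call event for that user.
    SELECT max(answered_at)
      FROM call_events
     WHERE user_id = 42;

    -- Against a per-user summary table: a single-row PK lookup.
    SELECT last_answered_at
      FROM call_summary
     WHERE user_id = 42;

Both answer the same question; which one is "better" depends entirely
on the read/write volumes asked about above.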

Chris and Vincent both targeted a balance between writes and reads -
which adequately answers 80-85% of the use cases out there. But you
didn't give us any of the above - so their recommendation (while very
likely valid) may not actually fit your case at all.

As to design patterns -
"Generally" a database schema is more normalized for an operational
system, because normalization results in fewer writes/updates and
lowers the risk of corruption if a failure takes place. It also
isolates updates for any specific value to one location, minimizing
internally caused data corruption.
Reporting systems are generally less normalized, because writes are
mostly one-time and reads are where the load occurs.
Sometimes you have to use data replication to get a system that
appropriately supports both.
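
To make that concrete, here is a minimal sketch of both shapes, with a
trigger keeping the reporting side current - one way of doing that
"replication", and roughly the trigger-maintained summary you
described. All names are hypothetical, and on a busy system the insert
path would need to handle the race where two sessions create the same
summary row at once:

    -- Normalized operational table: one row per call, written once.
    CREATE TABLE call_events (
        id          bigserial   PRIMARY KEY,
        user_id     bigint      NOT NULL,
        called_at   timestamptz NOT NULL DEFAULT now(),
        answered_at timestamptz             -- NULL until/unless answered
    );

    -- Denormalized reporting table: one row per user, updated in place.
    CREATE TABLE call_summary (
        user_id          bigint PRIMARY KEY,
        calls_placed     bigint NOT NULL DEFAULT 0,
        last_answered_at timestamptz        -- NULL = never answered (yet)
    );

    CREATE FUNCTION maintain_call_summary() RETURNS trigger AS $$
    BEGIN
        UPDATE call_summary
           SET calls_placed     = calls_placed + 1,
               -- greatest() ignores NULLs, so an unanswered call
               -- leaves last_answered_at alone
               last_answered_at = greatest(last_answered_at,
                                           NEW.answered_at)
         WHERE user_id = NEW.user_id;
        IF NOT FOUND THEN
            INSERT INTO call_summary
                        (user_id, calls_placed, last_answered_at)
            VALUES (NEW.user_id, 1, NEW.answered_at);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER call_summary_trg
        AFTER INSERT ON call_events
        FOR EACH ROW EXECUTE PROCEDURE maintain_call_summary();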

You have shown you are already aware of normalization.
If you aren't familiar with approaches to Data Warehousing, you can
review how it is accomplished
- see the blogs on kimballgroup DOT com; they cover a lot of high(er)-
level concepts with enough specificity to be of some direct use.
[That website also covers some ideas for "Big Data" which aren't
necessarily limited to RDBMSs.]

Specify your performance requirements, then figure out your schema design.

FWIW I don't understand your (or anyone else's) hesitancy about "lots
of" NULL values. They carry meaning in a number of different ways...
not the least of which is "you don't know (yet)" - which is knowledge
in and of itself.
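
A tiny illustration, using the hypothetical summary table sketched
above:

    -- last_answered_at IS NULL is itself information: this user has
    -- never answered a call, which is distinct from any actual
    -- timestamp and is trivially queryable.
    SELECT user_id
      FROM call_summary
     WHERE last_answered_at IS NULL;   -- users who have never answered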

Roxanne
