(apologies if this was sent twice)
Hello all. I have an academic project that I'm working on and, as I'm
relatively new to optimization techniques and database design, I'm
wondering if some of you can give me some pointers. Below is the schema
(to model baseball statistics), and I'm pretty much stuck with it at
this point. If anyone has any suggestions involving changing the
schema, I'd appreciate hearing them just for future reference.
There are several interesting queries that I've been constructing just
to get a feel for this schema, and given that some of these tables
contain 100,000-200,000 tuples, some queries are taking a good 5-10
seconds to execute. I'm just wondering if this is simply the fault of
my schema or are these queries poorly constructed? Here are some
queries I've been trying:
Here's a query for the top ten all time home run leaders:
select P.first_name, P.last_name, S.player_id, sum(B.hr) as hr,
(sum(B.h)::float / sum(B.ab)::float) ab
from statistics S, batting_stats B, players P
where S.id=B.id and S.player_id=P.id and B.ab>0
group by P.first_name, P.last_name, S.player_id
order by hr desc limit 10;
Select the 10 highest batting averages of all time where the batter had
600 or more at bats. Also gather the name, year, team, hits, at bats...
select (B.h::float / B.ab) as avg, B.h, B.ab, S.year, P.first_name,
P.last_name, T.city, T.name
from batting_stats B, statistics S, players P, Teams T
where B.ab > 600 and S.id=B.id and S.player_id=P.id and S.team_id=T.id
order by avg desc limit 10;
To find the names of the single season home run leaders, along with the
total number of home runs, the team name/city and the year:
select P.first_name, P.last_name, B.hr, T.name, T.city, S.year from
statistics S, batting_stats B, players P, teams T
where (S.id=B.id) and
(S.player_id=P.id) and
(B.hr>30) and
(T.id=S.team_id) limit 10;
You get the idea. These queries take a while. Is this just the way it
is or there things that can be done to optimize this?
One separate issue (the reason why the above examples are all about
batting statistics) I'm having is representing the innings pitched
statistic. The way it is often represented (and the way it is done in
this schema) is something like this 123.0 means exactly 123 innings
pitched, 123.1 means 123 1/3 innings, and 123.2 means 123 2/3 innings
pitched. I'm contemplating the best way to write a function that knows
how to sum these values accurately. Is this something that can be done
with PL/PGSQL or should I go straight to something like PLPERL?
Alternatively, I could research a way to represent fractions in the DB
and write a script to convert all values in this column. Any advice
here??
Thanks in advance for any thoughts, comments, suggestions.
-Jeff
create table leagues
(
-- create an integer id field for easier and efficient FK
referencing
id serial,
name varchar(50),
first_year integer not null
check (first_year > 1850 and first_year < 2010),
last_year integer
check (first_year > 1850 and first_year < 2010),
primary key(id),
constraint chronological_order
check (last_year >= first_year)
);
create table teams
(
id serial,
name varchar(50) not null,
city varchar(50),
first_year integer not null
check (first_year > 1850 and first_year < 2010),
last_year integer
check (first_year > 1850 and first_year < 2010),
alt_id integer,
league_id integer not null,
primary key(id),
foreign key(alt_id) references teams(id),
foreign key(league_id) references leagues(id)
on delete cascade,
constraint chronological_order
check (last_year >= first_year)
);
create table players
(
id serial,
first_name varchar(30),
last_name varchar(30) not null,
bats char(1) check (bats in ('L','R','S')),
throws char(1) check (throws in ('L','R')),
dob date,
primary key(id)
);
create table statistics
(
id serial,
year integer not null,
g integer,
player_id integer not null,
team_id integer not null,
foreign key(player_id) references players(id)
on delete cascade,
foreign key(team_id) references teams(id)
on delete cascade,
primary key(id)
);
create table managing_stats
(
id integer not null,
w integer,
l integer,
primary key(id),
foreign key(id) references statistics(id)
on delete cascade
);
create table fielding_stats
(
id integer not null,
pos char(5),
po integer,
a integer,
e integer,
dp integer,
primary key(id),
foreign key(id) references statistics(id)
on delete cascade
);
create table batting_stats
(
id integer not null,
ab integer,
r integer,
h integer,
doubles integer,
triples integer,
hr integer,
rbi integer,
sb integer,
cs integer,
bb integer,
so integer,
sh integer,
sf integer,
ibb integer,
hbp integer,
primary key(id),
foreign key(id) references statistics(id)
on delete cascade
);
create table pitching_stats
(
id integer not null,
w integer,
l integer,
gs integer,
cg integer,
sh integer,
sv integer,
ip numeric(5,1),
h integer,
er integer,
hr integer,
bb integer,
so integer,
primary key(id),
foreign key(id) references statistics(id)
on delete cascade
);