Netflix Prize data

From:	"Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To:	pg(at)mohawksoft(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Netflix Prize data
Date:	2006-10-04 20:43:42
Message-ID:	18350.24.91.171.78.1159994622.squirrel@mail.mohawksoft.com
Lists:	pgsql-hackers

markw(at)snoopy:~/netflix$ time psql netflix -c "select count(*) from ratings"
count
-----------
100480507
(1 row)

real 2m6.270s
user 0m0.004s
sys 0m0.005s

The one thing I notice is that it is REALLY slow. I know it is, in fact, 100
million records, but I don't think PostgreSQL is usually this slow.
I'm going to check on some other machines to see whether there is a problem
with my test machine or whether something is weird about PostgreSQL and large
numbers of rows.
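
As a rough sketch of what that timing implies, the row count and wall-clock time from the run above can be turned into a scan rate. The variable names are illustrative, and the figure says nothing about where the time is going (disk, CPU, or per-row overhead).

```python
# Figures copied from the psql run above.
rows = 100_480_507
elapsed_seconds = 2 * 60 + 6.270  # "real 2m6.270s"

# Average number of rows counted per second of wall-clock time.
rows_per_second = rows / elapsed_seconds
print(f"{rows_per_second:,.0f} rows/s")
```

That works out to roughly 0.8 million rows per second for the count(*).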

I tried to cluster the data along a particular index but had to cancel it
after 3 hours.

I'm using 8.1.4. The "rdate" field looks something like "2005-09-06". So
the raw data is 23 bytes, and the date string will probably be rounded up to
12 bytes, so that's 24 bytes per row of data. What is the overhead per
variable? Per row?
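
A minimal sketch of the size arithmetic, using only the 24-byte-per-row estimate and the row count above. The per-row overhead is left as a parameter because it is exactly the unknown being asked about; the default of 0 is a placeholder, not a PostgreSQL figure.

```python
ROWS = 100_480_507          # from the count(*) above
PAYLOAD_BYTES_PER_ROW = 24  # estimate from the paragraph above

def table_bytes(rows, payload_bytes, overhead_bytes=0):
    """Total bytes for `rows` rows; overhead_bytes is a hypothetical input."""
    return rows * (payload_bytes + overhead_bytes)

# Column data alone, before any per-row or per-variable overhead.
payload_only = table_bytes(ROWS, PAYLOAD_BYTES_PER_ROW)
print(f"{payload_only / 1e9:.2f} GB of column data before any overhead")
```

Plugging a measured overhead into `overhead_bytes` gives a first estimate of how much data a full scan has to read.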

Is there any advantage to using "varchar(10)" over "text"?

Responses

Re: Netflix Prize data at 2006-10-04 21:00:00 from Luke Lonergan
Re: Netflix Prize data at 2006-10-04 21:00:52 from Tom Lane
Re: Netflix Prize data at 2006-10-04 21:46:18 from Gregory Stark
Re: Netflix Prize data at 2006-10-04 22:34:52 from Greg Sabino Mullane
Re: Netflix Prize data at 2006-10-05 08:35:01 from Heikki Linnakangas