Re: Table Design question for gurus (without going to "NoSQL")...

From: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>
To: PG-General Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Table Design question for gurus (without going to "NoSQL")...
Date: 2011-11-20 15:12:59
Message-ID: CAFWfU=um3eF5vsSKQ=qUP9BbXAPMzcx-bmvWfTYNFhLNEX49KA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sun, Nov 20, 2011 at 9:33 PM, Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> wrote:
> Hi.
>
> Want to start another thread, loosely related to the performance
> problems thread I have going.
>
> Need some DB design guidance from the gurus here.
>
> My big table now has about 70 million rows, with the following columns:
>
>
>  alias           | character varying(35)
>  url             | text
>  modify_date     | timestamp without time zone
>  ip              | bigint
>
>
> For each IP address (user of my application) I want to have a unique
> URL. So I used to have a UNIQUE constraint on IP, URL. But the index
> based on this became huge, as some URLs are gigantic. so I introduced
> an md5 of the URL:
>
>
>  url_md5             | varchar(32)
>
>
> I now have two scenarios:
>
> 1. To have an index (unique?) on "(ip, url_md5)"
>
> 2. To not have an index on just the "ip". This way a query that tries
> to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
> still look only at the ip bit of the index, then refine it with the
> url_md5.
>
> The good thing about #2 is the size of index remains very small with
> only a bigint field (ip) being indexed.
>
> The bad thing about #2 is that each query of "...WHERE ip = 999 AND
> url_md5 = '<md5 here>'..."  will have to refine the indexed IP. If one
> IP address has put in a lot of URLs, then this becomes a bit slow. As
> is now happening, where I have users who have over 1 million URLs
> each!
>
> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will
> allow me to have a bigint column and save a lot hopefully in both
> storage space and speed?  (Some very useful points mentioned here:
> http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
> )
>
> 2. If I do go with the above scenario #1 of a joint index, is there
> any way I can save space and maintain speed? Partitioning etc are out
> of the question.
>
> With a growing web database, I am sure many people face this
> situation. Are nosql type databases the only sane solution to such
> massive volumes and throughput expectations (e.g., CouchDb's MemBase)?
>
> Many thanks for any ideas or pointers!
>

I thought of adding a bigserial (serial8) column instead of
varchar(32) for the md5. But postgresql tells me that:

--
ERROR: type "bigserial" does not exist
--

Why is this? Why can't I create a column with this "type"? Whats the
current syntax?

Thanks.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Gregg Jaskiewicz 2011-11-20 15:16:20 Re: Table Design question for gurus (without going to "NoSQL")...
Previous Message Rob Sargentg 2011-11-20 13:50:09 9.1.1 build failure : postgres link fails