Varchar vs foreign key vs enumerator - table and index size

From: Łukasz Walkowski <lukasz(dot)walkowski(at)homplex(dot)pl>
To: pgsql-performance(at)postgresql(dot)org
Subject: Varchar vs foreign key vs enumerator - table and index size
Date: 2013-08-31 13:35:58
Message-ID: 3AAC8AFE-E908-4AF7-ACDE-F9461B4A62D1@homplex.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,
This is my first post on this group so welcome everyone! Currently I'm working on optimizing a quite simple database used to store events from one website. Every event is a set of data describing user behaviour. The main table that stores all events is built using schema:

Column | Type | Modifiers
-----------------+-----------------------------+-----------
id | bigint | not null
browser | character varying(255) |
created | timestamp without time zone |
eventsource | character varying(255) |
eventtype | character varying(255) |
ipaddress | character varying(255) |
objectid | bigint |
sessionid | character varying(255) |
shopids | integer[] |
source | character varying(255) |
sourceid | bigint |
supplierid | bigint |
cookieuuid | uuid |
serializeddata | bytea |
devicetype | character varying(255) |
operatingsystem | character varying(255) |

It was a quick project to play with EclipseLink, Hibernate and some Jersey Rest services, so isn't perfect. However the database became quite usefull and we decided to optimize this table as it grew quite large (128GB right now without indexes, about 630M records). There is only primary key index on this table. Here is the list of changes that I'd like to make to the table (some of them should be done from the scratch):

1. Changing ipaddress from varchar to inet - this should save some space and lower the size of potential index.

2. Changing id for some composite id with created contained in it.

3. And this part is most interesting for me. Columns browser, eventsource, eventtype, devicetype, operatingsystem contain a small pool of strings - for example for devicetype this is set to Computer, Mobile, Tablet or Unknown. Browser is set to normalized browser name. In every case I can store those data using one of 3 different methods:

- store as varchar as it is now - nice and easy, but index on those columns is quite big and I think storing many of similar strings is waste of space.

- store only id's and join external tables as needed, for example for browsers I only need smallint key, as there is a limited number of browsers. The column browser becomes smallint and we have additional table with two columns (id, browser varchar). This should save some space on event table, but if I want name of the browser in some report I need to join tables. Second thing - on every insert there is constraint that is checked for this field and this can affect performance. I was thinking about the same strategy for the remaining fields - this would give me 5 additional tables and 5 additional constraints on event table. Browser table will have about ~100 records, eventtype and eventsource will have about 8-12 records each, devicetype - 4 records, operatingsystem - didn't really check this one, but I think something around 100 like browser.

- introduce enumerator type for each of the column and store those values as enumerator. This one should be the most space efficient, but it will be problematic in case of changing column values like browser or operatingsystem as altering enumerator isn't that simple.

For browser average text length is 19 characters, for eventsource and eventtype eventsource average text lenght is 24 characters. Database encoding is set to UTF8.

My question is - what is estimated difference in table size between those 3 variants of storing columns? In theory third one should give me the smallest database and index size but is the most problematic from all of the above.

Lukasz Walkowski

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Kevin Grittner 2013-08-31 14:44:01 Re: How clustering for scale out works in PostgreSQL
Previous Message Tom Lane 2013-08-30 14:00:28 Re: Query plan change with multiple elements in IN clause