Re: PGSQL 9.3 - billion rows

From: Felipe Santos <felipepts(at)gmail(dot)com>
To: Nicolas Paris <niparisco(at)gmail(dot)com>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: PGSQL 9.3 - billion rows
Date: 2014-07-07 14:27:51
Message-ID: CAPYcRiUtY4pDx2m-sS0prgS2DKQut2c10Kci86TfxMJgL_g2FQ@mail.gmail.com
Lists: pgsql-performance

Hi Nicolas,

I do believe PostgreSQL can handle that.

I've worked with tables that grow by 2 million rows per day, which gives an
average of roughly 700 million rows per year (2 million × 365 ≈ 730 million).

It's hard to say how much hardware power you will need, but I would test it
with a server in the cloud, since cloud servers are usually easy to resize
to your needs (both up and down).

Besides that, take a look at this link to fine-tune your settings:
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
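If it helps, here is a rough sketch of the kind of postgresql.conf changes
that guide walks through. The numbers below are illustrative assumptions for
a dedicated 9.3 server with 32 GB of RAM, not recommendations for your
workload:

# postgresql.conf -- illustrative starting points for a dedicated 32 GB box
shared_buffers = 8GB            # ~25% of RAM is a common starting point
effective_cache_size = 24GB     # planner hint: how much the OS can cache
work_mem = 64MB                 # per sort/hash node, so keep it modest
maintenance_work_mem = 1GB      # speeds up CREATE INDEX and VACUUM
checkpoint_segments = 64        # fewer, bigger checkpoints for bulk loads (pre-9.5)
wal_buffers = 16MB

Benchmark with your real queries before and after any change.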

My final words are about the table itself. I used to create partitions for
such large tables. The partitions were by day (I had a "created_date"
column), because that was the filter most often used by the people querying
the table. Partitioning makes PostgreSQL look at only the subset of data
being queried, thus improving query performance.

If you can do that, do it. But be sure you are partitioning on the right
column. Partitions that don't match the most common query filters may impact
query performance negatively.
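For reference, PostgreSQL 9.3 has no declarative partitioning (that only
arrived in version 10), so partitions are built by hand with inheritance,
CHECK constraints, and an insert trigger. Here is a minimal sketch, assuming
a hypothetical parent table "measurements" partitioned by a "created_date"
column (the names are made up for illustration):

-- Parent table; it stays empty, the children hold the rows.
CREATE TABLE measurements (
    id           serial,
    created_date date NOT NULL,
    payload      text
);

-- One child per day; the CHECK constraint lets the planner skip
-- partitions when constraint_exclusion is enabled.
CREATE TABLE measurements_2014_07_07 (
    CHECK (created_date = DATE '2014-07-07')
) INHERITS (measurements);

-- Route inserts on the parent to the matching child.
CREATE OR REPLACE FUNCTION measurements_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.created_date = DATE '2014-07-07' THEN
        INSERT INTO measurements_2014_07_07 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for %', NEW.created_date;
    END IF;
    RETURN NULL;  -- the row is already stored in the child
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurements_route
BEFORE INSERT ON measurements
FOR EACH ROW EXECUTE PROCEDURE measurements_insert();

With constraint_exclusion = partition (the default), a query filtered on
created_date will scan only the matching children; you can verify that
with EXPLAIN.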

Good luck!

2014-07-07 10:59 GMT-03:00 Nicolas Paris <niparisco(at)gmail(dot)com>:

> Hello,
>
> I have a fact table (table and indexes are below) that will probably
> get around 2 billion rows.
>
> - Can PostgreSQL support such a table (this table is the fact table of a
> data mart -> many join queries with dimension tables)?
> - If yes, I would like to test it (say, insert 2 billion test rows). What
> server configuration do I need? How much RAM?
> - If not, would it be better to think about a cluster or something else?
> - (Do you have any ideas to optimize this table?)
>
> Thanks a lot !
>
>
> CREATE TABLE observation_fact
> (
>   encounter_num integer NOT NULL,
>   patient_num integer NOT NULL,
>   concept_cd character varying(50) NOT NULL,
>   provider_id character varying(50) NOT NULL,
>   start_date timestamp without time zone NOT NULL,
>   modifier_cd character varying(100) NOT NULL DEFAULT '@'::character varying,
>   instance_num integer NOT NULL DEFAULT 1,
>   valtype_cd character varying(50),
>   tval_char character varying(255),
>   nval_num numeric(18,5),
>   valueflag_cd character varying(50),
>   quantity_num numeric(18,5),
>   units_cd character varying(50),
>   end_date timestamp without time zone,
>   location_cd character varying(50),
>   observation_blob text,
>   confidence_num numeric(18,5),
>   update_date timestamp without time zone,
>   download_date timestamp without time zone,
>   import_date timestamp without time zone,
>   sourcesystem_cd character varying(50),
>   upload_id integer,
>   text_search_index serial NOT NULL,
>   CONSTRAINT observation_fact_pk PRIMARY KEY (patient_num, concept_cd,
>     modifier_cd, start_date, encounter_num, instance_num, provider_id)
> )
> WITH (
>   OIDS=FALSE
> );
>
>
> CREATE INDEX of_idx_allobservation_fact
> ON i2b2databeta.observation_fact
> USING btree
> (patient_num, encounter_num, concept_cd COLLATE pg_catalog."default",
>  start_date, provider_id COLLATE pg_catalog."default",
>  modifier_cd COLLATE pg_catalog."default", instance_num,
>  valtype_cd COLLATE pg_catalog."default",
>  tval_char COLLATE pg_catalog."default", nval_num,
>  valueflag_cd COLLATE pg_catalog."default", quantity_num,
>  units_cd COLLATE pg_catalog."default", end_date,
>  location_cd COLLATE pg_catalog."default", confidence_num);
>
>
> CREATE INDEX of_idx_clusteredconcept
> ON i2b2databeta.observation_fact
> USING btree
> (concept_cd COLLATE pg_catalog."default");
>
>
> CREATE INDEX of_idx_encounter_patient
> ON i2b2databeta.observation_fact
> USING btree
> (encounter_num, patient_num, instance_num);
>
>
> CREATE INDEX of_idx_modifier
> ON i2b2databeta.observation_fact
> USING btree
> (modifier_cd COLLATE pg_catalog."default");
>
> CREATE INDEX of_idx_sourcesystem_cd
> ON i2b2databeta.observation_fact
> USING btree
> (sourcesystem_cd COLLATE pg_catalog."default");
>
>
> CREATE INDEX of_idx_start_date
> ON i2b2databeta.observation_fact
> USING btree
> (start_date, patient_num);
>
>
> CREATE INDEX of_idx_uploadid
> ON i2b2databeta.observation_fact
> USING btree
> (upload_id);
>
>
> CREATE UNIQUE INDEX of_text_search_unique
> ON i2b2databeta.observation_fact
> USING btree
> (text_search_index);
