Re: Millions of tables

From: Greg Spiegelberg <gspiegelberg(at)gmail(dot)com>
To: "pgsql-performa(dot)" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Millions of tables
Date: 2016-09-26 14:24:30
Message-ID: CAEtnbpUu1EiOAA4Awt4Ti8XxJNZG+inOfwv3wkqw2Z-j1o147A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Sun, Sep 25, 2016 at 8:50 PM, Greg Spiegelberg <gspiegelberg(at)gmail(dot)com>
wrote:

> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time
> has said not to have millions of tables. I too have long believed it until
> recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for
> PGDATA. Over the weekend, I created 8M tables with 16M indexes on those
> tables. Table creation initially took 0.018031 secs, average 0.027467 and
> after tossing out outliers (qty 5) the maximum creation time found was
> 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds.
> Tables were created by a single process. Do note that table creation is
> done via plpgsql function as there are other housekeeping tasks necessary
> though minimal.
>
> No system tuning but here is a list of PostgreSQL knobs and switches:
> shared_buffers = 2GB
> work_mem = 48 MB
> max_stack_depth = 4 MB
> synchronous_commit = off
> effective_cache_size = 200 GB
> pg_xlog is on it's own file system
>
> There are some still obvious problems. General DBA functions such as
> VACUUM and ANALYZE should not be done. Each will run forever and cause
> much grief. Backups are problematic in the traditional pg_dump and PITR
> space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing
> it in my test case) are no-no's. A system or database crash could take
> potentially hours to days to recover. There are likely other issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?" I looked at
> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it,
> it's antiquated and don't get me started on "Hadoop". I looked at many
> others and ultimately the recommended use of each vendor was to have one
> table for all data. That overcomes the millions of tables problem, right?
>
> Problem with the "one big table" solution is I anticipate 1,200 trillion
> records. Random access is expected and the customer expects <30ms reads
> for a single record fetch.
>
> No data is loaded... yet Table and index creation only. I am interested
> in the opinions of all including tests I may perform. If you had this
> setup, what would you capture / analyze? I have a job running preparing
> data. I did this on a much smaller scale (50k tables) and data load via
> function allowed close to 6,000 records/second. The schema has been
> simplified since and last test reach just over 20,000 records/second with
> 300k tables.
>
> I'm not looking for alternatives yet but input to my test. Takers?
>
> I can't promise immediate feedback but will do my best to respond with
> results.
>
> TIA,
> -Greg
>

I've gotten more responses than anticipated and have answered some
questions and gotten some insight but my challenge again is what should I
capture along the way to prove or disprove this storage pattern?
Alternatives to the storage pattern aside, I need ideas to test rig,
capture metrics and suggestions to tune it.

In the next 24 hours, I will be sending ~1 trillion records to the test
database. Because of time to set up, I'd rather have things set up
properly the first go.

Thanks!
-Greg

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message dbyzaa@163.com 2016-09-26 15:39:11 temporary table vs array performance
Previous Message Stuart Bishop 2016-09-26 14:21:22 Re: Millions of tables