Millions of tables

From: Greg Spiegelberg <gspiegelberg(at)gmail(dot)com>
To: "pgsql-performa(dot)" <pgsql-performance(at)postgresql(dot)org>
Subject: Millions of tables
Date: 2016-09-26 02:50:09
Message-ID: CAEtnbpVXi0ESh16KAeYkFOiFPhTrJ2JXps9PypuAGCYidWtcnA@mail.gmail.com
Lists: pgsql-performance

Hey all,

Obviously, everyone who has worked with PostgreSQL, or almost any RDBMS, for
any length of time has said not to have millions of tables. I too long
believed it, until recently.

An AWS d2.8xlarge instance running 9.5 is my test rig, using XFS on EBS (io1)
for PGDATA. Over the weekend, I created 8M tables with 16M indexes on those
tables. Table creation initially took 0.018031 seconds, averaged 0.027467
seconds, and after tossing out outliers (qty 5) the maximum creation time
observed was 0.66139 seconds. Total time: 30 hours, 31 minutes and 8.435049
seconds. Tables were created by a single process. Note that table creation is
done via a plpgsql function, as there are other housekeeping tasks necessary,
though minimal.
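
The real function does a bit more bookkeeping, but the shape of it is roughly
this (a simplified sketch; the table/index names and column layout are
placeholders, not my actual schema):

CREATE OR REPLACE FUNCTION create_shard(p_id bigint)
RETURNS void LANGUAGE plpgsql AS $$
DECLARE
    t text := 't_' || p_id;
BEGIN
    -- One narrow table and two indexes per "shard"; two indexes per table
    -- is what gets 8M tables to 16M indexes.
    EXECUTE format('CREATE TABLE %I (k bigint, ts timestamptz, v text)', t);
    EXECUTE format('CREATE INDEX %I ON %I (k)',  t || '_k_idx',  t);
    EXECUTE format('CREATE INDEX %I ON %I (ts)', t || '_ts_idx', t);
    -- the real function also does some minimal housekeeping, omitted here
END;
$$;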

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48MB
max_stack_depth = 4MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on its own file system

There are still some obvious problems. General DBA operations such as
VACUUM and ANALYZE should not be run; each will run forever and cause
much grief. Backups are problematic in the traditional pg_dump and PITR
space. Large JOINs via a VIEW, a SELECT, or table inheritance (which I am
abusing in my test case) are no-nos. A system or database crash could
potentially take hours to days to recover from. There are likely other
issues ahead.
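
Related to that: the system catalogs themselves get heavy at this scale, since
pg_class, pg_attribute and pg_index each carry a row (or several) per relation.
These are just generic catalog queries, nothing specific to my harness, for
anyone who wants a rough picture of that overhead:

SELECT relkind, count(*) FROM pg_class GROUP BY relkind ORDER BY count(*) DESC;
SELECT pg_size_pretty(pg_total_relation_size('pg_class'));
SELECT pg_size_pretty(pg_total_relation_size('pg_attribute'));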

You may wonder, "why is Greg attempting such a thing?" I looked at
DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it,
it's antiquated, and don't get me started on "Hadoop". I looked at many
others, and ultimately each vendor's recommendation was to have one table
for all the data. That overcomes the millions-of-tables problem, right?

The problem with the "one big table" solution is that I anticipate 1,200
trillion records. Random access is expected, and the customer expects <30ms
reads for a single-record fetch.
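
(For rough scale: 1,200 trillion rows spread evenly over the 8M tables in this
test would be about 150 million rows per table, i.e. 1.2e15 / 8e6 = 1.5e8.
That's just back-of-the-envelope arithmetic, not the actual planned
distribution.)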

No data is loaded yet; table and index creation only. I am interested in the
opinions of all, including tests I might perform. If you had this setup, what
would you capture / analyze? I have a job running preparing data. I did this
on a much smaller scale (50k tables), and data load via the function allowed
close to 6,000 records/second. The schema has been simplified since, and the
last test reached just over 20,000 records/second with 300k tables.
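
For what it's worth, the load function is the same idea as the creation one:
dynamic SQL routing each record to its table. A minimal sketch, again using
the placeholder names from the earlier example rather than my real schema:

CREATE OR REPLACE FUNCTION load_record(p_id bigint, p_k bigint,
                                       p_ts timestamptz, p_v text)
RETURNS void LANGUAGE plpgsql AS $$
BEGIN
    -- route the row to its per-shard table
    EXECUTE format('INSERT INTO %I (k, ts, v) VALUES ($1, $2, $3)',
                   't_' || p_id)
        USING p_k, p_ts, p_v;
END;
$$;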

I'm not looking for alternatives yet, but for input on my test. Takers?

I can't promise immediate feedback but will do my best to respond with
results.

TIA,
-Greg
