Re: Millions of tables

From: Greg Spiegelberg <gspiegelberg(at)gmail(dot)com>
To: Stuart Bishop <stuart(at)stuartbishop(dot)net>
Cc: "pgsql-performa(dot)" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Millions of tables
Date: 2016-09-26 13:51:10
Message-ID: CAEtnbpVwn9qLuqWF3M5jE28v0AFpOHtoKowEr=uYgHovV-QncQ@mail.gmail.com
Lists: pgsql-performance

On Mon, Sep 26, 2016 at 3:43 AM, Stuart Bishop <stuart(at)stuartbishop(dot)net>
wrote:

> On 26 September 2016 at 11:19, Greg Spiegelberg <gspiegelberg(at)gmail(dot)com>
> wrote:
>
>> I did look at PostgresXL and CitusDB. Both are admirable however neither
>> could support the need to read a random record consistently under 30ms.
>> It's a similar problem Cassandra and others have: network latency. At this
>> scale, to provide the ability to access any given record amongst trillions,
>> it is imperative to know precisely where it is stored (system & database)
>> and read a relatively small index. I have other requirements that prohibit
>> use of any technology that is eventually consistent.
>>
>> I liken the problem to fishing. To find a particular fish of length,
>> size, color &c in a data lake you must accept the possibility of scanning
>> the entire lake. However, if all fish were in barrels where each barrel
>> had a particular kind of fish of specific length, size, color &c then the
>> problem is far simpler.
>>
>> -Greg
>>
>
> My gut tells me that if you do solve the problem and get PostgreSQL (or
> anything) reading consistently at under 30ms with that many tables, you
> will have solved one problem by creating another.
>
>
Exactly why I am exploring. What are the trade-offs?
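
To make concrete what I mean by knowing precisely where a record lives,
the routing I am exploring looks roughly like this (a minimal sketch; the
barrel_N naming, hashtext() as the hash, and the 4096 barrel count are
illustrative assumptions, not my actual schema):

  -- Compute which "barrel" table holds a given key, using PostgreSQL's
  -- built-in hashtext(); only that one table's small index is then read.
  SELECT format('barrel_%s', abs(hashtext('customer-12345')::bigint) % 4096)
      AS target_table;

  -- The application then queries that single table directly, e.g.:
  --   SELECT * FROM barrel_2117 WHERE key = 'customer-12345';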

> You discounted Cassandra due to network latency, but are now trying a
> monolithic PostgreSQL setup. It might be worth trying a single node
> ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead),
> perhaps using layered compaction so all your data gets broken out into
> 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as
> they are very focused on performance problems like yours and should offer
> some insight even if a Cassandra style architecture cannot meet your
> requirements.
>
>
Cassandra performance, according to the experts I consulted, starts to fall
off once the stored dataset exceeds ~3 TB, which is much too small for my
use case. Again, I have other reasons for ruling out Cassandra and its
peers, namely deduplication of the information referenced by my millions of
tables. Few systems outside the RDBMS realm offer such guarantees.

> An alternative if you exhaust or don't trust other options, use a foreign
> data wrapper to access your own custom storage. A single table at the PG
> level, you can shard the data yourself into 8 bazillion separate stores, in
> whatever structure suites your read and write operations (maybe reusing an
> embedded db engine, ordered flat file+log+index, whatever).
>
>
However, even 8 bazillion FDW's may cause an "overflow" of relations while
giving up an efficient storage engine acting as a traffic cop. In that
case, I would opt to put the routing logic in the application and access
the true storage directly rather than go through FDW's.
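
For reference, the suggested arrangement would look roughly like the
following with postgres_fdw (a minimal sketch; the server name, host,
credentials, and barrel_0 column layout are illustrative assumptions):

  CREATE EXTENSION IF NOT EXISTS postgres_fdw;

  -- One foreign server (and user mapping) per remote store.
  CREATE SERVER shard0 FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'shard0.example.com', dbname 'fish');
  CREATE USER MAPPING FOR CURRENT_USER SERVER shard0
      OPTIONS (user 'app', password 'secret');

  -- One foreign table per remote store; a single logical table can then
  -- be a view or inheritance parent over all of them.
  CREATE FOREIGN TABLE barrel_0 (
      key text,
      doc jsonb
  ) SERVER shard0 OPTIONS (schema_name 'public', table_name 'barrel_0');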

-Greg
