Re: Storing thousands of csv files in postgresql

From: Ion Alberdi <ion(dot)alberdi(at)pricemoov(dot)com>
To: Steve Midgley <science(at)misuse(dot)org>
Cc: pgsql-sql <pgsql-sql(at)lists(dot)postgresql(dot)org>
Subject: Re: Storing thousands of csv files in postgresql
Date: 2022-02-15 21:12:28
Message-ID: CANbgw4CO5YYT7OmuvFDWAmTXsmW_khbKFnr6s1h7+6=aywX1gg@mail.gmail.com
Lists: pgsql-sql

>I don't think you need a "federated" postgres network like Citus at all -
I think this solves a different use case. For your design problem, I think
that having a bunch of independent Pg servers would be fine - as long as
you don't need to run searches across CSV tables stored across different
databases (in which case you do need index/search federation of some kind).
Indeed

>I think if I were dealing with fewer than 10k CSV files (and therefore Pg
tables), I might use Pg, and if I were dealing with 10k+ files, I'd start
looking at file systems + Presto. But that's a WAG.
Got it; that setup would require a heavier dev investment than using
multiple pg instances, though.

Thanks for these additional insights!

Le mar. 15 févr. 2022 à 21:58, Steve Midgley <science(at)misuse(dot)org> a écrit :

>
>
> On Tue, Feb 15, 2022 at 11:38 AM Ion Alberdi <ion(dot)alberdi(at)pricemoov(dot)com>
> wrote:
>
>> Thanks for these precious insights Steve!
>> >No matter what answer the community gives you about a reasonable number
>> of tables per DB, if your project is successful, you'll probably exceed
>> that limit eventually.
>> Indeed!
>>
>> >Why not plan for federation at the start (a little, at low cost) by
>> including the PG server URL and DB name where the CSV is stored in your CSV
>> table schema store?
>> So far I'd hoped that https://www.citusdata.com/ would have features to
>> do so. Reading the docs, they do not seem to provide such a federation,
>> though; I'll send them an email to be sure. Thanks again!
>>
>> Le mar. 15 févr. 2022 à 17:20, Steve Midgley <science(at)misuse(dot)org> a
>> écrit :
>>
>>>
>>>
>>> On Tue, Feb 15, 2022 at 12:15 AM Ion Alberdi <ion(dot)alberdi(at)pricemoov(dot)com>
>>> wrote:
>>>
>>>> Hello to all,
>>>>
>>>> One of the use cases we need to implement requires storing and querying
>>>> thousands (and more as the product grows) of csv files that have
>>>> different schemas (by schema we mean column names and their types).
>>>>
>>>> These csv files would then need to be maintained with operations like:
>>>> - add column,
>>>> - add row,
>>>> - delete row,
>>>> - read: filter/sort/paginate,
>>>> - write: edit column values.
>>>>
>>>> Let's assume that we store the definition of each schema in a dedicated
>>>> table, with the schema defined in a json column. With this schema we'll
>>>> be able to translate the read/write/update queries on these imported csv
>>>> files into the related SQL queries.
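>>>>
>>>> To make that concrete, here is a minimal sketch of such a schema-registry
>>>> table (the table and column names are hypothetical, purely for
>>>> illustration):
>>>>
>>>>   CREATE TABLE csv_schemas (
>>>>       id         bigserial PRIMARY KEY,
>>>>       name       text NOT NULL,
>>>>       -- e.g. {"columns": [{"name": "sku",   "type": "text"},
>>>>       --                   {"name": "price", "type": "numeric"}]}
>>>>       schema_def jsonb NOT NULL,
>>>>       created_at timestamptz NOT NULL DEFAULT now()
>>>>   );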
>>>>
>>>> The remaining question is how to store the data of each file in the DB.
>>>>
>>>> As suggested by https://www.postgresql.org/docs/10/sql-copy.html there
>>>> is a way to import a csv into its own table. By using this approach for
>>>> each csv we see:
>>>>
>>>> Pros:
>>>> - All postgresql types available:
>>>> https://www.postgresql.org/docs/9.5/datatype.html,
>>>> - Constraints on columns, among others uniqueness constraints,
>>>> which make the DB guarantee that rows will not be duplicated (relevant
>>>> to the add row use case),
>>>> - Debuggability: enables using standard SQL to browse csv data,
>>>> - Can reuse existing libraries to generate dynamic SQL queries [1]
>>>>
>>>> Cons:
>>>> - Need to have as many tables as different schemas.
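>>>>
>>>> A minimal sketch of this per-csv approach (table, column and file names
>>>> are made up for illustration):
>>>>
>>>>   CREATE TABLE csv_42 (
>>>>       sku        text UNIQUE,   -- uniqueness constraint per imported column
>>>>       price      numeric,
>>>>       updated_at timestamptz
>>>>   );
>>>>   -- import (or \copy from psql when the file lives on the client side)
>>>>   COPY csv_42 FROM '/path/to/file_42.csv' WITH (FORMAT csv, HEADER true);
>>>>   -- add column
>>>>   ALTER TABLE csv_42 ADD COLUMN currency text;
>>>>   -- read: filter/sort/paginate
>>>>   SELECT * FROM csv_42
>>>>   WHERE price > 10
>>>>   ORDER BY updated_at DESC
>>>>   LIMIT 50 OFFSET 0;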
>>>>
>>>> Another solution could consist of implementing a document store in
>>>> postgresql,
>>>> by storing all columns of a row in a single jsonb column.
>>>>
>>>> Pros:
>>>> - Single table to store all different imported csv-s.
>>>>
>>>> Cons:
>>>> - Fewer types available
>>>> https://www.postgresql.org/docs/9.4/datatype-json.html,
>>>> - No constraints on columns (no uniqueness or data validation
>>>> constraints; these would have to be delegated to the application),
>>>> - Ramp-up on json* functions (and I wonder whether there are libraries
>>>> to safely generate dynamic SQL queries on json columns),
>>>> (- Debuggability: this is not such a big con, as json_to_record enables
>>>> going back to a standard SQL experience; see the sketch below)
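>>>>
>>>> A minimal sketch of this document-store variant (again, names are
>>>> hypothetical; jsonb_to_record is the jsonb counterpart of json_to_record):
>>>>
>>>>   CREATE TABLE csv_rows (
>>>>       csv_id bigint NOT NULL,   -- points to the schema-registry row
>>>>       doc    jsonb  NOT NULL
>>>>   );
>>>>   -- read back with real types via jsonb_to_record
>>>>   SELECT r.*
>>>>   FROM csv_rows c,
>>>>        jsonb_to_record(c.doc) AS r(sku text, price numeric)
>>>>   WHERE c.csv_id = 42
>>>>     AND r.price > 10;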
>>>>
>>>> Based on this first pro/con list, we're wondering about the scalability
>>>> limits postgresql instances face as the number of tables in a given DB
>>>> grows.
>>>>
>>>> Browsing the web, we saw two main issues:
>>>> - One related to the OS "you may see some performance degradation
>>>> associated
>>>> with databases containing many tables. PostgreSQL may use a large
>>>> number of
>>>> files for storing the table data, and performance may suffer if the
>>>> operating
>>>> system does not cope well with many files in a single directory." [1]
>>>> - Related to that, the fact that some operations like autovacuum are
>>>> O(N) in the number of tables [3]
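>>>>
>>>> (For what it's worth, the number of relations, and hence roughly the
>>>> number of data files, in the current database can be checked with a
>>>> catalog query such as:
>>>>
>>>>   SELECT count(*) FROM pg_class WHERE relkind IN ('r', 'i', 't');
>>>>
>>>> which at least makes it easy to watch that number grow.)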
>>>>
>>>> On the other hand, reading timescaledb's architecture
>>>> https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks
>>>> "Each chunk is implemented using a standard database table."
>>>> it seems that their platform took this direction, which suggests such an
>>>> approach can scale.
>>>>
>>>> My question is thus the following:
>>>> how many such tables can a single postgresql instance handle without
>>>> trouble [4]?
>>>>
>>>> Any challenge/addition to the pro/cons list described above would be
>>>> very welcome too.
>>>>
>>>>
>>> No matter what answer the community gives you about a reasonable number of
>>> tables per DB, if your project is successful, you'll probably exceed that
>>> limit eventually. Why not plan for federation at the start (a little, at
>>> low cost) by including the PG server URL and DB name where the CSV is
>>> stored in your CSV table schema store? That way, for now, you just store
>>> CSVs in the current PG server/DB, and should it get overwhelmed, it's
>>> relatively easy to point accessors to a different server and/or DB in the
>>> future for some CSV resources. The main upfront increased cost is that
>>> you'll need to solve for credential management for the various PG servers.
>>> If you're in AWS, "Secrets Manager" would work - but there are lots of
>>> equivalent solutions out there.
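>>>
>>> Concretely, assuming a schema-registry table like the csv_schemas sketch
>>> earlier in the thread, that could be as little as (column names and
>>> defaults are hypothetical placeholders):
>>>
>>>   ALTER TABLE csv_schemas
>>>       -- placeholders pointing at the current server/db for existing rows
>>>       ADD COLUMN pg_server_url text NOT NULL DEFAULT 'postgresql://current-host:5432',
>>>       ADD COLUMN db_name       text NOT NULL DEFAULT 'current_db';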
>>>
>>> FWIW, I think your analysis of the pros and cons of tables vs documents is
>>> excellent but slightly incomplete. In my experience with document DBs, they
>>> only postpone all the downsides in order to get the immediate benefits
>>> (kind of like a "sugar high"). Eventually you have to solve for everything
>>> you solve for with the table-based solution; you just don't have to solve
>>> for it upfront, as in the table approach. And, at least on the project
>>> where I got bit by a document db architecture, it's a lot harder to solve
>>> many of these problems later in your project, so it's better just to build
>>> using structured tables up front for a project with meaningful structures
>>> and lots of data.
>>>
>>>
> I don't think you need a "federated" postgres network like Citus at all -
> I think this solves a different use case. For your design problem, I think
> that having a bunch of independent Pg servers would be fine - as long as
> you don't need to run searches across CSV tables stored across different
> databases (in which case you do need index/search federation of some kind).
>
> Regarding Erik Brandsberg's point about XFS, I think this is a useful
> alternative approach, if I understand the idea. Instead of storing your CSV
> files in Postgres, just store them as CSV files on the file system. You can
> still store the schemas in Pg, but each schema would just point to a file
> in the file system and you'd manipulate the files in the filesystem using
> whatever language is appropriate (I find ruby to be excellent for managing
> CSV files). If you need to index those files to run searches against them,
> I'd direct your attention to https://prestodb.io/ which is the core
> technology that runs Amazon Athena. This allows you to search CSV files
> with various schemas (among other data bindings). So you might end up with
> Pg as your schema storage, XFS (or any modern FS) to store large numbers of
> CSV files, Presto/Athena to index/search those files, and some CSV
> management language (like Ruby or something even higher level) to manage
> the data.
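>
> As a rough, untested sketch of that last piece: with the Presto Hive
> connector, a directory of CSV files can be exposed as a queryable table
> roughly like this (it assumes a configured catalog named hive, the bucket
> path is made up, and the CSV format there maps every column to varchar):
>
>   CREATE TABLE hive.default.products_csv (
>       sku   varchar,
>       price varchar
>   )
>   WITH (
>       format = 'CSV',
>       external_location = 's3://my-bucket/csv/products/'
>   );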
>
> I think if I were dealing with fewer than 10k CSV files (and therefore Pg
> tables), I might use Pg, and if I were dealing with 10k+ files, I'd start
> looking at file systems + Presto. But that's a WAG.
>
> Steve
