Re: Storing thousands of csv files in postgresql

From: Erik Brandsberg <erik(at)heimdalldata(dot)com>
To: Ion Alberdi <ion(dot)alberdi(at)pricemoov(dot)com>
Cc: pgsql-sql <pgsql-sql(at)lists(dot)postgresql(dot)org>
Subject: Re: Storing thousands of csv files in postgresql
Date: 2022-02-15 20:42:58
Message-ID: CAFcck8EFY0R0hLHy8VPa7xYqre3EH_p13c0+dKb+cuzsqwWuvA@mail.gmail.com
Lists: pgsql-sql

I was just about to point out in a follow-up that you may not have control
over the filesystem if you are on, say, RDS. In that case, another
question: is there a core set of columns that will be used by all tables,
and then a variable set for each? It may make sense to use one table with
a "table id" column and the common columns, and then use json storage
for the variable columns. More information on the nature of the data may
help elicit a better answer, however.
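
For illustration only (table and column names below are made up, not
something from this thread), such a hybrid layout could look roughly like:

CREATE TABLE imported_rows (
    dataset_id  bigint      NOT NULL,   -- which imported csv the row belongs to
    row_id      bigserial   PRIMARY KEY,
    -- columns shared by every csv go here as real, typed columns, e.g.:
    created_at  timestamptz NOT NULL DEFAULT now(),
    -- per-csv variable columns end up in jsonb:
    extra       jsonb       NOT NULL DEFAULT '{}'::jsonb
);

CREATE INDEX imported_rows_extra_idx ON imported_rows USING gin (extra);

A GIN index on the jsonb column keeps filtering on the variable part
workable, with the validation and typing trade-offs discussed below.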

On Tue, Feb 15, 2022 at 3:40 PM Ion Alberdi <ion(dot)alberdi(at)pricemoov(dot)com>
wrote:

> >"What filesystem is best suited to having many files in the same
> directory" and let the technology deal with the problem. Postgres is just
> dependent on the filesystem for this behavior. And the answer to that, I
> believe, is XFS.
> Given that we use AWS RDS instances, I don't think we have the option to
> choose the filesystem (at least there is no such info at
> https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/).
> Still, we'll keep that in mind. Thanks Erik!
>
>
>
> Le mar. 15 févr. 2022 à 21:11, Erik Brandsberg <erik(at)heimdalldata(dot)com> a
> écrit :
>
>> I'm going to challenge that: the question is not one for Postgres. Rather,
>> you should be asking "What filesystem is best suited to having many files in
>> the same directory" and let the technology deal with the problem. Postgres
>> is just dependent on the filesystem for this behavior. And the answer to
>> that, I believe, is XFS.
>>
>> On Tue, Feb 15, 2022 at 3:15 AM Ion Alberdi <ion(dot)alberdi(at)pricemoov(dot)com>
>> wrote:
>>
>>> Hello to all,
>>>
>>> One of the use cases we need to implement requires storing and querying
>>> thousands (and more as the product grows) of csv files that have
>>> different schemas (by schema we mean the column names and their types).
>>>
>>> These csvs would then need to be maintained with operations like:
>>> - add column,
>>> - add row,
>>> - delete row,
>>> - read: filter/sort/paginate,
>>> - write: edit column values.
>>>
>>> Let's assume that we store the definition of each schema in a dedicated
>>> table, with the schema defined in a json column. With this schema we'll
>>> be able to translate the read/write/update queries on these imported csv
>>> files into the corresponding SQL queries.
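>>>
>>> For illustration only (table and column names are made up), such a
>>> dedicated table could look roughly like:
>>>
>>> CREATE TABLE csv_schemas (
>>>     csv_id     bigserial PRIMARY KEY,
>>>     name       text      NOT NULL,
>>>     -- e.g. {"columns": [{"name": "price", "type": "numeric"}, ...]}
>>>     definition jsonb     NOT NULL
>>> );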
>>>
>>> The remaining question is how to store the data of each file in the DB.
>>>
>>> As suggested by https://www.postgresql.org/docs/10/sql-copy.html, there
>>> is a way to import each csv into its own table (a sketch follows the
>>> pros/cons below). Using this approach for each csv, we see:
>>>
>>> Pros:
>>> - All postgresql types available:
>>> https://www.postgresql.org/docs/9.5/datatype.html,
>>> - Constraints on columns, among others uniqueness constraints,
>>> which make the DB guarantee that rows will not be duplicated (relevant to
>>> the add-row use case),
>>> - Debuggability: enables using standard SQL to browse csv data,
>>> - Can reuse existing libraries to generate dynamic SQL queries [1]
>>>
>>> Cons:
>>> - Need to have as many tables as different schemas.
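>>>
>>> A minimal sketch of the per-file approach (table and column names are
>>> made up for illustration; the real DDL would be generated from the stored
>>> schema):
>>>
>>> -- one table generated from one imported file's schema
>>> CREATE TABLE csv_42 (
>>>     product_ref text    NOT NULL UNIQUE,
>>>     price       numeric,
>>>     quantity    integer
>>> );
>>>
>>> COPY csv_42 (product_ref, price, quantity)
>>>     FROM '/path/to/file.csv' WITH (FORMAT csv, HEADER true);
>>>
>>> On a managed instance like RDS the file cannot be read from a server-side
>>> path, so the data would be streamed from the client instead, e.g. with
>>> psql's \copy or COPY ... FROM STDIN.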
>>>
>>> Another solution could consist of implementing a document store in
>>> postgresql, by storing all the columns of a row in a single jsonb column
>>> (a sketch follows the pros/cons below).
>>>
>>> Pros:
>>> - Single table to store all different imported csv-s.
>>>
>>> Cons:
>>> - Fewer types available
>>> https://www.postgresql.org/docs/9.4/datatype-json.html,
>>> - No constraints on columns (no uniqueness or data validation constraints;
>>> these would have to be delegated to the application),
>>> - Ramp-up on the json* functions (and I wonder whether there are libraries
>>> to safely generate dynamic SQL queries on json columns),
>>> (- Debuggability: this is not such a big con as json_to_record enables
>>> going back to a standard SQL experience)
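>>>
>>> A minimal sketch of the document-store approach (names and columns are
>>> made up for illustration):
>>>
>>> -- one table for every imported csv
>>> CREATE TABLE csv_rows (
>>>     csv_id bigint NOT NULL,
>>>     data   jsonb  NOT NULL
>>> );
>>>
>>> -- back to a typed, tabular view of one csv via jsonb_to_record:
>>> SELECT r.*
>>> FROM csv_rows c,
>>>      jsonb_to_record(c.data)
>>>        AS r(product_ref text, price numeric, quantity integer)
>>> WHERE c.csv_id = 42;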
>>>
>>> Based on this first pro/con list, we're wondering about the scalability
>>> limits postgresql instances face as the number of tables in a given DB
>>> grows.
>>>
>>> Browsing the web, we saw two main issues:
>>> - One related to the OS: "you may see some performance degradation
>>> associated with databases containing many tables. PostgreSQL may use a
>>> large number of files for storing the table data, and performance may
>>> suffer if the operating system does not cope well with many files in a
>>> single directory." [2]
>>> - Related to that, the fact that some operations like autovacuum are
>>> O(N) in the number of tables [3]
>>>
>>> On the other hand, reading timescaledb's architecture
>>> https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks
>>> ("Each chunk is implemented using a standard database table."),
>>> it seems that their platform has taken this direction, which may have
>>> proved the scalability of such an approach.
>>>
>>> My question is thus the following:
>>> how many such tables can a single postgresql instance handle without
>>> trouble [4]?
>>>
>>> Any challenge/addition to the pro/cons list described above would be
>>> very welcome too.
>>>
>>> Best regards,
>>> Ion
>>>
>>> [1]: Like https://www.psycopg.org/docs/sql.html
>>> [2]:
>>> https://link.springer.com/content/pdf/bbm%3A978-1-4302-0018-5%2F1.pdf
>>> [3]:
>>> https://stackoverflow.com/questions/22395883/postgresql-what-is-the-maximum-number-of-tables-can-store-in-postgresql-databas
>>> [4]: We use RDS instances in AWS
>>>
>>
