Re: Storing thousands of csv files in postgresql

From: Steve Midgley <science(at)misuse(dot)org>
To: Ion Alberdi <ion(dot)alberdi(at)pricemoov(dot)com>
Cc: pgsql-sql <pgsql-sql(at)lists(dot)postgresql(dot)org>
Subject: Re: Storing thousands of csv files in postgresql
Date: 2022-02-15 16:20:07
Message-ID: CAJexoSL-RADZAFFtW6NgGNg+SiomNwvcApWpogwuW3EsdGHJXA@mail.gmail.com
Lists: pgsql-sql

On Tue, Feb 15, 2022 at 12:15 AM Ion Alberdi <ion(dot)alberdi(at)pricemoov(dot)com>
wrote:

> Hello to all,
>
> One of the use cases we need to implement requires
> storing and querying thousands (and more as the product grows) of csv files
> that have different schemas (by schema we mean column names and their
> types).
>
> These csv would then need to be maintained with operations like:
> - add column,
> - add row,
> - delete row,
> - read: filter/sort/paginate,
> - write: edit column values.
>
> Let's assume that we store the definition of each schema in a dedicated
> table,
> with the schema defined in a json column. With this schema we'll be able to
> translate the read/write/update queries on these imported csv files into
> the corresponding SQL queries.
>
> The remaining question is how to store the data of each file in the DB.
>
> As suggested by https://www.postgresql.org/docs/10/sql-copy.html there is
> a way to import a csv into its own table. By using this approach for each
> csv, we see:
>
> Pros:
> - All postgresql types available:
> https://www.postgresql.org/docs/9.5/datatype.html,
> - Constraints on columns, among others uniqueness constraints,
> that make the DB guarantee rows will not be duplicated (relevant to the
> add row use case),
> - Debuggability: enables using standard SQL to browse csv data,
> - Can reuse existing libraries to generate dynamic SQL queries [1]
>
> Cons:
> - Need to have as many tables as different schemas.
>
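For concreteness, here is a rough, untested sketch of this first option
(the table and column names below are made up):

  -- one table per imported csv
  CREATE TABLE csv_1234 (
      product_id integer PRIMARY KEY,  -- uniqueness enforced by the DB
      label      text NOT NULL,
      price      numeric
  );

  -- bulk-load the file; HEADER skips the csv's first line
  COPY csv_1234 FROM '/path/to/file_1234.csv' WITH (FORMAT csv, HEADER true);

  -- read with filter/sort/pagination as plain SQL
  SELECT * FROM csv_1234
  WHERE price > 10
  ORDER BY label
  LIMIT 50 OFFSET 0;
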
> Another solution could consist of implementing a document store in
> postgresql,
> by storing all columns of a row in a single jsonb column.
>
> Pros:
> - Single table to store all different imported csv-s.
>
> Cons:
> - Fewer types available
> https://www.postgresql.org/docs/9.4/datatype-json.html,
> - No constraints on columns (no uniqueness or data validation constraints;
> these would have to be delegated to the application),
> - Ramp-up on json* functions (and I wonder whether there are libraries
> to safely generate dynamic SQL queries on json columns),
> (- Debuggability: this is not such a big con as json_to_record enables
> going back to a standard SQL experience)
>
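And a similarly rough sketch of the jsonb variant (again, every name below is
a placeholder):

  -- one shared table; each csv row becomes a jsonb document
  CREATE TABLE csv_rows (
      file_id bigint NOT NULL,  -- which imported csv this row belongs to
      row_num bigint NOT NULL,
      data    jsonb  NOT NULL,
      PRIMARY KEY (file_id, row_num)
  );

  -- filtering/sorting goes through json operators and casts
  SELECT data->>'label' AS label
  FROM csv_rows
  WHERE file_id = 1234
    AND (data->>'price')::numeric > 10
  ORDER BY data->>'label'
  LIMIT 50;

  -- jsonb_to_record gets back to typed columns for debugging
  SELECT r.*
  FROM csv_rows c,
       jsonb_to_record(c.data) AS r(product_id int, label text, price numeric)
  WHERE c.file_id = 1234;
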
> Based on this first pro/con list, we're wondering about the scalability
> limits faced by postgresql instances getting more tables in a given DB.
>
> Browsing the web, we saw two main issues:
> - One related to the OS "you may see some performance degradation
> associated
> with databases containing many tables. PostgreSQL may use a large number
> of
> files for storing the table data, and performance may suffer if the
> operating
> system does not cope well with many files in a single directory." [1]
> - Related to that, the fact that some operations like autovacuum are O(N)
> in the number of tables [3]
>
> On the other hand, reading timescaledb's architecture
> https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks
> "Each chunk is implemented using a standard database table."
> it seems that their platform took such a direction, which may have proved
> the scalability of such an approach.
>
> My question is thus the following:
> how many such tables can a single postgresql instance handle without
> trouble [4]?
>
> Any challenge/addition to the pro/cons list described above would be very
> welcome too.
>
>
Given that no matter what answer the community gives you about a reasonable
number of tables per DB, if your project is successful you'll probably exceed
that limit eventually, why not plan for federation at the start (a little, at
low cost) by including the PG server URL and DB name where the CSV is stored
in your CSV table schema store? That way, for now, you just store CSVs in the
current PG server/DB, and should it get overwhelmed, it's relatively easy to
point accessors to a different server and/or DB in the future for some CSV
resources. The main added upfront cost is that you'll need to solve for
credential management for the various PG servers. If you're in AWS, "Secrets
Manager" would work - but there are lots of equivalent solutions out there.
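
Concretely, something as small as the following in your schema store would be
enough to route queries later. This is only a sketch, and every table and
column name in it is made up:

  CREATE TABLE csv_files (
      file_id    bigserial PRIMARY KEY,
      schema_def jsonb NOT NULL,  -- column names/types, as you already planned
      pg_host    text  NOT NULL,  -- which PG server holds this csv's data
      pg_dbname  text  NOT NULL,  -- which DB on that server
      table_name text  NOT NULL   -- where the rows actually live
  );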

FWIW, I think your analysis of the pros and cons of tables vs documents is
excellent but slightly incomplete. In my experience with document DBs, you
only postpone all the downsides in order to get the immediate benefits (kind
of like a "sugar high"). Eventually you have to solve for everything you
solve for with the table-based solution; you just don't have to solve for it
upfront, as you do in the table approach. And, at least on the project where
I got bitten by a document DB architecture, it's a lot harder to solve many
of these problems later in your project, so it's better just to build using
structured tables up front for a project with meaningful structure and lots
of data.

Steve
