Re: Storing thousands of csv files in postgresql

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: pgsql-sql(at)lists(dot)postgresql(dot)org
Subject: Re: Storing thousands of csv files in postgresql
Date: 2022-02-15 21:13:30
Message-ID: 9abb8f15-001e-8aa8-d930-fe5af71f829c@gmail.com
Lists: pgsql-sql


> I don't think you need a "federated" postgres network like Citus at
> all - I think this solves a different use case. For your design
> problem, I think that having a bunch of independent Pg servers would
> be fine - as long as you don't need to run searches across CSV tables
> stored across different databases (in which case you do need
> index/search federation of some kind).
>
> Regarding Erik Brandsberg's point about XFS, I think this is a useful
> alternative approach, if I understand the idea. Instead of storing
> your CSV files in Postgres, just store them as CSV files on the file
> system. You can still store the schemas in Pg, but each schema would
> just point to a file in the file system and you'd manipulate the files
> in the filesystem using whatever language is appropriate (I find ruby
> to be excellent for managing CSV files). If you need to index those
> files to run searches against them, I'd direct your attention to
> https://prestodb.io/ which is the core technology that runs Amazon
> Athena. This allows you to search CSV files with various schemas (among
> other data bindings). So you might end up with Pg as your schema
> storage, XFS (or any modern FS) to store large numbers of CSV files,
> Presto/Athena to index/search those files, and some CSV management
> language (like Ruby or something even higher level) to manage the data.
>
> I think if I were dealing with less than 10k CSV files (and therefore
> Pg tables), I might use Pg, and if I were dealing with 10k+ files, I'd
> start looking at file systems + Presto. But that's a WAG.
>
> Steve
>
>
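If I read Steve's suggestion right, the registry side of that design would be roughly this (table, column, and file names are invented, just a sketch):

  CREATE TABLE csv_catalog (
      csv_id     bigserial PRIMARY KEY,
      file_path  text NOT NULL UNIQUE,   -- where the CSV lives on the filesystem
      schema_def jsonb NOT NULL,         -- declared column names/types for that file
      loaded_at  timestamptz NOT NULL DEFAULT now()
  );

  -- Register one file; the data itself stays on the filesystem and is searched via Presto.
  INSERT INTO csv_catalog (file_path, schema_def)
  VALUES ('/data/csv/sales_2021.csv',
          '{"columns": [{"name": "region", "type": "text"},
                        {"name": "amount", "type": "numeric"}]}');
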
I think the add/remove column requirement alone justifies NOT using
files. The CSV approach will tempt the system into handling some
versioning nonsense. Using tables also provides some protection against
the inevitable garbage data in the CSVs.
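
Something along these lines is what I have in mind (table, column, and file names are made up for illustration):

  -- One table per CSV; column types reject garbage values at load time.
  CREATE TABLE sales_2021 (
      region text,
      amount numeric
  );

  -- Server-side load; needs pg_read_server_files (or use \copy from psql).
  COPY sales_2021 FROM '/data/csv/sales_2021.csv' WITH (FORMAT csv, HEADER true);

  -- The add/remove column requirement becomes ordinary DDL.
  ALTER TABLE sales_2021 ADD COLUMN currency text;
  ALTER TABLE sales_2021 DROP COLUMN region;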
