Re: Storing thousands of csv files in postgresql

From: Ion Alberdi <ion(dot)alberdi(at)pricemoov(dot)com>
To: Steve Midgley <science(at)misuse(dot)org>
Cc: Rob Sargent <robjsargent(at)gmail(dot)com>, pgsql-sql <pgsql-sql(at)lists(dot)postgresql(dot)org>
Subject: Re: Storing thousands of csv files in postgresql
Date: 2022-02-15 21:30:37
Message-ID: CANbgw4Boi=XWmYqbFFUXfSBcFgytAaje2B9=LAhrhi+F2TLspw@mail.gmail.com
Lists: pgsql-sql

> I missed that requirement - totally agree. Adding columns to file-based
> CSVs is a huge PITA. Agree on your other point as well: just having the
> right charset requirements in Pg can help trim or identify some of the bad
> data that comes in from CSVs but the opportunities for all kinds of
> efficient data validation recommends storing the CSV data in Pg.
+1
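
To make the validation point concrete, here is a rough Ruby sketch (the
function and column names are mine, purely illustrative) of the kind of
garbage a Pg table's column types would catch for free at COPY time, found
up front instead:

```ruby
require 'csv'

# Illustrative pre-check: flag cells that a Pg table would reject anyway,
# either on charset grounds (invalid encoding) or because the target
# column is declared as an integer.
def invalid_cells(csv_text, int_columns: [])
  bad = []
  CSV.parse(csv_text, headers: true).each_with_index do |row, i|
    row.each do |col, val|
      next if val.nil?
      if !val.valid_encoding?
        bad << [i, col, :bad_encoding]   # Pg would reject on charset grounds
      elsif int_columns.include?(col) && val !~ /\A-?\d+\z/
        bad << [i, col, :not_an_integer] # Pg would reject for an int column
      end
    end
  end
  bad
end

p invalid_cells("id,price\n1,100\n2,ten\n", int_columns: %w[price])
# => [[1, "price", :not_an_integer]]
```

With the data in Pg, none of this hand-rolled checking is needed - the
column types and constraints do it declaratively.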

> I'll add that I've used the postgres COPY command to move CSV data into
> tables so fast that I had to double take the rows/sec to believe it. I
> actually shared a gist to demonstrate how to implement this using Ruby and
> a low level postgres driver a few months ago:
> https://gist.github.com/science/393907d4123c87ed767bc81e9dd5a7da Maybe
> useful..
Thanks for the gist Steve!
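
For anyone else following along, a minimal sketch of the same idea in Ruby
(table and column names are made up; only `copy_data`/`put_copy_data` come
from the pg gem - see Steve's gist for the full version):

```ruby
begin
  require 'pg' # the low-level driver Steve's gist builds on; optional here
rescue LoadError
  # pg gem not installed; copy_sql below still works on its own
end

# Build the COPY statement (illustrative helper, not from the gist itself).
def copy_sql(table, columns)
  cols = columns.map { |c| %("#{c}") }.join(', ')
  %(COPY "#{table}" (#{cols}) FROM STDIN WITH (FORMAT csv, HEADER true))
end

# Stream a CSV file through COPY; connection setup is left to the caller.
def load_csv(conn, table, columns, path)
  conn.copy_data(copy_sql(table, columns)) do
    File.foreach(path) { |line| conn.put_copy_data(line) }
  end
end

# Example (assumed local database, not run here):
# conn = PG.connect(dbname: 'mydb')
# load_csv(conn, 'prices', %w[id price], 'prices.csv')
```

Streaming line-by-line keeps memory flat even for very large files, which
is part of why COPY hits the rows/sec numbers Steve describes.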

On Tue, Feb 15, 2022 at 10:22 PM Steve Midgley <science(at)misuse(dot)org> wrote:

>
>
> On Tue, Feb 15, 2022 at 1:13 PM Rob Sargent <robjsargent(at)gmail(dot)com> wrote:
>
>>
>> I don't think you need a "federated" postgres network like Citus at all -
>> I think this solves a different use case. For your design problem, I think
>> that having a bunch of independent Pg servers would be fine - as long as
>> you don't need to run searches across CSV tables stored across different
>> databases (in which case you do need index/search federation of some kind).
>>
>> Regarding Erik Brandsberg's point about XFS, I think this is a useful
>> alternative approach, if I understand the idea. Instead of storing your CSV
>> files in Postgres, just store them as CSV files on the file system. You can
>> still store the schemas in Pg, but each schema would just point to a file
>> in the file system and you'd manipulate the files in the filesystem using
>> whatever language is appropriate (I find ruby to be excellent for managing
>> CSV files). If you need to index those files to run searches against them,
>> I'd direct your attention to https://prestodb.io/ which is the core
>> technology that runs Amazon Athena. This allows you to search CSV files
>> with various schema (among other data bindings). So you might find that Pg
>> as your schema storage, XFS (or any modern FS) to store large numbers of
>> CSV files, and Presto/Athena to index/search those files, along with some
>> CSV management language (like Ruby or something even higher level) to
>> manage the data.
>>
>> I think if I were dealing with less than 10k CSV files (and therefore Pg
>> tables), I might use Pg, and if I were dealing with 10k+ files, I'd start
>> looking at file systems + Presto. But that's a WAG.
>>
>> Steve
>>
>>
>>
>> I think the add/remove column requirement alone justifies NOT using
>> files. The CSV approach will temp the system to handle some versioning
>> nonsense. Using tables also provides some protection against the inevitable
>> garbage data in the CSVs.
>>
>
> I missed that requirement - totally agree. Adding columns to file-based
> CSVs is a huge PITA. Agree on your other point as well: just having the
> right charset requirements in Pg can help trim or identify some of the bad
> data that comes in from CSVs but the opportunities for all kinds of
> efficient data validation recommends storing the CSV data in Pg.
>
> I'll add that I've used the postgres COPY command to move CSV data into
> tables so fast that I had to double take the rows/sec to believe it. I
> actually shared a gist to demonstrate how to implement this using Ruby and
> a low level postgres driver a few months ago:
> https://gist.github.com/science/393907d4123c87ed767bc81e9dd5a7da Maybe
> useful..
>
