Re: Loading 500m json files to database

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: pinker <pinker(at)onet(dot)eu>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Loading 500m json files to database
Date: 2020-03-24 01:51:27
Message-ID: 3A9D9AC0-AD61-45C6-87AD-BD8C9E255A37@gmail.com
Lists: pgsql-general

> On Mar 23, 2020, at 7:11 PM, David G. Johnston <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
>
> On Mon, Mar 23, 2020 at 3:24 AM pinker <pinker(at)onet(dot)eu <mailto:pinker(at)onet(dot)eu>> wrote:
> time for i in datafiles/*; do
>   psql -c "\copy json_parts(json_data) FROM $i" &
> done
>
> Don't know whether this is faster but it does avoid spinning up a connection multiple times.
>
> #bash, linux
> function append_each_split_file_to_etl_load_script() {
>   for filetoload in ./*; do
>     ronumber="$(basename "$filetoload")"
>     # only process files since subdirs can be present
>     if [[ -f "$filetoload" ]]; then
>       echo ""
>       echo "\set invoice"' `cat '"'""$filetoload""'"'`'
>       echo ", ('$ronumber',:'invoice')"
>     fi >> "$PSQLSCRIPT"
>   done
>
>   echo "" >> "$PSQLSCRIPT"
>   echo ";" >> "$PSQLSCRIPT"
>   echo "" >> "$PSQLSCRIPT"
> }
>
> There is a bit of other related code that is needed (for my specific usage) but this is the core of it. Use psql variables to capture the contents of each file into a variable and then just perform a normal insert (specifically, a VALUES (...), (...) variant). Since you can intermix psql and SQL you basically output a bloody long script - one that has memory issues at scale, but you can divide and conquer - and then run "psql --file bloody_long_script_part_1_of_100000.psql".
>
> David J.
>
Can one put 550M files in a single directory? I thought it topped out at 16M or so.
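[The script-generation approach David describes can be sketched end to end at toy scale. The following is an illustrative reconstruction, not code from the thread: the column names (ronumber, json_data) and the seed row are assumptions, since his function only emits continuation rows and the INSERT preamble lives in the "other related code" he mentions.]

```shell
set -eu

workdir="$(mktemp -d)"
PSQLSCRIPT="$workdir/load_part_1.psql"

# Two toy "split files" standing in for the real 500m json files.
printf '{"a":1}' > "$workdir/1001"
printf '{"b":2}' > "$workdir/1002"

# Preamble: the generated fragments all start with a comma, so the
# script needs an INSERT ... VALUES opener (first row faked here).
{
  echo "INSERT INTO json_parts (ronumber, json_data) VALUES"
  echo "  ('seed', '{}')"
} > "$PSQLSCRIPT"

# Same core idea as append_each_split_file_to_etl_load_script:
# one \set per file (psql evaluates the backticked cat when the
# script runs), then a VALUES row that references the variable.
for filetoload in "$workdir"/10*; do
  ronumber="$(basename "$filetoload")"
  if [ -f "$filetoload" ]; then
    echo "\\set invoice \`cat '$filetoload'\`"
    echo ", ('$ronumber', :'invoice')"
  fi >> "$PSQLSCRIPT"
done

echo ";" >> "$PSQLSCRIPT"

# The result is a plain psql script you would feed to
# "psql --file $PSQLSCRIPT"; here we just print it.
cat "$PSQLSCRIPT"
```

Note the file contents are never read by the generator itself; the `cat` runs inside psql via its backtick evaluation, so the generated script stays small even when the json files are large.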
