| From: | Anto Aravinth <anto(dot)aravinth(dot)cse(at)gmail(dot)com> | 
|---|---|
| To: | Tim Cross <theophilusx(at)gmail(dot)com> | 
| Cc: | Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>, "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org> | 
| Subject: | Re: Using COPY to import large xml file | 
| Date: | 2018-06-25 14:25:25 | 
| Message-ID: | CANtp6R+4mm9HmrcVbSCCE_k3gXeCgnbuoL7UjibZoZJwoehrcg@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-general | 
Thanks a lot. But I do got lot of challenges! Looks like SO data contains
lot of tabs within itself.. So tabs delimiter didn't work for me. I thought
I can give a special demiliter but looks like Postrgesql copy allow only
one character as delimiter :(
Sad, I guess only way is to insert or do a through serialization of my data
into something that COPY can understand.
On Mon, Jun 25, 2018 at 8:09 AM, Tim Cross <theophilusx(at)gmail(dot)com> wrote:
>
>
> On Mon, 25 Jun 2018 at 11:38, Anto Aravinth <anto(dot)aravinth(dot)cse(at)gmail(dot)com>
> wrote:
>
>>
>>
>> On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross <theophilusx(at)gmail(dot)com> wrote:
>>
>>>
>>> Anto Aravinth <anto(dot)aravinth(dot)cse(at)gmail(dot)com> writes:
>>>
>>> > Thanks for the response. I'm not sure, how long does this tool takes
>>> for
>>> > the 70GB data.
>>> >
>>> > I used node to stream the xml files into inserts.. which was very
>>> slow..
>>> > Actually the xml contains 40 million records, out of which 10Million
>>> took
>>> > around 2 hrs using nodejs. Hence, I thought will use COPY command, as
>>> > suggested on the internet.
>>> >
>>> > Definitely, will try the code and let you know.. But looks like it
>>> uses the
>>> > same INSERT, not copy.. interesting if it runs quick on my machine.
>>> >
>>> > On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <
>>> adrien(dot)nayrat(at)anayrat(dot)info>
>>> > wrote:
>>> >
>>> >> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>>> >> > Hello Everyone,
>>> >> >
>>> >> > I have downloaded the Stackoverflow posts xml (contains all SO
>>> questions
>>> >> till
>>> >> > date).. the file is around 70GB.. I wanna import the data in those
>>> xml
>>> >> to my
>>> >> > table.. is there a way to do so in postgres?
>>> >> >
>>> >> >
>>> >> > Thanks,
>>> >> > Anto.
>>> >>
>>> >> Hello Anto,
>>> >>
>>> >> I used this tool :
>>> >> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>>> >>
>>>
>>> If you are using nodejs, then you can easily use the pg-copy-streams
>>> module to insert the records into your database. I've been using this
>>> for inserting large numbers of records from NetCDF files. Takes between
>>> 40 to 50 minutes to insert 60 Million+ records and we are doing
>>> additional calculations on the values, not just inserting them,
>>> plus we are inserting into a database over the network and into a
>>> database which is
>>> also performing other processing.
>>>
>>> We found a significant speed improvement with COPY over blocks of insert
>>> transactions, which was faster than just individual inserts. The only
>>> downside with using COPY is that it either completely works or
>>> completely fails and when it fails, it can be tricky to work out which
>>> record is causing the failure. A benefit of using blocks of transactions
>>> is that you have more fine grained control, allowing you to recover from
>>> some errors or providing more specific detail regarding the cause of the
>>> error.
>>>
>>
>> Sure, let me try that.. I have a question here, COPY usually works when
>> you move data from files to your postgres instance, right? Now in node.js,
>> processing the whole file, can I use COPY
>> programmatically like COPY Stackoverflow <calculated value at run time>?
>> Because from doc:
>>
>> https://www.postgresql.org/docs/9.2/static/sql-copy.html
>>
>> I don't see its possible. May be I need to convert the files to copy
>> understandable first?
>>
>> Anto.
>>
>>>
>>>
>>
> Yes. Essentially what you do is create a stream and feed whatever
> information you want to copy into that stream. PG sees the. data as if it
> was seeing each line in a file, so you push data onto the stream wherre
> each item is seperated by a tab (or whatever). Here is the basic low level
> function I use (Don't know how the formatting will go!)
>
> async function copyInsert(sql, stringifyFN, records) {
>   const logName = `${moduleName}.copyInsert`;
>   var client;
>
>   assert.ok(Array.isArray(records), "The records arg must be an array");
>   assert.ok(typeof(stringifyFN) === "function", "The stringifyFN arg must
> be a function");
>
>   return getClient()
>     .then(c => {
>       client = c;
>       return new Promise(function(resolve, reject) {
>         var stream, rs;
>         var idx = 0;
>
>         function done() {
>           releaseClient(client);
>           client = undefined;
>           resolve(idx + 1);
>         }
>
>         function onError(err) {
>           if (client !== undefined) {
>             releaseClient(client);
>           }
>           reject(new VError(err, `${logName}: COPY failed at record
> ${idx}`));
>         }
>
>         function arrayRead() {
>           if (idx === records.length) {
>             rs.push(null);
>           } else {
>             let rec = records[idx];
>             rs.push(stringifyFN(rec));
>             idx += 1;
>           }
>         }
>
>         rs = new Readable;
>         rs._read = arrayRead;
>         rs.on("error", onError);
>         stream = client.query(copyFrom(sql));
>         stream.on("error", onError);
>         stream.on("end", done);
>         rs.pipe(stream);
>       });
>     })
>     .catch(err => {
>       throw new VError(err, `${logName} Failed COPY insert`);
>     });
> }
>
> and I will call it like
>
> copyInsert(sql, stringifyClimateRecord, records)
>
> where sql and stringifycomateRecord arguments are
>
>   const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
>         + "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
>         + "vprp_09,vprp_15,wind_speed) FROM STDIN";
>
>   function stringifyClimateRecord(rec) {
>     return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
>       + `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${
> rec[9]}\n`;
>   }
>
> The stringifyClimateRecord returns a record to be inserted as a 'line'
> into the stream with values separated by tabs. Records is an array of data
> records where each record is an array.
>
>
> --
> regards,
>
> Tim
>
> --
> Tim Cross
>
>
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Nicolas Paris | 2018-06-25 14:50:16 | Re: Using COPY to import large xml file | 
| Previous Message | Bruce Momjian | 2018-06-25 14:22:33 | Re: PostgreSQL : encryption with pgcrypto |