Re: Using COPY to import large xml file

From: Tim Cross <theophilusx(at)gmail(dot)com>
To: Anto Aravinth <anto(dot)aravinth(dot)cse(at)gmail(dot)com>
Cc: Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>, "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Using COPY to import large xml file
Date: 2018-06-25 02:39:16
Message-ID: CAC=50j-RRKf-pTnbAYwcKJi90yYZ9T=L+JyNhtJKuZjDY1m=ow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, 25 Jun 2018 at 11:38, Anto Aravinth <anto(dot)aravinth(dot)cse(at)gmail(dot)com>
wrote:

>
>
> On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross <theophilusx(at)gmail(dot)com> wrote:
>
>>
>> Anto Aravinth <anto(dot)aravinth(dot)cse(at)gmail(dot)com> writes:
>>
>> > Thanks for the response. I'm not sure, how long does this tool takes for
>> > the 70GB data.
>> >
>> > I used node to stream the xml files into inserts.. which was very slow..
>> > Actually the xml contains 40 million records, out of which 10Million
>> took
>> > around 2 hrs using nodejs. Hence, I thought will use COPY command, as
>> > suggested on the internet.
>> >
>> > Definitely, will try the code and let you know.. But looks like it uses
>> the
>> > same INSERT, not copy.. interesting if it runs quick on my machine.
>> >
>> > On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <
>> adrien(dot)nayrat(at)anayrat(dot)info>
>> > wrote:
>> >
>> >> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>> >> > Hello Everyone,
>> >> >
>> >> > I have downloaded the Stackoverflow posts xml (contains all SO
>> questions
>> >> till
>> >> > date).. the file is around 70GB.. I wanna import the data in those
>> xml
>> >> to my
>> >> > table.. is there a way to do so in postgres?
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Anto.
>> >>
>> >> Hello Anto,
>> >>
>> >> I used this tool :
>> >> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>> >>
>>
>> If you are using nodejs, then you can easily use the pg-copy-streams
>> module to insert the records into your database. I've been using this
>> for inserting large numbers of records from NetCDF files. Takes between
>> 40 to 50 minutes to insert 60 Million+ records and we are doing
>> additional calculations on the values, not just inserting them,
>> plus we are inserting into a database over the network and into a
>> database which is
>> also performing other processing.
>>
>> We found a significant speed improvement with COPY over blocks of insert
>> transactions, which was faster than just individual inserts. The only
>> downside with using COPY is that it either completely works or
>> completely fails and when it fails, it can be tricky to work out which
>> record is causing the failure. A benefit of using blocks of transactions
>> is that you have more fine grained control, allowing you to recover from
>> some errors or providing more specific detail regarding the cause of the
>> error.
>>
>
> Sure, let me try that.. I have a question here, COPY usually works when
> you move data from files to your postgres instance, right? Now in node.js,
> processing the whole file, can I use COPY
> programmatically like COPY Stackoverflow <calculated value at run time>?
> Because from doc:
>
> https://www.postgresql.org/docs/9.2/static/sql-copy.html
>
> I don't see its possible. May be I need to convert the files to copy
> understandable first?
>
> Anto.
>
>>
>>
>
Yes. Essentially what you do is create a stream and feed whatever
information you want to copy into that stream. PG sees the. data as if it
was seeing each line in a file, so you push data onto the stream wherre
each item is seperated by a tab (or whatever). Here is the basic low level
function I use (Don't know how the formatting will go!)

async function copyInsert(sql, stringifyFN, records) {
const logName = `${moduleName}.copyInsert`;
var client;

assert.ok(Array.isArray(records), "The records arg must be an array");
assert.ok(typeof(stringifyFN) === "function", "The stringifyFN arg must
be a function");

return getClient()
.then(c => {
client = c;
return new Promise(function(resolve, reject) {
var stream, rs;
var idx = 0;

function done() {
releaseClient(client);
client = undefined;
resolve(idx + 1);
}

function onError(err) {
if (client !== undefined) {
releaseClient(client);
}
reject(new VError(err, `${logName}: COPY failed at record
${idx}`));
}

function arrayRead() {
if (idx === records.length) {
rs.push(null);
} else {
let rec = records[idx];
rs.push(stringifyFN(rec));
idx += 1;
}
}

rs = new Readable;
rs._read = arrayRead;
rs.on("error", onError);
stream = client.query(copyFrom(sql));
stream.on("error", onError);
stream.on("end", done);
rs.pipe(stream);
});
})
.catch(err => {
throw new VError(err, `${logName} Failed COPY insert`);
});
}

and I will call it like

copyInsert(sql, stringifyClimateRecord, records)

where sql and stringifycomateRecord arguments are

const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
+ "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
+ "vprp_09,vprp_15,wind_speed) FROM STDIN";

function stringifyClimateRecord(rec) {
return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
+
`${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
}

The stringifyClimateRecord returns a record to be inserted as a 'line' into
the stream with values separated by tabs. Records is an array of data
records where each record is an array.

--
regards,

Tim

--
Tim Cross

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Christoph Moench-Tegeder 2018-06-25 06:10:45 Re: Using COPY to import large xml file
Previous Message Anto Aravinth 2018-06-25 01:38:52 Re: Using COPY to import large xml file