From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Anto Aravinth <anto(dot)aravinth(dot)cse(at)gmail(dot)com>, Tim Cross <theophilusx(at)gmail(dot)com>
Cc: Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>, "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Using COPY to import large xml file
Date: 2018-06-25 19:36:09
Message-ID: d3c834a0-edb0-4363-d882-e36422ebcdc9@aklaver.com
Lists: pgsql-general
On 06/25/2018 07:25 AM, Anto Aravinth wrote:
> Thanks a lot. But I ran into a lot of challenges! It looks like the SO data
> contains a lot of tabs within itself, so a tab delimiter didn't work for
> me. I thought I could give a special delimiter, but it looks like PostgreSQL's
> COPY allows only one character as the delimiter :(
I use | as it is rarely found in the data itself.
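For example (an untested sketch, with made-up table and column names), the
delimiter is given in the COPY statement itself, e.g. the sql string handed
to pg-copy-streams later in this thread:

const sql = "COPY posts (id, title, body) FROM STDIN WITH (DELIMITER '|')";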
>
> Sad, I guess the only way is to insert, or do a thorough serialization of
> my data into something that COPY can understand.
>
> On Mon, Jun 25, 2018 at 8:09 AM, Tim Cross <theophilusx(at)gmail(dot)com> wrote:
>
>
>
> On Mon, 25 Jun 2018 at 11:38, Anto Aravinth
> <anto(dot)aravinth(dot)cse(at)gmail(dot)com> wrote:
>
>
>
> On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross
> <theophilusx(at)gmail(dot)com> wrote:
>
>
> Anto Aravinth <anto(dot)aravinth(dot)cse(at)gmail(dot)com> writes:
>
> > Thanks for the response. I'm not sure how long this tool takes for
> > the 70GB data.
> >
> > I used node to stream the xml files into inserts, which was very slow.
> > The xml actually contains 40 million records, of which 10 million took
> > around 2 hrs using nodejs. Hence, I thought I would use the COPY command,
> > as suggested on the internet.
> >
> > Definitely will try the code and let you know. But it looks like it uses
> > the same INSERT, not COPY.. interesting to see if it runs quickly on my machine.
> >
> > On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>
> > wrote:
> >
> >> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
> >> > Hello Everyone,
> >> >
> >> > I have downloaded the Stackoverflow posts xml (contains all SO
> >> > questions till date). The file is around 70GB. I want to import the
> >> > data in that xml into my table. Is there a way to do so in postgres?
> >> >
> >> >
> >> > Thanks,
> >> > Anto.
> >>
> >> Hello Anto,
> >>
> >> I used this tool:
> >> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
> >>
>
> If you are using nodejs, then you can easily use the pg-copy-streams
> module to insert the records into your database. I've been using this
> for inserting large numbers of records from NetCDF files. It takes
> between 40 and 50 minutes to insert 60 million+ records, and we are
> doing additional calculations on the values, not just inserting them;
> plus we are inserting over the network, into a database which is also
> performing other processing.
>
> We found a significant speed improvement with COPY over blocks of
> insert transactions, which in turn was faster than individual inserts.
> The only downside with using COPY is that it either completely works
> or completely fails, and when it fails, it can be tricky to work out
> which record is causing the failure. A benefit of using blocks of
> transactions is that you have finer-grained control, allowing you to
> recover from some errors or provide more specific detail regarding
> the cause of the error.
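>
> As a rough, untested sketch of the blocks-of-transactions approach
> (made-up table and column names, using the stock pg client API):
>
> async function insertBatch(client, records) {
>   try {
>     await client.query("BEGIN");
>     for (const rec of records) {
>       await client.query(
>         "INSERT INTO posts (id, title, body) VALUES ($1, $2, $3)",
>         rec);
>     }
>     await client.query("COMMIT");
>   } catch (err) {
>     await client.query("ROLLBACK");
>     // unlike COPY, the failing batch (and err) are known here, so you
>     // can retry or log at a finer grain
>     throw err;
>   }
> }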
>
>
> Sure, let me try that. I have a question here: COPY usually
> works when you move data from files to your postgres instance,
> right? Now in node.js, processing the whole file, can I use COPY
> programmatically, like COPY Stackoverflow <calculated value at
> run time>? Because from the doc:
>
> https://www.postgresql.org/docs/9.2/static/sql-copy.html
>
> I don't see that it's possible. Maybe I need to convert the files
> into something COPY understands first?
>
> Anto.
>
>
>
>
> Yes. Essentially what you do is create a stream and feed whatever
> information you want to copy into that stream. PG sees the data as
> if it was reading each line from a file, so you push data onto the
> stream where each item is separated by a tab (or whatever). Here is
> the basic low-level function I use (don't know how the formatting
> will go!):
>
> // Assumed imports (not shown in the original message):
> const assert = require("assert");
> const { Readable } = require("stream");
> const copyFrom = require("pg-copy-streams").from;
> const VError = require("verror");
> // getClient()/releaseClient() are local connection-pool helpers and
> // moduleName is a local constant used in log messages.
>
> async function copyInsert(sql, stringifyFN, records) {
>   const logName = `${moduleName}.copyInsert`;
>   var client;
>
>   assert.ok(Array.isArray(records), "The records arg must be an array");
>   assert.ok(typeof stringifyFN === "function",
>             "The stringifyFN arg must be a function");
>   return getClient()
>     .then(c => {
>       client = c;
>       return new Promise(function(resolve, reject) {
>         var stream, rs;
>         var idx = 0;
>
>         function done() {
>           releaseClient(client);
>           client = undefined;
>           resolve(idx + 1);
>         }
>
>         function onError(err) {
>           if (client !== undefined) {
>             releaseClient(client);
>           }
>           reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
>         }
>
>         // Feed one stringified record per _read call; push(null) ends
>         // the stream once all records have been consumed.
>         function arrayRead() {
>           if (idx === records.length) {
>             rs.push(null);
>           } else {
>             let rec = records[idx];
>             rs.push(stringifyFN(rec));
>             idx += 1;
>           }
>         }
>
>         rs = new Readable();
>         rs._read = arrayRead;
>         rs.on("error", onError);
>         // client.query(copyFrom(...)) returns a writable stream that
>         // implements COPY ... FROM STDIN.
>         stream = client.query(copyFrom(sql));
>         stream.on("error", onError);
>         stream.on("end", done);
>         rs.pipe(stream);
>       });
>     })
>     .catch(err => {
>       throw new VError(err, `${logName} Failed COPY insert`);
>     });
> }
>
> and I will call it like
>
> copyInsert(sql, stringifyClimateRecord, records)
>
> where the sql and stringifyClimateRecord arguments are
>
> const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
>       + "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
>       + "vprp_09,vprp_15,wind_speed) FROM STDIN";
>
> function stringifyClimateRecord(rec) {
>   return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
>     + `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
> }
>
> The stringifyClimateRecord function returns a record to be inserted as
> a 'line' into the stream, with values separated by tabs. records is an
> array of data records, where each record is itself an array.
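>
> For example (made-up values, and assuming moment.js for the date,
> which the .format() call above suggests), records might look like:
>
> const moment = require("moment");
> // index 1 is skipped by stringifyClimateRecord, hence the null
> const records = [
>   [moment("2018-06-01"), null, 1234, 0.0, 10.5, 24.1, 18.3, 9.1, 14.2, 3.7]
> ];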
>
>
> --
> regards,
>
> Tim
>
> --
> Tim Cross
>
>
--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com