Re: large xml database

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: Andreas Joseph Krogh <andreak(at)officenet(dot)no>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: large xml database
Date: 2010-10-30 22:42:48
Message-ID: 4CCC9F68.9050405@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Andreas Joseph Krogh wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> On 10/30/2010 11:49 PM, Viktor Bojović wrote:
>
>> Hi,
>> i have very big XML documment which is larger than 50GB and want to import
>> it into databse, and transform it to relational schema.
>> When splitting this documment to smaller independent xml documments i get
>> ~11.1mil XML documents.
>> I have spent lots of time trying to get fastest way to transform all this
>> data but every time i give up because it takes too much time. Sometimes more
>> than month it would take if not stopped.
>> I have tried to insert each line as varchar into database and parse it using
>> plperl regex..
>> also i have tried to store every documment as XML and parse it, but it is
>> also to slow.
>> i have tried to store every documment as varchar but it is also slow when
>> using regex to get data.
>>
>> many tries have failed because 8GB of ram and 10gb of swap were not enough.
>> also sometimes i get that more than 2^32 operations were performed, and
>> functions stopped to work.
>>
>> i wanted just to ask if someone knows how to speed this up.
>>
>> thanx in advance
>>
>
> Use a SAX-parser and handle the endElement(String name) events to insert
> the element's content into your db.
>
>
If you still have the 11 million subfiles, I would start there, sax
parse as above and maybe make make csv files, then load those with bulk
as begin/end transaction on each data item discovered will hurt.

Can the subfiles be segregated into specific data types, or at least
holder of specific data types such that they releate to a specific
subset of your new db/schema? This will play into what get's loaded
first and who depends on whom w.r.t. foreign keys etc.

You can parallelize marginally with multiple threads (hoping to split
file read from sax paring from element construction from save/network)
but more boxes would be the way to go. Partitioning remains a problem.

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message James Cloos 2010-10-31 00:26:45 Re: large xml database
Previous Message Andreas Joseph Krogh 2010-10-30 22:06:25 Re: large xml database