From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Anto Aravinth <anto(dot)aravinth(dot)cse(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Using COPY to import large xml file
Date: 2018-06-24 16:57:58
Message-ID: 3b83c126-c9f6-0d1e-fecc-13fc0c8e59ec@aklaver.com
Lists: pgsql-general
On 06/24/2018 08:25 AM, Anto Aravinth wrote:
> Hello Everyone,
>
> I have downloaded the Stack Overflow posts XML (contains all SO questions
> to date). The file is around 70GB. I want to import the data from that
> XML into my table. Is there a way to do so in Postgres?
It is going to require some work. You will need to deal with:
1) The row schema inside the XML. It is documented under **posts**.xml here:
https://ia800107.us.archive.org/27/items/stackexchange/readme.txt
2) The rows themselves, which are wrapped inside a <posts> tag.
Seems to me you have two options:
1) Drop each row into a single XML column and extract the row components
inside the database.
2) Break each row down into its column components before entering them
into the database.
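To illustrate option 2, here is a rough sketch that streams the <row> elements out of the dump and emits COPY-ready tab-separated lines. It is untested against the real 70GB file; the column list is only a subset of the actual Posts schema (see the readme) and the escaping is minimal, so treat it as a starting point rather than a finished loader:

```python
import io
import xml.etree.ElementTree as ET

# Subset of the Posts row attributes; see the readme for the full schema.
COLUMNS = ["Id", "PostTypeId", "Score", "Title"]
NULL = r"\N"  # COPY's default NULL marker in text format

def rows_as_tsv(source):
    """Yield one COPY-ready tab-separated line per <row> element."""
    # iterparse streams the document instead of loading it whole,
    # which is what makes a 70GB input feasible.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            vals = []
            for col in COLUMNS:
                v = elem.attrib.get(col)
                if v is None:
                    vals.append(NULL)
                else:
                    # Minimal escaping for COPY's text format.
                    vals.append(v.replace("\\", "\\\\")
                                 .replace("\t", "\\t")
                                 .replace("\n", "\\n"))
            yield "\t".join(vals) + "\n"
            # Clear each element so memory stays flat as parsing proceeds.
            elem.clear()

# Tiny demo on an in-memory fragment shaped like Posts.xml.
sample = io.StringIO(
    '<posts>'
    '<row Id="1" PostTypeId="1" Score="5" Title="Hi"/>'
    '<row Id="2" PostTypeId="2" Score="3"/>'
    '</posts>')
lines = list(rows_as_tsv(sample))  # second row has no Title attribute
```

You could then pipe the output of such a script into psql, e.g. \copy posts from stdin, against a table whose columns match.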
Adrien has pointed you at a Python program that covers the above:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres
If you are comfortable in Python, you can take a look at:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/row_processor.py
to see how the rows are broken down into their elements.
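One wrinkle if you drive COPY from Python: psycopg2's copy_expert() expects a file-like object, not a generator. A small adapter like the sketch below bridges the two (names are hypothetical; the COPY statement in the comment assumes you have a matching posts table):

```python
import io

class IterStream(io.RawIOBase):
    """Wrap an iterator of str lines as a readable binary stream,
    which is the shape psycopg2's copy_expert() expects."""
    def __init__(self, lines):
        self._iter = iter(lines)
        self._buf = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Pull lines from the iterator until we can fill the caller's
        # buffer (or the iterator is exhausted).
        while len(self._buf) < len(b):
            try:
                self._buf += next(self._iter).encode()
            except StopIteration:
                break
        chunk, self._buf = self._buf[:len(b)], self._buf[len(b):]
        b[:len(chunk)] = chunk
        return len(chunk)  # 0 signals EOF

# Usage sketch (requires psycopg2 and a running server; untested):
#   import psycopg2
#   conn = psycopg2.connect("dbname=so")
#   with conn, conn.cursor() as cur:
#       cur.copy_expert(
#           "COPY posts (id, posttypeid, score, title) FROM STDIN",
#           IterStream(line_generator))   # lines in COPY text format
```

This way the whole file never has to be materialized as an intermediate CSV on disk; rows flow from the XML parser straight into COPY.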
I would try this out first on one of the smaller datasets found here:
https://archive.org/details/stackexchange
I personally took a look at:
https://archive.org/download/stackexchange/beer.stackexchange.com.7z
because why not?
>
>
> Thanks,
> Anto.
--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com
Next message: Christoph Moench-Tegeder | 2018-06-24 17:45:44 | Re: Using COPY to import large xml file
Previous message: Adrien Nayrat | 2018-06-24 16:15:33 | Re: Using COPY to import large xml file