From: | Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com> |
---|---|
To: | Johann Spies <johann(dot)spies(at)gmail(dot)com> |
Cc: | pgsql-general <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: xml-file as foreign table? |
Date: | 2016-04-29 05:01:18 |
Message-ID: | CAG6W84KcdJaDjJ_CDRnvTQc2JBdSzHJQQgbQ8fyqaLeBOvaLyQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Apr 28, 2016 14:33, "Johann Spies" <johann(dot)spies(at)gmail(dot)com> wrote:
>
> I have several large (7GB+) xml files to get into an SQL database.
>
> The xml-files can contain up to 500 000 subrecords which I want to be
able to query in the database.
>
> They are too large to do something like this:
>
>
> insert into rawxml (xml) select XMLPARSE (DOCUMENT
CONVERT_FROM(PG_READ_BINARY_FILE('FOO.xml' ), 'UTF8'));
>
> If it were possible, each file would be one huge record in the table
which can then be unpacked using XPATH.
>
>
> The options I am considering is :
>
> 1. Unpack the individual records (will be more than 50 million) using
something like python with lxml and psycopg2 and insert them after dropping
all indexes and triggers on the table
>
> 2. Unpack the individual records and write a (very) large tsv-file and
then insert it using 'copy'
>
The fastest way I found is to combine these two. Using iterparse from lxml
combined with load_rows and COPY from py-postgresql:
http://python.projects.pgfoundry.org/docs/1.1/driver.html#copy-statements
That way you can stream the data.
> It would be convenient If I could use the present xml files as 'foreign
tables' and parse them using the xpath-capabilities of PostgreSQL.
>
> Is this possible?
>
There is a multicorn fdw for that:
https://github.com/Kozea/Multicorn/blob/master/
<https://github.com/Kozea/Multicorn/blob/master/python/multicorn/xmlfdw.py>
python
<https://github.com/Kozea/Multicorn/blob/master/python/multicorn/xmlfdw.py>/
<https://github.com/Kozea/Multicorn/blob/master/python/multicorn/xmlfdw.py>
multicorn
<https://github.com/Kozea/Multicorn/blob/master/python/multicorn/xmlfdw.py>/
<https://github.com/Kozea/Multicorn/blob/master/python/multicorn/xmlfdw.py>
xmlfdw.py
<https://github.com/Kozea/Multicorn/blob/master/python/multicorn/xmlfdw.py>
But I never tried it. It looks like it loads all rows in a python list.
Groeten, Arjen
From | Date | Subject | |
---|---|---|---|
Next Message | Yogesh Sharma | 2016-04-29 10:26:59 | Re: Issue during postgresql startup |
Previous Message | Tim van der Linden | 2016-04-29 00:19:11 | Re: Slow join over three tables |