Re: Slow trigger on insert: alternative solutions?

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Guillaume Drolet <droletguillaume(at)gmail(dot)com>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Slow trigger on insert: alternative solutions?
Date: 2015-03-30 21:05:00
Message-ID: 5519BA7C.8030000@aklaver.com
Lists: pgsql-general

On 03/30/2015 01:43 PM, Guillaume Drolet wrote:
> Thanks Adrian,
>
> In option 2, when you say "this is automated in an external Python
> script", do you mean that you use something like psycopg2 to perform
> the queries on the database (e.g. for comparing data in the holding
> table with the older table)?

Yes. Basically I use the dbf module I mentioned previously to read the
DBF files, output the data I need, and write it into an in-memory CSV
file, which I then dump into the Postgres database using the psycopg2
COPY functions. The files I am working with are not as large as the ones
you mention, so doing all this in memory is workable. The script then
calls a series of user functions in Postgres to do the comparing and
manipulating. At the time I did this, plpythonu was less featured than
it is now, so doing what I wanted made more sense in an external script.
Also, the script pulls the DBF files from elsewhere and I felt more
comfortable doing that outside the database than in.
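
In rough outline the script does something like the sketch below. The
table and column names here are invented and the dbf calls are from
memory, so treat it as an outline rather than working code:

import csv
import io

import dbf        # https://pypi.python.org/pypi/dbf
import psycopg2

# Read the DBF file and write the fields of interest to an in-memory CSV.
table = dbf.Table('naipf.dbf')
table.open()
buf = io.StringIO()
writer = csv.writer(buf)
for record in table:
    writer.writerow([record.id_pet_mes, record.etage, record.type_couv,
                     record.densite, record.hauteur, record.cl_age])
table.close()
buf.seek(0)

# COPY the CSV straight into a holding table, then let server-side
# functions do the comparing in bulk (merge_naipf() is just a
# placeholder name for such a function).
conn = psycopg2.connect(dbname='mydb')
cur = conn.cursor()
cur.copy_expert("COPY holding_naipf FROM STDIN WITH (FORMAT csv)", buf)
cur.execute("SELECT merge_naipf()")
conn.commit()
conn.close()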

Though more and more I seem to be using pandas (pandas.pydata.org) to do
data conversions; it saves a lot of the steps above. In this case you
would still need to get the data out of the DBF files.
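
With pandas the intermediate steps collapse to something like the
following (again only a sketch; the connection URL, table name, and the
dbf field_names/iteration details are assumptions):

import dbf
import pandas as pd
from sqlalchemy import create_engine

# Pull the DBF records into a DataFrame, then let pandas handle the load.
table = dbf.Table('naipf.dbf')
table.open()
df = pd.DataFrame([dict(zip(table.field_names, record)) for record in table])
table.close()

engine = create_engine('postgresql://user@localhost/mydb')
df.to_sql('holding_naipf', engine, if_exists='replace', index=False)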

>
> Thanks.
>
> 2015-03-30 9:53 GMT-04:00 Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>:
>
> On 03/30/2015 06:04 AM, Guillaume Drolet wrote:
>
> Hello,
>
> I need your help speeding up the procedure I will explain below. I am
> looking for improvements to my method or different approaches/ideas
> that would help in this matter.
>
> I have a set of DBF files that I load into my database using a plpython
> function and a call to ogr2ogr (http://www.gdal.org/drv_pg.html). Once
> in a while, I'll have to load updated versions of these tables to get
> the latest additions and possible corrections to older versions.
>
> In my plpython script, if a table is loaded for the first time, I first
> load it empty, then I create a trigger function on insert (execute on
> row) that will check for duplicates on each insert. Depending on the
> type of data I load, my trigger first checks for equality in a subset
> of columns (between 1 and 3 columns that would be like my primary
> key(s)) and if true, I check if all columns are equal between NEW and
> the matching row from my table. When this condition is true, I return
> NULL; else I store the rows (i.e. NEW.* and the matching row(s)) in a
> new table called "duplicate" for further manual investigation. Here's
> an example for one table:
>
> CREATE OR REPLACE FUNCTION check_naipf_insert()
>   RETURNS trigger AS
> ' BEGIN
>     IF EXISTS (SELECT 1
>                FROM public.naipf
>                WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
>                  AND etage IS NOT DISTINCT FROM NEW.etage) THEN
>       IF EXISTS (SELECT 1
>                  FROM public.naipf
>                  WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
>                    AND etage IS NOT DISTINCT FROM NEW.etage
>                    AND type_couv IS NOT DISTINCT FROM NEW.type_couv
>                    AND densite IS NOT DISTINCT FROM NEW.densite
>                    AND hauteur IS NOT DISTINCT FROM NEW.hauteur
>                    AND cl_age IS NOT DISTINCT FROM NEW.cl_age) THEN
>         RETURN NULL;
>       ELSE
>         INSERT INTO public.duplic_naipf SELECT NEW.*;
>         INSERT INTO public.duplic_naipf
>           (SELECT * FROM public.naipf
>            WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
>              AND etage IS NOT DISTINCT FROM NEW.etage);
>         RETURN NULL;
>       END IF;
>     END IF;
>     RETURN NEW;
> END; '
> LANGUAGE plpgsql VOLATILE COST 100;
>
> CREATE TRIGGER check_insert_naipf
> BEFORE INSERT
> ON public.pet4_naipf
> FOR EACH ROW
> EXECUTE PROCEDURE check_naipf_insert();
>
> (In this case, duplicate rows that need investigation are rows that may
> have changed relative to an older version of the DBF file but have no
> change in what I call their primary keys; they are not really PKs since
> I don't want to raise errors at loading.)
>
> Once this is done, ogr2ogr is called a second time to load the data. It
> is quite fast for small tables (tens of thousands of rows, tens of
> columns) but for large tables it takes forever. For example, I started
> loading a table with 3.5 million rows/33 columns last Friday at 3PM and
> now, Monday morning at 9AM, some 3 million rows have been loaded.
>
> My question is: what other approaches would make this procedure
> faster? How is this kind of task usually implemented in PostgreSQL?
> Would it be better to load everything with no check and then apply
> some functions to find duplicate rows (although this would involve
> more manual work)?
>
>
> I guess it depends on what the end purpose of the above is. If you are
> just trying to keep relatively up-to-date information from the DBF
> sources, would it not be easier just to load them into a new table?
>
> So, where the existing table is some_dbf_data:
>
> 1) CREATE TABLE new_some_dbf_data(...)
> 2) Dump the DBF file into new_some_dbf_data
> 3) In a transaction, rename/drop some_dbf_data and rename
> new_some_dbf_data to some_dbf_data
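
To make step 3) concrete, the whole swap can be done from the external
script in a single transaction, so readers keep seeing the old table
right up to the commit (table names are just placeholders):

import psycopg2

conn = psycopg2.connect(dbname='mydb')
cur = conn.cursor()
# psycopg2 opens a transaction implicitly, and DDL is transactional in
# Postgres, so the three statements below are applied atomically.
cur.execute("DROP TABLE IF EXISTS some_dbf_data_old")
cur.execute("ALTER TABLE some_dbf_data RENAME TO some_dbf_data_old")
cur.execute("ALTER TABLE new_some_dbf_data RENAME TO some_dbf_data")
conn.commit()
conn.close()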
>
>
> Option 2 is what I do for a similar procedure:
>
> 1) Dump the DBF data into a holding table.
> 2) Use SQL in function(s) to compare the old/new tables and make the
> appropriate adjustments. Doing SQL in bulk is a lot faster than
> checking each row, or at least that is what I found. In any case, the
> way you are doing it looks to involve 3.5 million inserts with a
> trigger action on each, which is bound to be slow :)
> 3) This is automated in an external Python script.
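
The comparison in 2) then becomes a couple of set-based statements
against the holding table instead of a per-row trigger. Something along
these lines, run from the script; the column names are lifted from your
trigger and it assumes the key columns are never NULL, so take it as a
sketch only:

import psycopg2

# Rows whose key already exists but whose other columns differ go to the
# duplicates table for manual inspection; keys not present yet are
# inserted; exact matches fall through and are simply ignored.
sql = """
INSERT INTO duplic_naipf
SELECT h.*
FROM holding_naipf h
JOIN naipf n ON n.id_pet_mes = h.id_pet_mes AND n.etage = h.etage
WHERE (h.type_couv, h.densite, h.hauteur, h.cl_age)
      IS DISTINCT FROM (n.type_couv, n.densite, n.hauteur, n.cl_age);

INSERT INTO naipf
SELECT h.*
FROM holding_naipf h
WHERE NOT EXISTS (SELECT 1 FROM naipf n
                  WHERE n.id_pet_mes = h.id_pet_mes
                    AND n.etage = h.etage);
"""

conn = psycopg2.connect(dbname='mydb')
with conn, conn.cursor() as cur:
    cur.execute(sql)
conn.close()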
>
> Option 3
>
> Use dbf (https://pypi.python.org/pypi/dbf/0.88.16) and do the
> comparisons in the DBF files outside Postgres and only import what has
> changed.
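
A rough outline of that comparison; the key columns are assumptions and
the dbf record-access details are from memory:

import dbf

def keyed(path, key_fields=('id_pet_mes', 'etage')):
    # Map key tuple -> full record tuple for one DBF file.
    table = dbf.Table(path)
    table.open()
    rows = {tuple(getattr(rec, f) for f in key_fields): tuple(rec)
            for rec in table}
    table.close()
    return rows

old = keyed('naipf_old.dbf')
new = keyed('naipf_new.dbf')

# Only rows that are new or have changed ever need to touch Postgres.
changed = {k: v for k, v in new.items() if old.get(k) != v}
print(len(changed), "rows to import")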
>
>
>
>
>
> Thanks a lot for your help!
>
> --
> Adrian Klaver
> adrian(dot)klaver(at)aklaver(dot)com

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com
