From: Leandro Guimarães <leo(dot)guimaraes(at)gmail(dot)com>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: Tim Cross <theophilusx(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-general(at)postgresql(dot)org
Subject: Re: Copy Bulk Ignore Duplicated
Date: 2019-06-17 15:14:14
Message-ID: CAJV35FN428XGBeNYUf_y5YnydTG4mW2Chk1w9_fLUMd-tJaKHA@mail.gmail.com
Lists: pgsql-general
Hi Adrian,
You are right, these fields are in CHECK CONSTRAINTS and they are not
formally defined as Primary Keys.
Thanks!
Leandro Guimarães
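
(For context, making those four columns a formal key would look roughly
like the sketch below. The column names come from the structure quoted
further down, "final_table" and the index name are stand-ins, and on
9.4's inheritance-style partitioning the index has to be created on each
child partition rather than once on the parent.)

  -- Sketch only: declare the natural key so duplicates become detectable.
  -- On PostgreSQL 9.4 this must be repeated on every child partition.
  CREATE UNIQUE INDEX final_table_natural_key
      ON final_table (customer_id, date_time, indicator_id, element_id);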
On Sat, Jun 15, 2019 at 10:45 AM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
wrote:
> On 6/14/19 7:24 PM, Leandro Guimarães wrote:
> > Hi Tim, thanks for your answer!
> >
> > The columns were just examples, but let me explain the database
> > structure, the fields in *bold are the keys*:
> >
> > *customer_id integer*
> > *date_time timestamp*
> > *indicator_id integer*
> > *element_id integer*
> > indicator_value double precision
>
> Huh, earlier you said you had a check constraint that was causing issues.
>
> Does that also exist or were you referring to the keys above?
>
> Are the keys above formally defined as the PRIMARY KEY?
> -
> >
> > The table is partitioned per day and customer_id (it works great); the
> > problem is just the duplicated key situation that I'm really worried
> about.
> >
> > I populate the database via a Java Application with JDBC.
> >
> > Maybe this info could help to provide some light!
> >
> > Thanks Again!
> >
> > Leandro Guimarães
> >
> >
> >
> > On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx(at)gmail(dot)com>
> > wrote:
> >
> >
> > Leandro Guimarães <leo(dot)guimaraes(at)gmail(dot)com> writes:
> >
> > > Hi,
> > >
> > > I have a scenario with a large table and I'm trying to insert
> > it via a
> > > COPY command with a csv file.
> > >
> > > Everything works, but sometimes my source .csv file has
> > duplicated data
> > in the previously populated table. If I add a check constraint
> > and try to
> > > run the COPY command I have an error that stops the whole
> insertion.
> > >
> > > I've tried to put the data in a tmp table and fill the main
> using
> > > distinct this way (the fields and names are just examples):
> > >
> > > INSERT INTO final_table (name, document)
> > > SELECT DISTINCT name, document
> > > FROM tmp_TABLE t1
> > > WHERE NOT EXISTS (
> > > SELECT 1 FROM final_table t2
> > > WHERE (t2.name, t2.document)
> > > IS NOT DISTINCT FROM (t1.name, t1.document))
> > >
> > > The problem is that my final_table is a large (and partitioned)
> > table and
> > > this query is taking a long time to execute.
> > >
> > > Someone have any idea (really guys anything would be great) how
> > to solve
> > this situation? I need to ignore duplicates instead of raising an
> > error.
> > >
> > > I'm using *PostgreSQL 9.4* so I can't use "ON CONFLICT" and
> > upgrade is not
> > > an option.
> > >
> >
> > Explain plan would probably shed some light, but I suspect your
> > performance is being heavily hit by the sub query. Distinct is an
> > expensive operation and you are performing it once for every
> > distinct row
> > in your temp table.
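> >
> > (A minimal way to capture that plan, using the query quoted above; note
> > that EXPLAIN ANALYZE actually executes the INSERT, so wrap it in a
> > transaction you can roll back:)
> >
> > BEGIN;
> > EXPLAIN (ANALYZE, BUFFERS)
> > INSERT INTO final_table (name, document)
> > SELECT DISTINCT name, document
> > FROM tmp_table t1
> > WHERE NOT EXISTS (
> >     SELECT 1 FROM final_table t2
> >     WHERE (t2.name, t2.document) IS NOT DISTINCT FROM (t1.name, t1.document));
> > ROLLBACK;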
> >
> > It isn't clear what the primary key is for your final table - name +
> > document seems suspicious given these seem to be the only two columns
> > you're inserting as well. You don't indicate what the data types are
> > either - if document is something like 'text' then using it in a
> > distinct clause is likely to have huge performance impact.
> >
> > The first thing I'd do is to eliminate duplicates from your temp
> table
> > as a separate statement or by pre-filtering the CSV before import. I
> > would then try something like an outer join to identify rows in your
> > temp table which don't exist in your final table and select from
> there
> > to insert into the final table. You don't really need the distinct in
> > the sub query as all you really need to know is if (name, document)
> > exists - it doesn't matter if more than one exists (for this test).
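> >
> > (A rough sketch of that two-step approach, reusing the example names
> > from the query above; "tmp_dedup" is a new scratch name. Rebuild the
> > staging data without duplicates, then load with a LEFT JOIN anti-join.
> > Untested, and it assumes name and document are NOT NULL:)
> >
> > -- 1. de-duplicate the staging data as its own step
> > CREATE TEMP TABLE tmp_dedup AS
> > SELECT DISTINCT name, document FROM tmp_table;
> >
> > -- 2. insert only rows that do not already exist in the final table
> > INSERT INTO final_table (name, document)
> > SELECT t1.name, t1.document
> > FROM tmp_dedup t1
> > LEFT JOIN final_table t2
> >   ON t2.name = t1.name AND t2.document = t1.document
> > WHERE t2.name IS NULL;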
> >
> > If you really don't have something more specific for a primary key,
> > depending on what data type 'document' is and how large it is, you
> may
> > find adding a column which is a checksum of your 'document' field a
> > useful addition. I have done this in the past where I had an
> application
> > where name was not unique and we only wanted distinct instances of
> > 'document' (document was a fairly large XML document in this case).
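> >
> > (A sketch of that checksum idea, assuming document can be cast to text
> > and using the built-in md5() function; the column and index names here
> > are made up:)
> >
> > ALTER TABLE final_table ADD COLUMN document_md5 text;
> > UPDATE final_table SET document_md5 = md5(document::text);
> > CREATE INDEX final_table_document_md5_idx ON final_table (document_md5);
> >
> > -- the existence check can then compare the short hash first, e.g.
> > --   WHERE t2.document_md5 = md5(t1.document::text)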
> >
> > --
> > Tim Cross
> >
> >
>
>
> --
> Adrian Klaver
> adrian(dot)klaver(at)aklaver(dot)com
>