Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Leandro Guimarães <leo(dot)guimaraes(at)gmail(dot)com>
Cc: Tim Cross <theophilusx(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-general(at)postgresql(dot)org
Subject: Re: Copy Bulk Ignore Duplicated
Date: 2019-06-17 15:45:47
Message-ID: c051176c-3727-5fc3-fcf2-6550abe0714f@aklaver.com
Lists: pgsql-general

On 6/17/19 8:14 AM, Leandro Guimarães wrote:
> Hi Adrian,
>
>    You are right, these fields are in CHECK CONSTRAINTS and they are
> not formally defined as Primary Keys.

Alright. Two things:

1) If you are thinking of them as keys, why not make them a PK or a
UNIQUE index?

2) It is still not clear to me whether you are looking for duplicated
information within a row or between rows.
To put it another way, what are the CHECK constraints doing?
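
If you do want the database to enforce that, a minimal sketch of 1),
using the column names from your later message (the partition and
constraint names below are only examples), would be:

-- Sketch only: the partition name and constraint name are examples.
-- A UNIQUE constraint on the parent of a 9.4 inheritance partition set
-- is not inherited by the children, so it has to be added to each
-- partition; since rows sharing these four values should land in the
-- same (day, customer_id) partition, that still catches the duplicates.
ALTER TABLE final_table_y2019m06d17_c42
    ADD CONSTRAINT final_table_y2019m06d17_c42_nat_key
    UNIQUE (customer_id, date_time, indicator_id, element_id);

Note that a plain COPY straight into the table will still abort on the
first duplicate; the constraint only guarantees you cannot end up with
two copies of the same key.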

>
> Thanks!
> Leandro Guimarães
>
>
>
> On Sat, Jun 15, 2019 at 10:45 AM Adrian Klaver
> <adrian(dot)klaver(at)aklaver(dot)com> wrote:
>
> On 6/14/19 7:24 PM, Leandro Guimarães wrote:
> > Hi Tim, thanks for your answer!
> >
> > The columns were just examples, but let me explain the database
> > structure, the fields in *bold* are the keys:
> >
> > *customer_id integer*
> > *date_time timestamp*
> > *indicator_id integer*
> > *element_id integer*
> > indicator_value double precision
>
> Huh, earlier you said you had a check constraint that was causing
> issues.
>
> Does that also exist or were you referring to the keys above?
>
> Are the keys above formally defined as the PRIMARY KEY?
> >
> > The table is partitioned by day and customer_id (it works great);
> > the problem is just the duplicated key situation that I'm really
> > worried about.
> >
> > I populate the database via a Java Application with JDBC.
> >
> > Maybe this info could help to shed some light!
> >
> > Thanks Again!
> >
> > Leandro Guimarães
> >
> >
> >
> > On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx(at)gmail(dot)com> wrote:
> >
> >
> >     Leandro Guimarães <leo(dot)guimaraes(at)gmail(dot)com> writes:
> >
> >      > Hi,
> >      >
> >      >    I have a scenario with a large table and I'm trying to
> >      > insert into it via a COPY command from a csv file.
> >      >
> >      >    Everything works, but sometimes my source .csv file has
> >      > data that is already in the previously populated table. If I
> >      > add a check constraint and try to run the COPY command I get
> >      > an error that stops the whole insertion.
> >      >
> >      >   I've tried to put the data in a tmp table and fill the main
> >      > table using DISTINCT this way (the fields and names are just
> >      > examples):
> >      >
> >      > INSERT INTO final_table (name, document)
> >      >    SELECT DISTINCT name, document
> >      >    FROM tmp_TABLE t1
> >      >    WHERE NOT EXISTS (
> >      >    SELECT 1 FROM final_table t2
> >      >    WHERE (t2.name, t2.document)
> >      >    IS NOT DISTINCT FROM (t1.name, t1.document))
> >      >
> >      > The problem is that my final_table is a large (and partitioned)
> >      > table and this query is taking a long time to execute.
> >      >
> >      > Does someone have any idea (really, guys, anything would be
> >      > great) how to solve this situation? I need to ignore duplicates
> >      > instead of getting an error.
> >      >
> >      > I'm using *PostgreSQL 9.4* so I can't use "ON CONFLICT" and
> >      > upgrade is not an option.
> >      >
> >
> >     Explain plan would probably shed some light, but I suspect your
> >     performance is being heavily hit by the sub query. Distinct is an
> >     expensive operation and you are performing it once for every
> >     distinct row
> >     in your temp table.
> >
> >     It isn't clear what the primary key is for your final table -
> >     name + document seems suspicious given these seem to be the only
> >     two columns you're inserting as well. You don't indicate what the
> >     data types are either - if document is something like 'text' then
> >     using it in a distinct clause is likely to have a huge performance
> >     impact.
> >
> >     The first thing I'd do is to eliminate duplicates from your temp
> >     table as a separate statement or by pre-filtering the CSV before
> >     import. I would then try something like an outer join to identify
> >     rows in your temp table which don't exist in your final table and
> >     select from there to insert into the final table. You don't really
> >     need the distinct in the sub query as all you really need to know
> >     is if (name, document) exists - it doesn't matter if more than one
> >     exists (for this test).
> >
> >     If you really don't have something more specific for a primary
> >     key, depending on what data type 'document' is and how large it
> >     is, you may find adding a column which is a checksum of your
> >     'document' field a useful addition. I have done this in the past
> >     where I had an application where name was not unique and we only
> >     wanted distinct instances of 'document' (document was a fairly
> >     large XML document in this case).
> >
> >     --
> >     Tim Cross
> >
> >
>
>
> --
> Adrian Klaver
> adrian(dot)klaver(at)aklaver(dot)com
>
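
To make Tim's suggestion above concrete on 9.4 (no ON CONFLICT), here is
a rough sketch of the staging-table approach, using the column names
from your message; the staging table name and the file path are only
examples:

-- Sketch only; adjust names, types and the path to your schema.
-- From your Java application the COPY would go through the JDBC
-- driver's CopyManager (COPY ... FROM STDIN) rather than a
-- server-side file.
CREATE TEMP TABLE staging (LIKE final_table);

COPY staging (customer_id, date_time, indicator_id, element_id,
              indicator_value)
    FROM '/path/to/load.csv' WITH (FORMAT csv);

-- 1) Remove duplicates inside the staging data itself.
DELETE FROM staging s
USING (SELECT ctid,
              row_number() OVER (PARTITION BY customer_id, date_time,
                                              indicator_id, element_id
                                 ORDER BY ctid) AS rn
       FROM staging) d
WHERE s.ctid = d.ctid
  AND d.rn > 1;

-- 2) Insert only the rows not already in the partitioned table,
--    without DISTINCT in the subquery.
INSERT INTO final_table (customer_id, date_time, indicator_id,
                         element_id, indicator_value)
SELECT s.customer_id, s.date_time, s.indicator_id, s.element_id,
       s.indicator_value
FROM staging s
WHERE NOT EXISTS (SELECT 1
                  FROM final_table f
                  WHERE f.customer_id  = s.customer_id
                    AND f.date_time    = s.date_time
                    AND f.indicator_id = s.indicator_id
                    AND f.element_id   = s.element_id);

If two loads can run at the same time this by itself will not stop
duplicates; a unique constraint on the partitions (as in 1) above) would
still be the safety net.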

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com
