Re: Copy Bulk Ignore Duplicated

From: Leandro Guimarães <leo(dot)guimaraes(at)gmail(dot)com>
To: Tim Cross <theophilusx(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-general(at)postgresql(dot)org
Subject: Re: Copy Bulk Ignore Duplicated
Date: 2019-06-15 02:24:37
Message-ID: CAJV35FPDDhgTcvtp=e_PBvjBe0+5b3+KfGHrVSzK+nuBT5+uxw@mail.gmail.com
Lists: pgsql-general

Hi Tim, thanks for your answer!

The columns were just examples, but let me explain the database structure;
the fields in *bold* are the keys:

*customer_id integer*
*date_time timestamp*
*indicator_id integer*
*element_id integer*
indicator_value double precision

The table is partitioned by day and customer_id (it works great); the
problem is just the duplicate key situation, which I'm really worried about.
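
In case it helps, the structure is roughly like this - a simplified
sketch with a hypothetical table name, leaving out the per-day /
per-customer partitioning:

    CREATE TABLE indicator_values (
        customer_id     integer          NOT NULL,
        date_time       timestamp        NOT NULL,
        indicator_id    integer          NOT NULL,
        element_id      integer          NOT NULL,
        indicator_value double precision,
        PRIMARY KEY (customer_id, date_time, indicator_id, element_id)
    );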

I populate the database from a Java application using JDBC.

Maybe this info helps to shed some light!

Thanks Again!

Leandro Guimarães

On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx(at)gmail(dot)com> wrote:

>
> Leandro Guimarães <leo(dot)guimaraes(at)gmail(dot)com> writes:
>
> > Hi,
> >
> > I have a scenario with a large table and I'm trying to load it via a
> > COPY command with a CSV file.
> >
> > Everything works, but sometimes my source .csv file has data that is
> > duplicated in the previously populated table. If I add a check
> > constraint and try to run the COPY command, I get an error that stops
> > the whole insertion.
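> >
> > Roughly like this (the path and options are just an example):
> >
> >   COPY final_table FROM '/path/to/data.csv' WITH (FORMAT csv);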
> >
> > I've tried to put the data in a tmp table and fill the main one using
> > DISTINCT this way (the fields and names are just examples):
> >
> > INSERT INTO final_table (name, document)
> > SELECT DISTINCT name, document
> > FROM tmp_table t1
> > WHERE NOT EXISTS (
> >     SELECT 1 FROM final_table t2
> >     WHERE (t2.name, t2.document)
> >         IS NOT DISTINCT FROM (t1.name, t1.document))
> >
> > The problem is that my final_table is a large (and partitioned) table and
> > this query is taking a long time to execute.
> >
> > Does anyone have any idea (really, guys, anything would be great) how
> > to solve this situation? I need to ignore duplicates instead of getting
> > an error.
> >
> > I'm using *PostgreSQL 9.4* so I can't use "ON CONFLICT", and an
> > upgrade is not an option.
> >
>
> Explain plan would probably shed some light, but I suspect your
> performance is being heavily hit by the subquery. DISTINCT is an
> expensive operation, and the correlated subquery runs once for every
> row in your temp table.
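>
> For instance (note that EXPLAIN ANALYZE actually executes the
> statement, so wrap it in a transaction you roll back):
>
>   BEGIN;
>   EXPLAIN ANALYZE
>   INSERT INTO final_table (name, document)
>   SELECT DISTINCT name, document FROM tmp_table t1 WHERE ... ;
>   ROLLBACK;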
>
> It isn't clear what the primary key is for your final table - name +
> document seems suspicious given these seem to be the only two columns
> you're inserting as well. You don't indicate what the data types are
> either - if document is something like 'text' then using it in a
> distinct clause is likely to have a huge performance impact.
>
> The first thing I'd do is to eliminate duplicates from your temp table
> as a separate statement, or by pre-filtering the CSV before import. I
> would then try something like an outer join to identify rows in your
> temp table which don't exist in your final table, and select from there
> to insert into the final table. You don't really need the distinct in
> the subquery, as all you really need to know is whether (name, document)
> exists - it doesn't matter if more than one exists (for this test).
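>
> Something along these lines (an untested sketch using your example
> column names, and assuming name is NOT NULL in the final table, so the
> IS NULL test is a safe "no match" check):
>
>   -- de-duplicate the temp table first, as a separate step
>   CREATE TEMP TABLE tmp_dedup AS
>     SELECT DISTINCT name, document FROM tmp_table;
>
>   -- then insert only the rows not already in the final table
>   INSERT INTO final_table (name, document)
>   SELECT t1.name, t1.document
>   FROM tmp_dedup t1
>   LEFT JOIN final_table t2
>     ON (t2.name, t2.document) IS NOT DISTINCT FROM (t1.name, t1.document)
>   WHERE t2.name IS NULL;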
>
> If you really don't have something more specific for a primary key,
> depending on what data type 'document' is and how large it is, you may
> find it useful to add a column which is a checksum of your 'document'
> field. I have done this in the past with an application where name was
> not unique and we only wanted distinct instances of 'document' (document
> was a fairly large XML document in this case).
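>
> A rough sketch of that idea (column and index names here are
> hypothetical; cast document to text first if it isn't text already):
>
>   ALTER TABLE final_table ADD COLUMN document_md5 text;
>   UPDATE final_table SET document_md5 = md5(document);
>   CREATE INDEX ON final_table (name, document_md5);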
>
> --
> Tim Cross
