Re: Duplicate data despite unique constraint

From: Daniel Caldeweyher <dcalde(at)gmail(dot)com>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: "Jonas Tehler *EXTERN*" <jonas(at)tehler(dot)se>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Duplicate data despite unique constraint
Date: 2016-09-11 04:36:41
Message-ID: CADVnD3DPpXCp7Aryk7_LwoiDgKFsmSms1bqdQuqGyC6T68othw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I had a similar issue once and was able to recover from it. If this affects
only some rows and you are able to identify them, this is fixable:

--force table scan to skip using corrupt index
set enable_seqscan=1
set enable_indexscan=0
set enable_bitmapscan=0
select email,count(*)
from users
group by email
having count(*) > 1;

Then, if the rows are simply just duplicates and have no other changes, add
a new serial column (or to prevent blocking, add a bigint column and update
with sequential values), then using the emails from above, delete the ones
with the higher/lower sequence number. Ensure you are still skipping
indexes.

Once the table is clean, drop the sequence column again and re-index.

Hope this helps,
Daniel

On Fri, Sep 2, 2016 at 11:06 PM, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
wrote:

> Jonas Tehler wrote:
> > We’re running Postgresql 9.4.5 on Amazon RDS. One of our tables looks
> something like this:
> >
> >
> > CREATE TABLE users
> > (
> > ...
> > email character varying(128) NOT NULL,
> > ...
> > CONSTRAINT users_email_key UNIQUE (email)
> > )
> >
> > Despite this we have rows with very similar email values. I discovered
> the problem when I tried to add
> > a column and got the following error:
> >
> > ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: could not
> create unique index
> > "users_email_key"
> > DETAIL: Key (email)=(xxx(at)yyy(dot)com) is duplicated.
> > : ALTER TABLE "users" ADD COLUMN "client_branch" character varying(255)
> DEFAULT ‘beta'
> >
> >
> > Now look at this:
> >
> > => select email from users where email = 'xxx(at)yyy(dot)com';
> > email
> > ---------------------------
> > xxx(at)yyy(dot)com
> > (1 row)
> >
> >
> > => select email from users where email LIKE 'xxx(at)yyy(dot)com';
> > email
> > ---------------------------
> > xxx(at)yyy(dot)com
> > xxx(at)yyy(dot)com
> > (2 rows)
> >
> >
> > I have tried to compare the binary data in various ways, email::bytes,
> md5(email),
> > encode(email::bytea, 'hex’), char_length(email) and it all looks the
> same for both rows.
> >
> > Any suggestions how I can discover the difference between the values and
> how they could have been
> > added without triggering the constraint? I know that the values were
> added after the constraint was
> > added.
> >
> > The data was added from a Ruby on Rails app that also has unique
> constraints on the email field and
> > validation on the email format.
>
> That looks very much like data corruption.
>
> I guess there is an index on "users" that is used for one query but not
> the other.
> Can you verify with EXPLAIN?
>
> Assuming that it is a 'text_ops' or 'varchar_ops' index, I'd say it gets
> used for the first
> query, but not for the second. That would mean that there is an extra
> entry in the table that
> is not in the index.
>
> Did you have any crashes, standby promotion, restore with PITR or other
> unusual occurrences recently?
>
> Make sure you have a physical backup; there may be other things corrupted.
>
> This is a possible path to proceed:
>
> Once you have made sure that you have a physical backup, try to add the
> "ctid" column to both queries.
>
> Then delete the extra row from the second query with "DELETE FROM email
> WHERE ctid = ...".
>
> Then, to make sure there is no other corruption lurking, make a logical
> backup
> with pg_dumpall, create a new database cluster, create a new one with
> "initdb" and
> restore the data.
>
> Yours,
> Laurenz Albe
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message dandl 2016-09-11 05:20:30 Re: What limits Postgres performance when the whole database lives in cache?
Previous Message Jim Nasby 2016-09-11 02:34:36 Re: Duplicate data despite unique constraint