Re: Violation of primary key constraint

From: Toby Murray <toby(dot)murray(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Violation of primary key constraint
Date: 2013-02-01 07:34:06
Message-ID: CAJeqKgug0bSbf8GtZYr=zS_sBDe32i42xR+mH-6YtriaXn1wWw@mail.gmail.com
Lists: pgsql-bugs

On Thu, Jan 31, 2013 at 11:21 PM, Toby Murray <toby(dot)murray(at)gmail(dot)com> wrote:
> Also, the node change happened on January 28th. To be precise, the
> timestamp of the node is 2013-01-28 02:38:29. So it is looking like
> something may have gone wrong, possibly in a single minutely update,
> on January 28th. I'll do a little more digging.

I can confirm that all four of these errors can be accounted for by two
back-to-back minutely updates, at 8:38 and 8:39 on January 28th:
http://planet.openstreetmap.org/replication/minute/000/198/193.osc.gz
http://planet.openstreetmap.org/replication/minute/000/198/194.osc.gz
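For anyone who wants to look at neighbouring diffs: the path in these URLs is the replication sequence number, zero-padded to nine digits and split into three groups of three. A minimal sketch of that mapping follows; the function name minute_diff_url is only illustrative and is not part of any replication tool.

```python
# Base of the minutely replication directory, taken from the URLs above.
BASE_URL = "http://planet.openstreetmap.org/replication/minute"


def minute_diff_url(sequence):
    """Build the .osc.gz URL for a minutely diff sequence number.

    Hypothetical helper for illustration: the sequence number is
    zero-padded to nine digits and split into AAA/BBB/CCC.
    """
    padded = f"{sequence:09d}"
    return f"{BASE_URL}/{padded[0:3]}/{padded[3:6]}/{padded[6:9]}.osc.gz"


if __name__ == "__main__":
    # The two diffs discussed in this message.
    for seq in (198193, 198194):
        print(minute_diff_url(seq))
```

The two diffs above are therefore sequence numbers 198193 and 198194.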

It is possible that both of them were applied in a single transaction.
Sometimes replication falls behind by a few seconds and misses the
minutely update; the next run then picks up two minutes' worth of
changes and combines them before updating the database. However, one of
the queries from earlier seemed to indicate that two different
transactions were involved.
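One way to check which rows both diffs touched is to list the elements that appear in each file and intersect them. The following is only a rough sketch, assuming the usual osmChange layout (create/modify/delete blocks containing node/way/relation elements with id and version attributes). The function names are made up for illustration, and this is not how the import tool itself processes diffs.

```python
import gzip
import xml.etree.ElementTree as ET


def touched_elements(osc_xml):
    """Map (element type, id) -> list of (action, version) in one osmChange document."""
    touched = {}
    root = ET.fromstring(osc_xml)
    for action in root:          # <create>, <modify> or <delete>
        for elem in action:      # <node>, <way> or <relation>
            key = (elem.tag, int(elem.get("id")))
            entry = (action.tag, int(elem.get("version")))
            touched.setdefault(key, []).append(entry)
    return touched


def touched_in_both(first_xml, second_xml):
    """Return the elements that occur in both documents, with their actions in each."""
    first = touched_elements(first_xml)
    second = touched_elements(second_xml)
    return {key: (first[key], second[key]) for key in first.keys() & second.keys()}


def read_osc_gz(path):
    """Read a locally downloaded .osc.gz file as bytes."""
    with gzip.open(path, "rb") as handle:
        return handle.read()
```

With both files downloaded locally, touched_in_both(read_osc_gz("193.osc.gz"), read_osc_gz("194.osc.gz")) would list every element modified in both minutes, which can then be compared against the ids in the constraint errors.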

So with a specific timeframe in mind, I looked at other things. It was
2:30 AM on a Monday morning here and it looks like I was in bed at the
time. I don't see any errors in the log files at that time. However,
about 6 hours later I did get my first notification from smartd of an
unreadable sector on one of my drives and a failed SMART self-test.
This is one of the drives in the RAID5 array so even if it failed, it
should (in theory of course) not have corrupted anything. Also, the
affected table data lives on an SSD that is not part of the array. The
index is on the array but it sounds like this has been ruled out as a
cause of error.

We were discussing the possibility of hardware failure on IRC. If
something went wrong on the SSD, it sounds like it would have had to
have been a completely silent write failure. This is of course always
a possibility, although I haven't seen any other indications that this
drive is failing. It is pretty new, at 1,200 power-on hours and a wear
leveling count of 5.

Toby
