Re: BUG #10189: Limit in 9.3.4 no longer works when ordering using a composite multi-type index

From: Nick Rupley <nickr(at)mirthcorp(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: nickr(at)mirth(dot)com, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #10189: Limit in 9.3.4 no longer works when ordering using a composite multi-type index
Date: 2014-05-02 15:42:41
Message-ID: CAMi1eSFZxAhJRwakc+CfWv8xn1smSR93+Fcmh5Q_4kLUzTNWLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Thanks again. So as far as recovery goes, that appears to be working. We
actually wrote a PL/pgSQL script to drop constraints/indices on the table,
create a copy of the table, and insert from the old table into the copy.
Then we could programmatically delete all duplicates, rename the copy back
to the original ("_old" "_new" pattern), and then add the
indices/constraints back. Works like a charm.

You were right, this affected other tables as well. We actually did suffer
from some crashes fairly recently. At first since we thought it was a
general corruption issue, we expected to find such problems scattered
across all our tables. However it was confined to a very specific set of
tables (the d_mm*, having the td I posted earlier).

Not only that, but when we went to find duplicates after copying the table,
we found that all of them showed the same pattern:

message_id | id | received_date | status | send_attempts |
response_date |
------------+----+-------------------------------+--------+---------------+----------------------------+
268752 | 0 | 2014-04-17 15:01:12.786786000 | T | 0 |
|
268752 | 0 | 2014-04-17 15:01:12.786786000 | R | 1 |
2014-04-17 15:01:12.795+00 |

As a reminder, the primary key on that table is (message_id, id). Just to
give a little background, that record with the status "R" is inserted
first, and there's no send attempts or response date yet. Then
asynchronously / concurrently, two things can happen:

- The row is updated, setting the send attempts and response date
- The row is updated, setting the status to T, and at the same time a FK
referencing row is inserted in the associated d_mc table.

We also tracked down every single duplicate entry and noted the time it got
inserted. What we found is that this issue only ever started happening
after we upgraded from 9.0.13 to 9.3.4. The upgrade itself doesn't seem to
be the culprit, but messages received *after* the upgrade encountered this
issue.

Looking at these articles:

http://www.postgresql.org/about/news/1511/
http://www.postgresql.org/docs/9.3/static/release-9-3-4.html

Those describe pretty much *exactly* what we've been seeing. To recap:

- Our server recovered from a system crash recently, a few of them
actually
- We were getting inconsistent query results (no rows returned versus
rows returned), and the difference in results matched the query plans,
since one used an index scan, and the other used a sequential scan with a
ROW/SORT instead. The query plans I posted earlier show this.
- Subsequent processing did indeed result in constraint violations,
since conflicting rows were inserted. In our case, the message_id and id
should comprise the primary key, but we were actually seeing duplicate rows
with that same pair, but different statuses (R vs. T).
- The row in the d_mm table (status R, id=0) is referenced by a FK in
the d_mc table, and it can indeed be updated concurrently with the creation
of the referencing row in d_mc.

Now from the news post it sounds like this issue should have been fixed in
9.3.4. We were on 9.0.13 with the same table structure, same mid to
high-volume inserts/updates, and have never once encountered this, until we
upgraded to 9.3.4. The update press release is a bit confusing, it says "The
PostgreSQL Global Development Group has released an update to all supported
version of the database system, including versions 9.3.4, 9.2.8, 9.1.13,
9.0.17, and 8.4.21." So one may interpret that as either "9.3.4 is the
update" or "this is an update TO 9.3.4". I figure it's the former, but
correct me if I'm wrong.

Thanks in advance for any input, we're trying to make a self-contained test
case in the meantime...

-Nick

On Wed, Apr 30, 2014 at 4:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Nick Rupley <nickr(at)mirthcorp(dot)com> writes:
> > Thanks Tom. We attempted to reindex:
> > mirthdb=# reindex index d_mm65_index1;
>
> > ERROR: failed to find parent tuple for heap-only tuple at (14808,2) in
> > table "d_mm65"
>
> Ugh :-(. That matches the symptoms of one of the known data-corruption
> bugs in 9.3.x, specifically the business about mis-replay of the WAL entry
> for locking a tuple. (I suppose this database has either suffered some
> crashes, or been promoted from a replication slave?) The bug is fixed as
> of 9.3.4, but the fix doesn't magically cure existing corruption :-(
>
> You can look at the specific row involved here with
>
> select * from d_mm65 where ctid = '(14808,2)';
>
> Probably the best recovery strategy is to DELETE that row by ctid,
> then reinsert the same data. Lather, rinse, repeat until you can reindex
> successfully. Better check your other tables too.
>
> regards, tom lane
>

--
CONFIDENTIALITY NOTICE: The information contained in this electronic
transmission may be confidential. If you are not an intended recipient, be
aware that any disclosure, copying, distribution or use of the information
contained in this transmission is prohibited and may be unlawful. If you
have received this transmission in error, please notify us by email reply
and then erase it from your computer system.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2014-05-02 18:23:50 Re: BUG #10189: Limit in 9.3.4 no longer works when ordering using a composite multi-type index
Previous Message 德哥 2014-05-02 00:13:49 Re: BUG #10194: Stable function in select clause cann't be optimized to one call?