Re: Index corruption with CREATE INDEX CONCURRENTLY

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index corruption with CREATE INDEX CONCURRENTLY
Date: 2017-02-02 12:44:57
Message-ID: CAA4eK1JYkLuRJa_7aG-PHNA+bFBNoJBW1+8nbXv+T00gGaOX_g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> Hello All,
>
> While stress testing WARM, I encountered a data consistency problem. It
> happens when an index is built concurrently. I initially thought it was a
> WARM-induced bug, and it took a significant amount of investigation to
> finally conclude that it's a bug in master. In fact, we tested all the way
> back to the 9.2 release and the bug exists in all releases.
>
> In my test setup, I keep a UNIQUE index on an "aid" column and then have
> many other "aid[1-4]" columns, on which indexes are built and dropped. The
> script then updates a randomly selected row using any of the aid[1-4]
> columns. Even these columns are updated during the tests to force WARM
> updates, but within a narrow range of non-overlapping values. So we can
> always find the unique row using some kind of range condition. The
> reproduction case is much simpler and I'll explain that below.
>
> In the reproduction scenario (attached), it happens fairly quickly, maybe
> after 3-10 rounds of DROP/CREATE. You may need to adjust the -j value if it
> does not happen for you even after a few rounds. Just for the record, I
> haven't checked past reports to see whether the bug explains any of the
> index corruption issues we might have received.
>
> To give a short summary of how CIC works with HOT, it consists of three
> phases:
> 1. Create the index entry in the catalog, mark it not ready for inserts,
> commit the transaction, start a new transaction and then wait for all
> potential lock holders on the table to end
> 2. Take an MVCC snapshot, build the index, mark the index ready for inserts,
> commit and then once again wait for potential lock holders on the table to
> end
> 3. Do a validation phase where we look at MVCC-visible tuples and add
> missing entries to the index. We only look at the root line pointer of each
> tuple while deciding whether a particular tuple already exists in the index
> or not. IOW we look at HOT chains and not at the tuples themselves.
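
For reference, the three phases map roughly onto the following flow in
DefineIndex() (a heavily abridged paraphrase of the 9.6-era code in
src/backend/commands/indexcmds.c, not the exact statements or arguments):

    /* Phase 1: catalog entry exists, but indisready = indisvalid = false */
    indexRelationId = index_create(rel, ...);
    CommitTransactionCommand();
    StartTransactionCommand();
    WaitForLockers(heaplocktag, ShareLock);

    /* Phase 2: build the index, then mark it ready for inserts */
    indexInfo->ii_Concurrent = true;
    index_build(rel, indexRelation, indexInfo, false, false);
    index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
    CommitTransactionCommand();
    StartTransactionCommand();
    WaitForLockers(heaplocktag, ShareLock);

    /* Phase 3: validate against a fresh MVCC snapshot (matching existing
     * entries by the root line pointer of each HOT chain), then mark valid */
    snapshot = RegisterSnapshot(GetTransactionSnapshot());
    validate_index(relationId, indexRelationId, snapshot);
    UnregisterSnapshot(snapshot);
    index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);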
>
> The interesting cases are phases 1 and 2. Phase 2 works on the premise that
> every transaction started after we finished waiting for all existing
> transactions will definitely see the new index. All these new transactions
> then look at the columns used by the new index and consider them while
> deciding the HOT-safety of updates. Now, for reasons I don't fully
> understand, there exists a path where a transaction starts after CIC waited
> and took its snapshot at the start of the second phase, but still fails to
> see the new index. Or maybe it sees the index, but fails to update its
> rd_indexattr list to take into account the columns of the new index. That
> leads to wrong decisions and we end up with a broken HOT chain, something
> the CIC algorithm is not prepared to handle. This in turn leads to the new
> index missing entries for the updated tuples, i.e. the index may have aid1 =
> 10, but the value in the heap could be aid1 = 11.
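
For context, the HOT-safety decision being described here is made in
heap_update(); very roughly, and with illustrative variable names rather
than the exact code, it amounts to:

    /* columns covered by any index on the table, as cached in the
     * relcache (this is what a stale rd_indexattr ends up feeding) */
    hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);

    /* modified_attrs: columns actually changed by this UPDATE */
    if (!bms_overlap(modified_attrs, hot_attrs) && new_tuple_fits_on_page)
        use_hot_update = true;   /* HOT update: no new index entries made */

So if hot_attrs comes from a stale rd_indexattr that does not yet include
the CIC index's columns, the update is wrongly treated as HOT-safe and the
new index never gets an entry for it.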
>
> Based on my investigation so far and the evidence that I've collected, what
> probably happens is that after a relcache invalidation arrives at the new
> transaction, it recomputes rd_indexattr, but based on the old, cached
> rd_indexlist. So rd_indexattr does not include the new columns. Later,
> rd_indexlist is updated, but rd_indexattr remains what it was.
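
The recomputation being described happens in RelationGetIndexAttrBitmap()
in relcache.c, which builds rd_indexattr from whatever index list
RelationGetIndexList() hands back; heavily abridged:

    /* RelationGetIndexAttrBitmap(), heavily abridged */
    if (relation->rd_indexattr != NULL)
        return bms_copy(relation->rd_indexattr);    /* cached result */

    /* may be served straight from the cached rd_indexlist */
    indexoidlist = RelationGetIndexList(relation);

    foreach(l, indexoidlist)
    {
        /* open each index and add its key columns to the bitmapsets */
    }

    /* cache the result in CacheMemoryContext for later callers */
    relation->rd_indexattr = bms_copy(indexattrs);

So if rd_indexlist happens to be stale at this point, the stale set of
columns gets cached into rd_indexattr and stays there until the next
invalidation.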
>
> There is definite evidence of this sequence of events (collected after
> sprinkling the code with elog messages), but it's not yet clear to me why it
> affects only a small percentage of transactions. One possibility is that it
> depends on the stage at which an invalidation is received and processed
> by the backend. I find it a bit confusing that even though rd_indexattr is
> tightly coupled to rd_indexlist, we don't invalidate or recompute them
> together. Shouldn't we do that?
>

During invalidation processing, we do destroy them together, in
RelationDestroyRelation(). So ideally, after invalidation processing,
both should be rebuilt, but I am not sure why that is not happening in
this case.
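
If memory serves, that function frees both caches side by side, roughly
(abridged from relcache.c):

    /* RelationDestroyRelation(), abridged */
    list_free(relation->rd_indexlist);
    bms_free(relation->rd_indexattr);
    bms_free(relation->rd_keyattr);
    bms_free(relation->rd_idattr);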

> For example, in the attached patch we
> invalidate rd_indexattr whenever we recompute rd_indexlist (i.e. if we see
> that rd_indexvalid is 0).

/*
+ * If the index list was invalidated, we better also invalidate the index
+ * attribute list (which should automatically invalidate other attributes
+ * such as primary key and replica identity)
+ */

+ relation->rd_indexattr = NULL;
+

I think setting it directly to NULL will leak memory.
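
To avoid the leak, the reset would need to free the old bitmapsets first,
along these lines (untested sketch; bms_free() is a no-op on NULL):

    /* free the old bitmapsets (allocated in CacheMemoryContext) before
     * marking them as not computed */
    bms_free(relation->rd_indexattr);
    relation->rd_indexattr = NULL;
    bms_free(relation->rd_keyattr);
    relation->rd_keyattr = NULL;
    bms_free(relation->rd_idattr);
    relation->rd_idattr = NULL;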

> This patch fixes the problem for me (I've only
> tried master), but I am not sure if this is a correct fix.
>

I think it is difficult to say that the fix is correct unless there is
a clear explanation of what led to this problem. Another angle to
investigate is to find out when relation->rd_indexvalid is set to 0
for the particular session where you are seeing the problem. I can see
a few places where we set it to 0 and clear rd_indexlist, but not
rd_indexattr. I am not sure whether that has anything to do with the
problem you are seeing.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
