From: "Wood, Dan" <hexpert(at)amazon(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: "Wong, Yi Wen" <yiwong(at)amazon(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
Date: 2017-10-09 06:11:16
Message-ID: 835EB743-A57A-49AE-ADA7-A16ACCB737D7@amazon.com
Lists: pgsql-committers pgsql-hackers
I’m unclear on what is being repro’d in 9.6. Are you getting the duplicate-rows problem, or just the REINDEX problem? Are you testing with asserts enabled? (I’m not.)
If you are getting the duplicate rows, look at the block in heapam.c that starts with the comment “replace multi by update xid”.
When I repro this, I find that MultiXactIdGetUpdateXid() returns 0. There is an updater in the multixact member array; however, its status is MultiXactStatusForNoKeyUpdate rather than MultiXactStatusNoKeyUpdate. I assume this is a preliminary status set before the following row in the HOT chain has its multixact set to NoKeyUpdate.
Since 0 is returned, it precedes cutoff_xid, and TransactionIdDidCommit(0) returns false. This ends up treating the row’s updating multixact as aborted even though the real xid committed: xmax is set to 0, and that row becomes visible as one of the duplicates. Interestingly, the real xid of the updater is 122944 and the cutoff_xid is 122945.
I’m still debugging, but I started late, so I’m passing this incomplete info along now.
On 10/7/17, 4:25 PM, "Alvaro Herrera" <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
Peter Geoghegan wrote:
> On Sat, Oct 7, 2017 at 1:31 AM, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> >> As you must have seen, Alvaro said he has a variant of Dan's original
> >> script that demonstrates that a problem remains, at least on 9.6+,
> >> even with today's fix. I think it's the stress-test that plays with
> >> fillfactor, many clients, etc [1].
> >
> > I just execute setup.sql once and then run this shell command,
> >
> > while :; do
> > psql -e -P pager=off -f ./repro.sql
> > for i in `seq 1 5`; do
> > psql -P pager=off -e --no-psqlrc -f ./lock.sql &
> > done
> > wait && psql -P pager=off -e --no-psqlrc -f ./reindex.sql
> > psql -P pager=off -e --no-psqlrc -f ./report.sql
> > echo "done"
> > done
>
> I cannot reproduce the problem on my personal machine using this
> script/stress-test. I tried to do so on the master branch git tip.
> This reinforces the theory that there is some timing sensitivity,
> because the remaining race condition is very narrow.
Hmm, I think I added a random sleep (max. 100ms) right after the
HeapTupleSatisfiesVacuum call in vacuumlazy.c (lazy_scan_heap), and that
makes the race easier to hit.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services