From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution. |
Date: | 2016-02-11 15:42:23 |
Message-ID: | 14581.1455205343@sss.pgh.pa.us |
Lists: | pgsql-committers pgsql-hackers |
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> That would be great. Taking a look at what happened, I have a feeling
>> this may be a race condition of some kind in the isolation tester. It
>> seems to have failed to recognize that a1 started waiting, and that
>> caused the "deadlock detected" message to be reported differently. I'm
>> not immediately sure what to do about that.
> Yeah, so: try_complete_step() waits 10ms, and if it still hasn't
> gotten any data back from the server, then it uses a separate query to
> see whether the step in question is waiting on a lock. So what
> must've happened here is that it took more than 10ms for the process
> to show up as waiting in pg_stat_activity.
No, because the machines that are failing are showing a "<waiting ...>"
annotation that your reference output *doesn't* have. I think what is
actually happening is that these machines are seeing the process as
waiting and reporting it, whereas on your machine the backend detects
the deadlock and completes the query (with an error) before
isolationtester realizes that the process is waiting.
It would probably help if you didn't do this:
    setup { BEGIN; SET deadlock_timeout = '10ms'; }
which pretty much guarantees that there is a race condition: you've set it
so that the deadlock detector will run at approximately the same time that
isolationtester is probing the state. I'm surprised that it seemed
to act consistently for you. I would suggest setting deadlock_timeout to
100s in all the other sessions and to ~5s in the one you want to fail.
That will mean that the "<waiting ...>" output should
show up pretty reliably even on overloaded buildfarm critters.
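As a rough sketch of that suggestion, a two-session spec might look something like the following. The table, session, and step names here are invented for illustration and are not taken from the committed test; only the deadlock_timeout values reflect the suggestion above.

```
# Hypothetical example: names are made up, not from the committed spec.
setup { CREATE TABLE a1 (); CREATE TABLE a2 (); }
teardown { DROP TABLE a1, a2; }

# s1 should never be the one to detect the deadlock, so give it a
# timeout far longer than the test can plausibly take.
session "s1"
setup { BEGIN; SET deadlock_timeout = '100s'; }
step "s1a1" { LOCK TABLE a1; }
step "s1a2" { LOCK TABLE a2; }
step "s1c" { COMMIT; }

# s2 is the session expected to fail; ~5s leaves isolationtester plenty
# of time to notice that it is waiting before the detector runs.
session "s2"
setup { BEGIN; SET deadlock_timeout = '5s'; }
step "s2a2" { LOCK TABLE a2; }
step "s2a1" { LOCK TABLE a1; }
step "s2c" { COMMIT; }

permutation "s1a1" "s2a2" "s1a2" "s2a1" "s1c" "s2c"
```

The idea is that only s2's deadlock check can fire within the lifetime of the test, and it fires long after isolationtester's probe, so which session reports the error and whether the waiting annotation appears no longer depends on timing.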
regards, tom lane