Re: conchuela timeouts since 2021-10-09 system upgrade

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: conchuela timeouts since 2021-10-09 system upgrade
Date: 2021-10-26 01:51:57
Message-ID: 20211026015157.GA113335@rfd.leadboat.com
Lists: pgsql-bugs

On Mon, Oct 25, 2021 at 04:59:42PM -0400, Tom Lane wrote:
> Andrey Borodin <x4mmm(at)yandex-team(dot)ru> writes:
> > FWIW it's easy to make the issue reproduce faster with following diff
> > - '--no-vacuum --client=1 --transactions=100',
> > + '--no-vacuum --client=1 --transactions=1',
>
> Hmm, didn't help here. It seems that even though prairiedog managed to
> fail on its first attempt, it's not terribly reproducible there; I've
> seen only one failure in about 30 manual attempts. In the one failure,
> the non-background pgbench completed fine (as determined by counting
> statements in the postmaster's log); but the background one had only
> finished about 90 transactions before seemingly getting stuck. No new
> SQL commands had been issued after about 10 seconds.

Interesting.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-10-24%2016%3A05%3A58
also shows a short command count, just 131/200 completed. However,
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-10-25%2000%3A35%3A27
shows the full 200/200. I'm starting to think the prairiedog failures have
only superficial similarity to the conchuela failures.

> Nonetheless, I have a theory and a proposal. This coding pattern
> seems pretty silly:
>
> $pgbench_h->pump_nb;
> $pgbench_h->finish();
>
> ISTM that if you need to call pump at all, you need a loop not just
> one call. So I'm guessing that when it fails, it's for lack of
> pumping.

The pump_nb() call is simply unnecessary. We've not added anything destined for
stdin, and finish() takes care of pumping outputs.

> The other thing I noticed is that at least on prairiedog's host, the
> number of invocations of the DROP/CREATE/bt_index_check transaction
> is ridiculously out of proportion to the number of invocations of the
> other transactions. It can only get through seven or eight iterations
> of the index transaction before the other transactions are all done,
> which means the last 190 iterations of that transaction are a complete
> waste of cycles.

That makes sense.

> What I think we should do in these two tests is nuke the use of
> background_pgbench entirely; that looks like a solution in search
> of a problem, and it seems unnecessary here. Why not run
> the DROP/CREATE/bt_index_check transaction as one of three script
> options in the main pgbench run?

The author tried that and got deadlocks:
https://postgr.es/m/5E041A70-4946-489C-9B6D-764DF627A92D@yandex-team.ru

On prairiedog, the proximate trouble is pgbench getting stuck. IPC::Run is
behaving normally given a stuck pgbench. When pgbench stops sending queries,
does pg_stat_activity show anything at all running? If so, are those backends
waiting on locks? If not, what's the pgbench stack trace at that time?
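
For reference, here is a minimal sketch of a query that could answer the first two questions. It is not taken from the thread. It assumes PostgreSQL 10 or later (for the backend_type column) and a separate psql session opened while pgbench appears stuck. The column aliases and the 60-character truncation are arbitrary choices.

```sql
-- Hypothetical diagnostic query, not from the original message.
-- Run from a separate session while pgbench appears stuck.
SELECT pid,
       state,                                 -- 'active', 'idle in transaction', ...
       wait_event_type,                       -- 'Lock' means waiting on a heavyweight lock
       wait_event,
       pg_blocking_pids(pid) AS blocked_by,   -- PIDs holding the locks this backend waits for
       now() - query_start   AS query_age,
       left(query, 60)       AS query_head
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND pid <> pg_backend_pid()                 -- exclude this diagnostic session
ORDER BY query_start;
```

If this returns no rows, or only idle backends, the hang is probably on the pgbench side, and a stack trace of the pgbench process would be the next thing to collect.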
