From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: conchuela timeouts since 2021-10-09 system upgrade |
Date: | 2021-10-26 01:51:57 |
Message-ID: | 20211026015157.GA113335@rfd.leadboat.com |
Lists: | pgsql-bugs |

On Mon, Oct 25, 2021 at 04:59:42PM -0400, Tom Lane wrote:
> Andrey Borodin <x4mmm(at)yandex-team(dot)ru> writes:
> > FWIW it's easy to make the issue reproduce faster with following diff
> > - '--no-vacuum --client=1 --transactions=100',
> > + '--no-vacuum --client=1 --transactions=1',
>
> Hmm, didn't help here. It seems that even though prairiedog managed to
> fail on its first attempt, it's not terribly reproducible there; I've
> seen only one failure in about 30 manual attempts. In the one failure,
> the non-background pgbench completed fine (as determined by counting
> statements in the postmaster's log); but the background one had only
> finished about 90 transactions before seemingly getting stuck. No new
> SQL commands had been issued after about 10 seconds.

Interesting.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-10-24%2016%3A05%3A58
also shows a short command count, just 131/200 completed. However,
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-10-25%2000%3A35%3A27
shows the full 200/200. I'm starting to think the prairiedog failures have
only superficial similarity to the conchuela failures.

> Nonetheless, I have a theory and a proposal. This coding pattern
> seems pretty silly:
>
> $pgbench_h->pump_nb;
> $pgbench_h->finish();
>
> ISTM that if you need to call pump at all, you need a loop not just
> one call. So I'm guessing that when it fails, it's for lack of
> pumping.

The pump_nb() is just unnecessary. We've not added anything destined for
stdin, and finish() takes care of pumping outputs.

> The other thing I noticed is that at least on prairiedog's host, the
> number of invocations of the DROP/CREATE/bt_index_check transaction
> is ridiculously out of proportion to the number of invocations of the
> other transactions. It can only get through seven or eight iterations
> of the index transaction before the other transactions are all done,
> which means the last 190 iterations of that transaction are a complete
> waste of cycles.

That makes sense.

> What I think we should do in these two tests is nuke the use of
> background_pgbench entirely; that looks like a solution in search
> of a problem, and it seems unnecessary here. Why not run
> the DROP/CREATE/bt_index_check transaction as one of three script
> options in the main pgbench run?

The author tried that and got deadlocks:
https://postgr.es/m/5E041A70-4946-489C-9B6D-764DF627A92D@yandex-team.ru

On prairiedog, the proximate trouble is pgbench getting stuck. IPC::Run is
behaving normally given a stuck pgbench. When pgbench stops sending queries,
does pg_stat_activity show anything at all running? If so, are those backends
waiting on locks? If not, what's the pgbench stack trace at that time?
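
For the first two questions, a minimal sketch of what could be run from a shell while pgbench appears stuck. The function names are hypothetical, the default database name is an assumption, and the query assumes a server new enough to have `backend_type` and `pg_blocking_pids()` (PostgreSQL 10 or later). It is an illustration only, not something taken from the test suite.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the PostgreSQL tree.

# Print a query that lists every other client backend, what it is waiting
# on, and which PIDs (if any) hold locks that block it.
stuck_backend_query() {
    cat <<'EOF'
SELECT pid, state, wait_event_type, wait_event,
       pg_blocking_pids(pid) AS blocked_by,
       now() - query_start AS running_for,
       left(query, 60) AS query
  FROM pg_stat_activity
 WHERE backend_type = 'client backend'
   AND pid <> pg_backend_pid()
 ORDER BY query_start;
EOF
}

# Run the query against the database named in $1 (assumed default: postgres).
# Host, port, and user come from the usual PG* environment variables.
show_stuck_backends() {
    stuck_backend_query | psql -X -d "${1:-postgres}"
}
```

Calling `show_stuck_backends` during the hang would show whether any backend is still active. A non-empty `blocked_by` array or a `wait_event_type` of `Lock` would point at a lock wait. If no rows come back, the server is idle and the pgbench stack trace is the next thing to look at.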