Database logins taking longer and longer, showing up as "authentication" in ps(1)

From: Adam Sjøgren <asjo(at)koldfront(dot)dk>
To: pgsql-general(at)postgresql(dot)org
Subject: Database logins taking longer and longer, showing up as "authentication" in ps(1)
Date: 2020-08-18 12:53:53
Message-ID: 87364kdsim.fsf@tullinup.koldfront.dk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

We have a PostgreSQL 11.3¹ running on an Ubuntu 16.04 server, which
sometimes exhibits a behaviour I can't quite understand: simply logging
into the database starts to take minutes to complete.

We have 60 processes (workers) running on different machines accessing
the database, that all grab jobs from a queue and update rows in a table
after doing some calculations (which vary in time from <1s to perhaps a
minute, many of them fast).

Sometimes new database logins slow down, from usually taking <0.05s to
taking minutes. This is for psql as a normal user using Kerberos, for
psql as the postgres superuser, for the web-application logging into the
database, for everything.

What I see in the output of ps(1) is a lot of postgres processes in
state "authentication", when this problem occurs. It "feels" like the
logins are waiting for something - when e.g. psql finally logs in,
running queries seem normal.

The only "special" thing I can think of is that the workers use advisory
locks on rows in the table they are updating (as there can be multiple
jobs in flight at the same time, for the same row).

I can imagine that a pattern of jobs and workers could make the locking
cause those workers to wait and slow each other down, but I don't
understand how it can affect the time it takes for new database logins.

When this happens, if I stop the 60 workers, the login times are back to
normal within seconds, and all the postgres processes showing
"authentication" switch away from that state. If I start the workers
again, the login problem appears within a couple of minutes.

If I start a smaller number of workers, say 5, the problem doesn't
appear.

Could this be lock contention on a database table somehow affecting
logins?

I have tried searching for this problem, and found some older references
to things² that seem to have been fixed in the meantime. Is anybody else
seeing this?

The server is configured with max_connections 1200 and currently has
~400 connections. It has 1.5 TB of memory and 112 cores (including
hyperthreading) - the load is not high when I see this problem, between
20 and 30.

Any ideas and suggestions are welcome - if there is some relevant
information I can provide, I will be happy to try.

Best regards,

Adam

¹ We should plan to upgrade to 11.9, yes.

² E.g. https://www.postgresql.org/message-id/CAJw8uJQGoobesbPCMbxj6Vb4nv9D-GgvZ%2B7pK%2Bfckbb4DqJEAg%40mail.gmail.com
and https://www.postgresql.org/message-id/20140312232810.634.7997%40wrigleys.postgresql.org

--
"We don't care about crashing our cars Adam Sjøgren
Pedal to the metal and our heads among the stars" asjo(at)koldfront(dot)dk

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Stephen Frost 2020-08-18 13:02:20 Re: Point in time recovery
Previous Message Stephen Frost 2020-08-18 12:52:50 Re: "Go" (lang) standard driver