errors with high connections rate

From: Pawel Veselov <pawel(dot)veselov(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: errors with high connections rate
Date: 2012-07-03 07:19:24
Message-ID: CAMnJ+Beq0hCBuTY_=nz=ru0U-No543_RAEunLVSAYU8tugd6NA@mail.gmail.com
Lists: pgsql-general

Hi.

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3 (Amazon
AMI distro).
The application writes data at a high rate (currently 500 transactions
per second), using multiple threads (currently 800).

These are "worker" threads that receive "messages", which are then written
out to the DB. There is no connection pool; instead, each worker thread
maintains its own connection that it uses to write data to the database.
The connections are kept in pthread's "specific" data blocks.

Each thread would connect to the DB when the first work message is
received, or when there was an "error" flag with a connection. The error
flag is set any time there is any error running a database statement.
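
A minimal sketch of the per-thread connection slot described above, for
illustration only. The names worker_conn, get_worker_conn and needs_connect
are hypothetical, and the void pointer stands in for the PGconn * so that the
sketch compiles without libpq; the real application may be organized
differently.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread state; `conn` stands in for a PGconn *. */
struct worker_conn {
    void *conn;      /* NULL until the first work message arrives */
    int   has_error; /* set after any failed database statement */
};

static pthread_key_t  conn_key;
static pthread_once_t conn_key_once = PTHREAD_ONCE_INIT;
static int            conn_key_ok;

/* Runs at thread exit; real code would call PQfinish() on conn first. */
static void free_worker_conn(void *p)
{
    free(p);
}

static void make_conn_key(void)
{
    conn_key_ok = (pthread_key_create(&conn_key, free_worker_conn) == 0);
}

/* Returns the calling thread's slot, allocating it on first use.
   Returns NULL if the key or the slot could not be created. */
struct worker_conn *get_worker_conn(void)
{
    pthread_once(&conn_key_once, make_conn_key);
    if (!conn_key_ok)
        return NULL;

    struct worker_conn *wc = pthread_getspecific(conn_key);
    if (wc == NULL) {
        wc = calloc(1, sizeof *wc);
        if (wc != NULL && pthread_setspecific(conn_key, wc) != 0) {
            free(wc);
            wc = NULL;
        }
    }
    return wc;
}

/* True when the thread has to (re)connect before writing:
   either it never connected, or the error flag is set. */
int needs_connect(const struct worker_conn *wc)
{
    return wc->conn == NULL || wc->has_error;
}
```

A worker would call get_worker_conn() for each message and open a new
connection whenever needs_connect() returns non-zero.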

When the work is "slow" (~250 messages per second), I don't see any
problem. With the load increased, when I restart the process, the threads
start grabbing work at a high enough rate that many of them open their
connection to the database at about the same time, and these errors start
popping up:

Can't connect to DB: could not send data to server: Transport endpoint is
not connected
could not send startup packet: Transport endpoint is not connected

This is a result of executing the following code:

wi->pg_conn = PQconnectdb(conn_str);
ConnectionStatusType cst = PQstatus(wi->pg_conn);

if (cst != CONNECTION_OK) {
    ERR("Can't connect to DB: %s\n", PQerrorMessage(wi->pg_conn));
}

Eventually, the errors go away (when worker threads fail to connect,
they just pass the message to another thread, wait for their turn, and
try reconnecting again), so it does seem that the remedy is just
spreading the connections in time.
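
One common way to spread reconnect attempts in time is exponential backoff
with random jitter. The sketch below is an illustration of that general
technique, not what the application currently does; backoff_delay_ms and its
parameters are hypothetical, and the small linear congruential generator is
used only so that each thread can keep its own random state.

```c
#include <assert.h>

/* Delay in milliseconds to wait before reconnect attempt `attempt`
   (0-based). The upper bound doubles with each attempt, starting at
   base_ms and capped at cap_ms; the result is drawn pseudo-randomly
   from [0, bound) so that threads do not retry in lockstep. `state`
   is per-thread random state, updated on every call. */
unsigned backoff_delay_ms(unsigned attempt, unsigned base_ms,
                          unsigned cap_ms, unsigned *state)
{
    assert(base_ms > 0 && base_ms <= cap_ms);

    unsigned bound = base_ms;
    while (attempt-- > 0 && bound < cap_ms) {
        /* Double without overflowing or exceeding the cap. */
        bound = (bound > cap_ms / 2) ? cap_ms : bound * 2;
    }

    /* Simple LCG step; adequate for jitter, not for cryptography. */
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) % bound;
}
```

Each worker would seed its own state (for example from its thread index),
sleep for the returned delay, and only then attempt to connect again.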

The connection string is '' (empty), the connection is made through
/tmp/.s.PGSQL.5432

I don't see these errors when:
1) the number of worker threads is reduced (I could never reproduce it with
200 or fewer, but have seen it with 300 and more)
2) the load is reduced

-- problem 2 --

When I try to debug this with strace, I can never reproduce it, at least
not well enough to see what's going on, but sometimes I get another error:
"too many users connected". Even restarting postmaster doesn't help. The
postmaster is running with -N810, and the role has a connection limit of
1000. Yet the "too many" error starts creeping up after only 275
connections are opened (counted by successful connect() calls in strace).

Any idea where I should dig?

P.S. I looked at fe-connect.c, and I'm wondering if there is a potential
race condition between poll() and the socket actually finishing the
connection. When running under strace, I never see EINPROGRESS returned
from connect(), and the only reason sendto() would result in ENOTCONN is
that the connect didn't finish, yet the socket was deemed "connected" using
poll/getsockopt...
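
For reference, the poll/getsockopt check mentioned in the P.S. usually looks
something like the sketch below. This is the generic POSIX pattern for
deciding that a non-blocking connect() has finished, not the actual code in
fe-connect.c; wait_connected is a hypothetical name.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Waits up to timeout_ms for fd to become writable, then reads the
   pending socket error. Returns 0 if the socket reports no error,
   otherwise an errno value (ETIMEDOUT if poll() timed out). */
int wait_connected(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
    int rc;

    /* Retry if a signal interrupts the wait. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return errno;
    if (rc == 0)
        return ETIMEDOUT;

    /* Writability alone does not prove success; SO_ERROR carries the
       result of the asynchronous connect. */
    int so_error = 0;
    socklen_t len = sizeof so_error;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len) < 0)
        return errno;
    return so_error;
}
```

A return value of 0 here only means that the kernel reported no pending
error on the socket; whether that is sufficient in the situation described
above is exactly the open question.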

Thanks,
Pawel.
