Re: Walsender waiting on SnapbuildSync

From: Andres Freund <andres(at)anarazel(dot)de>
To: Brent Kerby <blkerby(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Walsender waiting on SnapbuildSync
Date: 2018-08-05 14:36:41
Message-ID: 20180805143641.mjvgag4ck4wgadgk@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 2018-08-04 13:34:04 -0600, Brent Kerby wrote:
> On Postgres 10.3 (on AWS RDS), I am running logical decoding using the
> test_decoding output plugin, and every few minutes I am seeing pauses in
> the stream, unrelated to any large transactions. About once every hour or
> two, the pause is long enough that the database disconnects my client due
> to exceeding wal_sender_timeout (30 seconds -- the RDS default value);
> after reconnecting it is able to make progress again. My client is using
> the streaming replication protocol via pgjdbc (with a status interval of 1
> second). What I'm seeing is that during such a pause, the server is not
> sending any data to the client:
>
> - pg_stat_replication.sent_lsn stops advancing
> - My client is blocking in a call to PGReplicationStream.read()
> - pg_stat_activity shows that the walsender process has a wait_event of
> 'SnapbuildSync'.
>
> In this scenario, it makes sense that the client would be timed out: pgjdbc
> only sends feedback to the server at the beginning of a call to
> PGReplicationStream.read(), so if a single call blocks a long time, never
> receiving any data from the server, then the client would stop sending
> feedback to the server, causing timeout.
>
> My question is why might the server be spending so much time waiting on
> SnapbuildSync? The docs describe this event as follows:
>
> "IO / SnapbuildSync / Waiting for a serialized historical catalog snapshot
> to reach stable storage."

> I gather that this is related to statistics collection, but I'm not
> understanding why a walsender process would wait on such an event nor why
> it would take such a long time. Any ideas?

It's *not* related to statistics collection. It's fsyncing a few *small*
files to disk. On a first blush this seems to indicate that your storage
system is quite overloaded?

The relevant piece of code is:

/*
* fsync the file before renaming so that even if we crash after this we
* have either a fully valid file or nothing.
*
* TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
* some noticeable overhead since it's performed synchronously during
* decoding?
*/
pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_SYNC);
if (pg_fsync(fd) != 0)
{
int save_errno = errno;

CloseTransientFile(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m", tmppath)));
}
pgstat_report_wait_end();
CloseTransientFile(fd);

Greetings,

Andres Freund

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2018-08-05 14:43:29 Re: ALTER TABLE .. SET STATISTICS
Previous Message Ron 2018-08-05 06:03:01 ALTER TABLE .. SET STATISTICS