VACUUM: Nonremovable rows due to wal sender process

From: Steve Nixon <inverasln(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: VACUUM: Nonremovable rows due to wal sender process
Date: 2022-01-04 17:16:00
Message-ID: CACSoXPLHBDObjLFZ2MGpjvw+g7qXNpKPG+h0wdkh1-PT2pWWjg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have PostgreSQL Version 10.7 on AIX 7.1 set up with streaming
replication. Replication appears to be working fine and database contents
are staying current.

*ps -ef |grep sender*
> postgres 54854022 30212254 0 10:10:29 - 0:00 postgres: wal sender
> process postgres 10.253.15.123(47852) streaming 54/BB631A30
>

> *ps -ef |grep receiver*
> postgres 34079622 9897420 0 10:10:29 - 0:00 postgres: wal
> receiver process streaming 54/BB631A30

The problem I have is related to the wal sender process. The AUTOVACUUM and
VACUUM are not cleaning up dead tuples in the tables because it is
reporting that they are "nonremovable" due to the backend_xmin that is not
changing. This has resulted in queries on some tables taking seconds or
minutes to return under 100 tuples that should take 5ms or less.

*VACUUM VERBOSE scttlk_tbl;*

INFO: "scttlk_tbl": found 0 removable, 149715 nonremovable row versions in
3322
out of 12152 pages
DETAIL: 149699 dead row versions cannot be removed yet, oldest xmin:
340818216
There were 21246 unused item pointers.
Skipped 0 pages due to buffer pins, 8830 frozen pages.

When I check the backend_xmin that is indicated as preventing the dead
tuples from being removed, the PID it points to is the wal sender.

*SELECT pid, datname, usename, state, backend_xid, backend_xminFROM
pg_stat_activity WHERE backend_xmin = 340818216;*

pid | datname | usename | state | backend_xid | backend_xmin
----------+--------------+----------+--------+-------------+--------------
54854022 | | postgres | active | | 340818216

I have determined that if I shut down the replication database, the wal
sender process will shut down. When I do this and run my VACUUM, it is then
able to remove the dead tuples that were nonremovable prior. However, when
I restart the replication database, the wal sender becomes active again and
tries to pick up where it left off, at the same backend_xmin.

I believe the issue may be related to another product we are using as part
of the replication process called "Attunity". But we have shut that down
and restarted it to make sure it did not have any long running queries or
other hooks that may be affecting the wal sender and preventing the
backend_xmin from moving forward. It just does not seem to do so.

My questions are as follows:

1) Is there anything I can do short of shutting down and restarting the
primary (production system) that would allow the backend_xmin to move
forward?

2) Is it possible to "kill" the WAL sender process? I know it's possible,
but what I mean is will it crash Postgres doing that? Or will it simply
respawn?

Ultimately, the goal is to get backend_xmin to be caught up to work being
done today and not waiting on something from days or weeks ago to release
so the autovacuum can take place.

Hope I'm explaining myself right! Please let me know any advice you may
have on this, and thanks in advance for any tips on where to look or how to
address this.

Regards,

Steve N.

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Sergei Kornilov 2022-01-04 20:17:52 Re:VACUUM: Nonremovable rows due to wal sender process
Previous Message Zbigniew Kostrzewa 2022-01-03 22:48:14 Re: WAL files keep piling up