Re: Backup failure Postgres

From: Torsten Förtsch <tfoertsch123(at)gmail(dot)com>
To: Jethish Jethish <jethish777(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Backup failure Postgres
Date: 2024-05-23 12:15:47
Message-ID: CAKkG4_nK1DYU=DTe+hb-TQnS5+05hJ1MXTf5QNKJSfsLrZ-7iA@mail.gmail.com
Lists: pgsql-general

Look, you have to compromise somewhere. Let me explain the problem. PG uses
MVCC. That means when you update or delete rows, the old row versions are
not actually removed and their space is not returned to free space
immediately. They are just marked for later removal. The actual removal is
VACUUM's task. The reason for this is that a concurrent transaction might
still need to see the modified or deleted row. Now vacuum comes along and
wants to actually return that space to free space. On the master that works
fine because the master knows all concurrent transactions and what they
might still need, so vacuum will simply skip those rows.
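
To see this in action, here is a minimal sketch (the table name and data
are made up for illustration):

```sql
-- MVCC in action: an UPDATE leaves the old row versions behind as
-- "dead tuples"; VACUUM is what eventually makes their space reusable.
CREATE TABLE t (id int PRIMARY KEY, val text);
INSERT INTO t SELECT g, 'x' FROM generate_series(1, 1000) g;
UPDATE t SET val = 'y';          -- leaves 1000 dead row versions behind

-- The statistics views show the dead tuples waiting for vacuum:
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables WHERE relname = 't';

VACUUM t;                        -- reclaims the dead versions' space
```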

However, that knowledge does not extend to the replica. The master does not
know which transactions are running on the replica, so a vacuum operation
on the master might remove something that's still needed on the replica.
Now, that modification made by vacuum also needs to be replayed on the
replica. The way this works is that all modifications, whether an insert, a
vacuum, or any other change, are appended in sequential order to a log (the
write-ahead log, or WAL). This log is then simply shipped to the replica
and replayed.

It's not difficult to see that these changes must be replayed in the same
sequential order; otherwise you get chaos. Now imagine the replica
replaying a vacuum operation that removes rows still needed by a
transaction running on the replica, like your COPY. The replica has 2
choices:

- abort the transaction and prefer replaying WAL
- pause replaying WAL and wait for the long running transaction

The 1st choice is obviously bad for the transaction. The 2nd is bad for
everybody else, because WAL can only be replayed in the order it was
generated. So nothing that happened after that vacuum can be replayed,
which leads to stale data on the replica.
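
You can watch that lag build up yourself. A sketch using the standard
statistics views:

```sql
-- On the master: how far behind each replica is in applying WAL, in bytes.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On the replica: how stale the data is, in wall-clock time.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```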

One way to mitigate this is hot_standby_feedback. With it, the replica
periodically tells the master which old row versions it still needs to see.
The drawback is that tables on the master may accumulate garbage that
vacuum would normally have removed earlier. That can affect query
performance.
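
Enabling it only takes a reload on the replica, something like:

```sql
-- On the replica (a reload is enough, no restart needed):
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();

-- Verify it took effect:
SHOW hot_standby_feedback;
```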

Then you have the option to pause WAL replay in one way or another.
Increasing max_standby_streaming_delay, disconnecting from the master, or
explicitly pausing replay all fall in that category.
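
All three variants in sketch form, run on the replica (note that
primary_conninfo is only reloadable without a restart on PG 13 and later;
the 30min value is just an example):

```sql
-- Variant 1: give conflicting queries more time before cancellation.
ALTER SYSTEM SET max_standby_streaming_delay = '30min';
SELECT pg_reload_conf();

-- Variant 2: pause replay explicitly around the long-running job.
SELECT pg_wal_replay_pause();
-- ... run the COPY / pg_dump here ...
SELECT pg_wal_replay_resume();

-- Variant 3: disconnect from the master for the duration of the job.
ALTER SYSTEM SET primary_conninfo = '';
SELECT pg_reload_conf();
-- reconnect later by restoring primary_conninfo and reloading again
```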

The last option I know of would be to use logical replication. That comes
with other problems: DDL becomes a bit finicky, initial setup can be
tricky, and the process applying the changes can become a bottleneck.
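
The basic setup looks roughly like this (publication/subscription names
and the connection string are made up; only your botsession table is used
as an example):

```sql
-- On the master (publisher):
CREATE PUBLICATION dump_pub FOR TABLE public.botsession;

-- On the subscriber (a separate, writable instance):
CREATE SUBSCRIPTION dump_sub
    CONNECTION 'host=primary dbname=mydb user=repuser'
    PUBLICATION dump_pub;
```

A logical subscriber is not bound by the physical replay order, so your
COPY would not conflict with recovery there, but schema changes are not
replicated, which is the finicky DDL part.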

If you are really time-critical and you just want the COPY job done, and
neither lag nor bloat is acceptable, then maybe you create another
streaming replica, disconnect it from the master, run your COPY job and
destroy the replica. If 3TB is the database size, then that does not look
insurmountable. Of course, you need the resources. In my environment I'd
estimate 3-4 hours.

If you want a simple solution, then try hot_standby_feedback.

On Thu, May 23, 2024 at 12:46 PM Jethish Jethish <jethish777(at)gmail(dot)com>
wrote:

> Hi Torsten,
>
> I have tried increasing max_standby_streaming_delay, but I'm facing
> lag issues on the replica server.
>
> When I increase max_standby_streaming_delay, even if a query runs for 2
> minutes, I face lag for those 2 minutes.
>
> Please suggest here.
> Data size is 3TB
>
> On Thu, May 23, 2024, 3:53 PM Torsten Förtsch <tfoertsch123(at)gmail(dot)com>
> wrote:
>
>> As the error message says, your query was aborted due to it conflicting
>> with recovery. There are many ways to deal with that. You could enable
>> hot_standby_feedback on the replica. You could disconnect the replica from
>> the master for the time the COPY takes (reset primary_conninfo). You could
>> increase max_standby_streaming_delay. Perhaps you could also wrap the COPY
>> operation in pg_wal_replay_pause() / pg_wal_replay_resume().
>>
>> On Thu, May 23, 2024 at 11:59 AM Jethish Jethish <jethish777(at)gmail(dot)com>
>> wrote:
>>
>>> I'm frequently facing the below error while performing a backup. Can
>>> someone please tell me how to solve this issue?
>>>
>>>
>>> Failed : pg_dump: error: Dumping the contents of table "botsession"
>>> failed: PQgetResult() failed.
>>> pg_dump: error: Error message from server:
>>> ERROR: canceling statement due to conflict with recovery
>>> DETAIL: User query might have needed to see row versions that must be
>>> removed.
>>> pg_dump: error: The command was: COPY public.botsession (id, userid,
>>> data, iscompressed) TO stdout;
>>>
>>
