Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

From: Chris Travers <chris(dot)travers(at)gmail(dot)com>
To: aqoyyuum(at)cardaccess(dot)com(dot)bn
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429
Date: 2023-11-29 11:37:00
Message-ID: CAKt_ZfsnFvchw3=4XZr8-0ObOqhZvmULtGqVKyyK+6hQJjSDpw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Nov 29, 2023 at 4:36 PM Abdul Qoyyuum <aqoyyuum(at)cardaccess(dot)com(dot)bn>
wrote:

> Hi all,
>
> Knowing that it's a data corruption issue, the only way to fix this is to
> vacuum and reindex the database. What was suggested was the following:
>
> SET zero_damaged_pages = 0; # This is so that we can have the application
> to continue to run
> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if
> possible.
> REINDEX DATABASE "core"; # Then do a reindex and clean it up.
>

So first, to clear up some confusion on my part here:

This procedure doesn't make a lot of sense to me. But did it clear up the
issue?

In any of these cases, it is extremely important to diagnose the system
properly. If you have a fault in your storage device or RAID controller,
for example, you are asking for more corruption and data loss later.

At first I thought maybe you mistyped something and then realized there
were a few issues with the process so it actually didn't make sense.

First, zero_damaged_pages defaults to 0, and I can think of no reason to
set it explicitly.
Secondly, a vacuum full has to reindex, so there is no reason to do a
reindex following. Your whole procedure is limited to a vacuum full, when
a reindex is the only part that could affect this. If it did work,
reindexing is the only part that would have been helpful.

On to the question of what to do next....

>
> We're on Postgresql 12. This has worked before it happened (almost exactly
> a year ago) and I think this needs a more permanent solution. I've looked
> at routine vacuuming and checked the autovacuum is set to on and the
> following configurations:
>
> core=> select name, setting from pg_settings where name like 'autovacuum%';
> name | setting
> -------------------------------------+-----------
> autovacuum | on
> autovacuum_analyze_scale_factor | 0.1
> autovacuum_analyze_threshold | 50
> autovacuum_freeze_max_age | 200000000
> autovacuum_max_workers | 3
> autovacuum_multixact_freeze_max_age | 400000000
> autovacuum_naptime | 60
> autovacuum_vacuum_cost_delay | 2
> autovacuum_vacuum_cost_limit | -1
> autovacuum_vacuum_scale_factor | 0.2
> autovacuum_vacuum_threshold | 50
> autovacuum_work_mem | -1
> (12 rows)
>
> Can anyone advise if there's anything else we can do? We have no clue what
> causes the invalid page block and we are running a High Availability
> cluster set up but we are hoping that there may be a way to mitigate it.
>
>
You need to figure out why the corruption is happening. This is most
likely, in my experience, not a PostgreSQL bug, but usually something that
happens on the hardware layer or an environmental factor. It could be
failin storage or CPU. Or it could be something like bad electrical input
or insufficient cooling (I have seen index and even table corruption issues
from both of these).

If this is a server you run, the first things I would check are:
1. Is there a good-quality UPS that the server is plugged into? Are the
batteries in good working order?
2. Is the server somewhere that may be sitting in a pocket of hot air?

Once you have ruled these out, the next things to check are CPU, memory,
and storage health. Unfortunately checking these is harder but you can
check SMART indications, and other diagnostic indicators.

However, once these errors start happening, you are in danger territory and
need to find out why (and correct the underlying problem) before you get
data loss.

--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Davin Shearer 2023-11-29 15:32:54 Re: Emitting JSON to file using COPY TO
Previous Message Stephen Frost 2023-11-29 09:52:58 Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429