Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

From: Chris Travers <chris(dot)travers(at)gmail(dot)com>
To: Abdul Qoyyuum <aqoyyuum(at)cardaccess(dot)com(dot)bn>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429
Date: 2023-11-30 04:13:41
Message-ID: CAKt_Zfu-GC5+6H+25Afd32KaaDr0cvrugj_1Wk=YpZnjU+VtvA@mail.gmail.com
Lists: pgsql-general

On Thu, Nov 30, 2023 at 9:03 AM Abdul Qoyyuum <aqoyyuum(at)cardaccess(dot)com(dot)bn>
wrote:

> Hi Chris,
>
> On Wed, Nov 29, 2023 at 7:38 PM Chris Travers <chris(dot)travers(at)gmail(dot)com>
> wrote:
>
>>
>>
>> On Wed, Nov 29, 2023 at 4:36 PM Abdul Qoyyuum <aqoyyuum(at)cardaccess(dot)com(dot)bn>
>> wrote:
>>
>>> Hi all,
>>>
>>> Knowing that it's a data corruption issue, the only way to fix this is
>>> to vacuum and reindex the database. What was suggested was the following:
>>>
>>> SET zero_damaged_pages = 0; # This is so that the application can continue
>>> to run
>>> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem
>>> if possible.
>>> REINDEX DATABASE "core"; # Then do a reindex and clean it up.
>>>
>>
>> So first, to clear up some confusion on my part here:
>>
>> This procedure doesn't make a lot of sense to me. But did it clear up
>> the issue?
>>
> Yeah, it did fix the issue before (same issue as last year), and it has fixed
> the problem that happened again a few days ago (almost exactly a year after
> the last occurrence).
>
>>
>> In any of these cases, it is extremely important to diagnose the system
>> properly. If you have a fault in your storage device or RAID controller,
>> for example, you are asking for more corruption and data loss later.
>>
>
>> At first I thought maybe you had mistyped something, but then I realized
>> there were a few issues with the process, so it actually didn't make sense.
>>
>> First, zero_damaged_pages defaults to off (0), and I can think of no reason
>> to set it explicitly here.
>> Secondly, a VACUUM FULL rebuilds the indexes anyway, so there is no reason to
>> follow it with a REINDEX. Your whole procedure effectively reduces to the
>> VACUUM FULL, and the index rebuild is the only part that could have affected
>> this. If it did work, the reindexing is what helped.
>>
> Oh that makes sense actually. Thanks.
>

So as a temporary workaround, it sounds like reindexing helps for now, but yeah,
this really needs deeper investigation.
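
If it recurs, here is a rough sketch of how you could narrow it down and rebuild
only the damaged index rather than reindexing the whole database. This assumes
the filenode from the error message (3192429) lives in the default tablespace of
the affected database, and the index name below is just a placeholder:

core=> -- map the filenode from the error message back to a relation
core=> SELECT pg_filenode_relation(0, 3192429);  -- 0 = the default tablespace
core=> -- if it resolves to an index, rebuild just that one (v12 supports CONCURRENTLY)
core=> REINDEX INDEX CONCURRENTLY the_index_it_returned;

If the filenode turns out to belong to a table rather than an index, a reindex
won't repair it and you would be looking at restoring that data from a backup or
a standby instead.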

>
>> On to the question of what to do next....
>>
>>>
>>> We're on PostgreSQL 12. This procedure has worked before, when the same
>>> error happened almost exactly a year ago, but I think this needs a more
>>> permanent solution. I've looked at routine vacuuming and checked that
>>> autovacuum is set to on, with the following configuration:
>>>
>>> core=> select name, setting from pg_settings where name like 'autovacuum%';
>>>                 name                 |  setting
>>> -------------------------------------+-----------
>>>  autovacuum                          | on
>>>  autovacuum_analyze_scale_factor     | 0.1
>>>  autovacuum_analyze_threshold        | 50
>>>  autovacuum_freeze_max_age           | 200000000
>>>  autovacuum_max_workers              | 3
>>>  autovacuum_multixact_freeze_max_age | 400000000
>>>  autovacuum_naptime                  | 60
>>>  autovacuum_vacuum_cost_delay        | 2
>>>  autovacuum_vacuum_cost_limit        | -1
>>>  autovacuum_vacuum_scale_factor      | 0.2
>>>  autovacuum_vacuum_threshold         | 50
>>>  autovacuum_work_mem                 | -1
>>> (12 rows)
>>>
>>> Can anyone advise if there's anything else we can do? We have no clue what
>>> causes the invalid page block. We are running a High Availability cluster
>>> setup, but we are hoping there may be a way to mitigate it.
>>>
>>>
>> You need to figure out why the corruption is happening. In my experience this
>> is most likely not a PostgreSQL bug, but something happening at the hardware
>> layer or an environmental factor. It could be failing storage or a failing
>> CPU, or it could be something like bad electrical input or insufficient
>> cooling (I have seen index and even table corruption from both of those).
>>
>> If this is a server you run, the first things I would check are:
>> 1. Is there a good-quality UPS that the server is plugged into? Are the
>> batteries in good working order?
>>
> The servers are dual-powered: they are hooked up to the mains supply, backed
> by a generator, and if that fails they switch over to the UPS. All of this is
> supplied and maintained by the data centre where the servers are hosted. There
> have been no electrical problems so far.
>
>> 2. Is the server somewhere that may be sitting in a pocket of hot air?
>>
> As you can imagine, the data centre has air-conditioning and floor fans
> blowing hot air up and out, keeping all the servers cooled. Checking the blade
> servers that the database VM sits on shows the temperature sitting at an
> optimal 65 degrees Celsius, so I don't think it's a temperature problem
> either.
>

OK, so this is in a professionally run datacenter. That does indeed eliminate at
least some of these issues.

>
>> Once you have ruled these out, the next things to check are CPU, memory, and
>> storage health. Unfortunately, checking these is harder, but you can look at
>> SMART indications and other diagnostic indicators.
>>
> Would these top stats be useful? The database does get busy, especially at
> peak hours.
>
> top - 10:04:25 up 782 days, 43 min, 2 users, load average: 1.20, 1.14,
> 1.10
> Tasks: 415 total, 3 running, 412 sleeping, 0 stopped, 0 zombie
> Cpu(s): 22.7%us, 13.8%sy, 0.0%ni, 63.2%id, 0.2%wa, 0.0%hi, 0.1%si,
> 0.1%st
> Mem: 8160156k total, 7832828k used, 327328k free, 34584k buffers
> Swap: 1048572k total, 304396k used, 744176k free, 6674428k cached
>
> We are running on SAN group disk storage. I guess I should take a closer look
> at the disks. Thanks for the pointer.
>

Anything in mcelog? That would be the place to check for memory issues if you
have ECC RAM (which I assume you do in a datacenter).

As far as the SAN goes, another thing I would pay attention to is whether there
are any indications of fsync errors in dmesg or other logs, in addition, of
course, to a review of the disk errors themselves.
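
One more thing worth checking from the database side: if the cluster was
initialized with data checksums (or you have enabled them since with the
pg_checksums tool that ships with v12), Postgres itself will complain about bad
pages coming back from the SAN, and v12 keeps per-database counters for that. A
quick sketch of what to look at, assuming checksums are on:

core=> SHOW data_checksums;  -- 'on' means page checksums are verified on read
core=> SELECT datname, checksum_failures, checksum_last_failure
core->   FROM pg_stat_database;  -- failure counters are new in v12

If data_checksums comes back off, enabling it (pg_checksums --enable, run with
the cluster cleanly shut down) at least turns silent storage corruption into an
error you can alert on.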

>
>>
>> However, once these errors start happening, you are in danger territory
>> and need to find out why (and correct the underlying problem) before you
>> get data loss.
>>
> Thanks for the warning. We'll see what we can find.
>

>
>> --
>> Best Wishes,
>> Chris Travers
>>
>> Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
>> lock-in.
>> http://www.efficito.com/learn_more
>>
>

--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more
