Add some more corruption error codes to relcache

From:	"Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Add some more corruption error codes to relcache
Date:	2023-06-16 13:17:48
Message-ID:	8EFE369D-A0BC-4698-B506-68A62C147A42@yandex-team.ru
Lists:	pgsql-hackers

Hi hackers,

From time to time, relcache errors reveal catalog corruption. For example, I recently observed the following:
1. A filesystem or NVMe disk zeroed out the leading 160 KB of a catalog index. This type of corruption passes through data_checksums.
2. RelationBuildTupleDesc() was failing with "catalog is missing 1 attribute(s) for relid 2662".
3. We monitor corruption error codes and alert on-call DBAs when we see one, but this message is not marked as XX001 or XX002. It is XX000, which also occurs from time to time for reasons less critical than data corruption.
4. High-availability automation switched the primary to another host, and the other monitoring checks did not fire either.
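
An alerting rule of the kind described in item 3 might look roughly like the sketch below. It is only an illustration, not the actual monitoring code: the helper name sqlstate_is_corruption is hypothetical. The three SQLSTATE values are the documented ones (XX000 internal_error, XX001 data_corrupted, XX002 index_corrupted).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* SQLSTATE values of class XX (Internal Error) as documented by PostgreSQL. */
#define SQLSTATE_INTERNAL_ERROR  "XX000"
#define SQLSTATE_DATA_CORRUPTED  "XX001"
#define SQLSTATE_INDEX_CORRUPTED "XX002"

/*
 * Hypothetical helper (not part of PostgreSQL): returns true when a logged
 * SQLSTATE should page the on-call DBA as a corruption event.
 * XX000 is deliberately excluded because it is also raised for many
 * less critical internal errors.
 */
bool
sqlstate_is_corruption(const char *sqlstate)
{
    if (sqlstate == NULL)
        return false;

    return strcmp(sqlstate, SQLSTATE_DATA_CORRUPTED) == 0 ||
           strcmp(sqlstate, SQLSTATE_INDEX_CORRUPTED) == 0;
}
```

With a rule like this, the "catalog is missing 1 attribute(s)" message reported as XX000 is not treated as corruption, which matches what happened in the incident above.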

This particular case is not very illustrative. In fact, we had index corruption that looked like catalog corruption.
Still, it seems to me that catalog inconsistencies (like relnatts != number of pg_attribute rows) could be marked with ERRCODE_DATA_CORRUPTED.
In my experience, this error code has proved to be a good indicator for early corruption detection.
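
To make the relnatts example concrete, here is a standalone sketch of the idea. It is not the attached patch and does not use the backend's error reporting machinery. The function relnatts_check_sqlstate is a hypothetical stand-in that only shows which SQLSTATE such an inconsistency would map to under this proposal.

```c
#include <stddef.h>

/* Documented SQLSTATE values: successful_completion and data_corrupted. */
#define SQLSTATE_SUCCESSFUL_COMPLETION "00000"
#define SQLSTATE_DATA_CORRUPTED        "XX001"

/*
 * Hypothetical model of the consistency check (not PostgreSQL source):
 * relnatts is the attribute count recorded in pg_class, attrs_found is the
 * number of matching pg_attribute rows actually read. Any mismatch is
 * classified as data corruption rather than as a generic internal error.
 */
const char *
relnatts_check_sqlstate(int relnatts, int attrs_found)
{
    if (relnatts != attrs_found)
        return SQLSTATE_DATA_CORRUPTED;

    return SQLSTATE_SUCCESSFUL_COMPLETION;
}
```

In the backend itself, the equivalent change would presumably be to attach ERRCODE_DATA_CORRUPTED where the error is currently raised without an explicit error code.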

What do you think?
What other subsystems can be improved in the same manner?

Best regards, Andrey Borodin.

Attachment	Content-Type	Size
v1-0001-Add-corruption-error-codes-to-relcache-entries.patch	application/octet-stream	3.6 KB

Responses

Re: Add some more corruption error codes to relcache at 2023-06-27 03:32:52 from Kirk Wolak
