Quick Links

Re: Logging corruption error codes

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Анна Крханбарова <annkpx(at)yandex-team(dot)ru>, Dmitriy Sarafannikov <dsarafan(at)yandex-team(dot)ru>
Subject:	Re: Logging corruption error codes
Date:	2019-06-24 01:59:20
Message-ID:	20190624015920.GA1637@paquier.xyz
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

On Fri, Jun 21, 2019 at 03:22:15PM +0500, Andrey Borodin wrote:
> 20 июня 2019 г., в 22:09, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> написал(а):
>> This is not totally insane -- other similar messages such as 'corrupted
>> page pointers' in bufpage.c get the same errcode.
>
> On master there is only
> elog(ERROR, "incorrect index offsets supplied");
> in bufpage.c. But this indicate misuse, not corrupted data on disk.
> Others already use ERRCODE_DATA_CORRUPTED.

Yeah. We may be able to expand the use of ERRCODE_DATA_CORRUPTED a
bit more in the area. No objections against that.

>> I would like to have a separate marking for messages indicating a
>> system-level permanent problem rather than user error ("table/column X
>> does not exist"), retryable condition ("serializability violation"), or
>> resource exhaustion ("out of memory", "too many clients"),
>
> Good idea, but there must be standards to which we comply?

I am not completely sure what you are going at here? Are you aiming
at creating more wrappers which are errno-specific, like
errcode_for_file_access() and such?

There is a set of categories in errcodes.txt which make a rather clear
classification of the different types of failures which can happen:
- Corruption cases are part of the "cannot happen" case.
- Serializable functions are related to transactionability.
- OOM is system-related.
- Too many clients is a Postgres-specific limitation.
So these look rather clear to me.

> I believe we should not report hardware problems as corruption. But
> this worries us (YC) too. Do you think that this problem deserve a
> patch?

The difference may not be easy here. Postgres has no idea if the
problem comes from the FS, the kernel or the hardware itself. We just
report the problem that the OS tells us about so I don't think that we
should try to be smarter than the OS telling us something. elog()
points to an internal error which should never happen. By that, it
seems to me that we should use elog() for cases where some code paths
should never be logically reached. Now, there is no hard line here,
sometimes we finish by using elog() on purpose as a sanity check if
for example catalog data gets corrupted (you cannot use an assertion
if you rely on an on-page state, obviously), so improving things with
a case-by-case approach is fine in my opinion..

> If we introduce new error code - this is, kind of, new
> feature. Should I send it to pgsql-hackers?

Yes, I don't think that new error codes would gain a back-patch.
So discussing that on -hackers would be more adapted in my opinion.

Looking at the patch you propose, I don't have an argument against
applying what you propose FWIW. These suggestions seem sensible. But
that's not really a bug as the code works properly even without your
changes.

Note: the indentation of the patch is incorrect everywhere. A
committer applying your patch would apply pgindent anyway. :)
--
Michael

In response to

Re: Logging corruption error codes at 2019-06-21 10:22:15 from Andrey Borodin

Responses

Re: Logging corruption error codes at 2019-06-25 10:01:23 from Andrey Borodin

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Michael Paquier	2019-06-24 02:17:28	Re: Function pg_database_size fails with "Permission denied" on a corrupted fsm file
Previous Message	PG Bug reporting form	2019-06-23 15:31:48	BUG #15869: Custom aggregation returns null when parallelized