Re: Print physical file path when checksum check fails

From: Hubert Zhang <hzhang(at)pivotal(dot)io>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Shawn Debnath <sdn(at)amazon(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Print physical file path when checksum check fails
Date: 2020-03-19 10:29:19
Message-ID: CAB0yrem+fGBtMB15Jxy64Qoq9iB-pQfy0BqfpSD4roGmWpkKOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have updated the patch based on the previous comments. Sorry for the late
patch.

I removed `SetZeroDamagedPageInChecksum` and add `zeroDamagePage` flag in
smgrread to determine whether we should zero damage page when an error
happens. It depends on the caller.

`GetRelationFilePath` is removed as well. We print segno on the fly.

On Thu, Feb 20, 2020 at 2:33 PM Hubert Zhang <hzhang(at)pivotal(dot)io> wrote:

> Thanks,
>
> On Thu, Feb 20, 2020 at 11:36 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>> Hi,
>>
>> On 2020-02-19 16:48:45 +0900, Michael Paquier wrote:
>> > On Wed, Feb 19, 2020 at 03:00:54PM +0900, Kyotaro Horiguchi wrote:
>> > > I have had support requests related to broken block several times, and
>> > > (I think) most of *them* had hard time to locate the broken block or
>> > > even broken file. I don't think it is useles at all, but I'm not sure
>> > > it is worth the additional complexity.
>> >
>> > I handle stuff like that from time to time, and any reports usually
>> > go down to people knowledgeable about PostgreSQL enough to know the
>> > difference. My point is that in order to know where a broken block is
>> > physically located on disk, you need to know four things:
>> > - The block number.
>> > - The physical location of the relation.
>> > - The size of the block.
>> > - The length of a file segment.
>> > The first two items are printed in the error message, and you can
>> > guess easily the actual location (file, offset) with the two others.
>>
>> > I am not necessarily against improving the error message here, but
>> > FWIW I think that we need to consider seriously if the code
>> > complications and the maintenance cost involved are really worth
>> > saving from one simple calculation.
>>
>> I don't think it's that simple for most.
>>
>> And if we e.g. ever get the undo stuff merged, it'd get more
>> complicated, because they segment entirely differently. Similar, if we
>> ever manage to move SLRUs into the buffer pool and checksummed, it'd
>> again work differently.
>>
>> Nor is it architecturally appealing to handle checksums in multiple
>> places above the smgr layer: For one, that requires multiple places to
>> compute verify them. But also, as the way checksums are computed depends
>> on the page format etc, it'll likely change for things like undo/slru -
>> which then again will require additional smarts if done above the smgr
>> layer.
>>
>
> So considering undo staff, it's better to move checksum logic into md.c
> I will keep it in the new patch.
>
> On 2020-02-19 16:48:45 +0900, Michael Paquier wrote:
>
> > Particularly, quickly reading through the patch, I am rather unhappy
>> > about the shape of the second patch which pushes down the segment
>> > number knowledge into relpath.c, and creates more complication around
>> > the handling of zero_damaged_pages and zero'ed pages. -- Michael
>>
>> I do not like the SetZeroDamagedPageInChecksum stuff at all however.
>>
>>
> I'm +1 on the first concern, and I will delete the new added function
> `GetRelationFilePath`
> in relpath.c and append the segno directly in error message inside
> function `VerifyPage`
>
> As for SetZeroDamagedPageInChecksum, the reason why I introduced it is
> that when we are doing
> smgrread() and one of the damaged page failed to pass the checksum check,
> we could not directly error
> out, since the caller of smgrread() may tolerate this error and just zero
> all the damaged page plus a warning message.
> Also, we could not just use GUC zero_damaged_pages to do this branch,
> since we also have ReadBufferMode(i.e. RBM_ZERO_ON_ERROR) to control it.
>
> To get rid of SetZeroDamagedPageInChecksum, one idea is to pass
> zero_damaged_page flag into smgrread(), something like below:
> ==
>
> extern void smgrread(SMgrRelation reln, ForkNumber forknum,
>
> BlockNumber blocknum, char *buffer, int flag);
>
> ===
>
>
> Any comments?
>
>
>
> --
> Thanks
>
> Hubert Zhang
>

--
Thanks

Hubert Zhang

Attachment Content-Type Size
0003-Print-physical-file-path-when-verify-checksum-failed.patch application/octet-stream 13.0 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2020-03-19 10:35:41 Re: Auxiliary Processes and MyAuxProc
Previous Message wenjing.zwj 2020-03-19 10:21:25 Re: [Proposal] Global temporary tables