Re: Corruption during WAL replay

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, deniel1495(at)mail(dot)ru, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, tejeswarm(at)hotmail(dot)com, hlinnaka <hlinnaka(at)iki(dot)fi>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Daniel Wood <hexexpert(at)comcast(dot)net>
Subject: Re: Corruption during WAL replay
Date: 2022-03-25 03:43:01
Message-ID: 20220325034301.htu27xf54xjgyoca@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-03-24 19:43:02 -0700, Andres Freund wrote:
> Just to be sure I'm going to clean out serinus' ccache dir and rerun. I'll
> leave dragonet's alone for now.

Turns out they had the same dir. But it didn't help.

I haven't yet figured out why, but I *am* now able to reproduce the problem in
the buildfarm-built tree. I wonder if there's a path length issue or something
similar somewhere?

Either way, I can now manipulate the tests and still repro. I made the test
abort after the first failure.

hexedit shows that the file is modified, as we'd expect:
00000000 00 00 00 00 C0 01 5B 01 16 7D 00 00 A0 03 C0 03 00 20 04 20 00 00 00 00 00 00 00 00 00 00 00 00 ......[..}....... . ............
00000020 00 9F 38 00 80 9F 38 00 60 9F 38 00 40 9F 38 00 20 9F 38 00 00 9F 38 00 E0 9E 38 00 C0 9E 38 00 ..8...8.`.8.@.8. .8...8...8...8.

And we are checking the right file:

bf@andres-postgres-edb-buildfarm-v1:~/build/buildfarm-serinus/HEAD/pgsql.build$ tmp_install/home/bf/build/buildfarm-serinus/HEAD/inst/bin/pg_checksums --check -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/bin/pg_checksums/tmp_check/t_002_actions_node_checksum_data/pgdata --filenode 16391 -v
pg_checksums: checksums verified in file "/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/bin/pg_checksums/tmp_check/t_002_actions_node_checksum_data/pgdata/pg_tblspc/16387/PG_15_202203241/5/16391"
Checksum operation completed
Files scanned: 1
Blocks scanned: 45
Bad checksums: 0
Data checksum version: 1

If I twiddle further bits, I see that page failing checksum verification, as
expected.

I made the script copy the file before twiddling it around. The unmodified copy:
00000000 00 00 00 00 C0 01 5B 01 16 7D 00 00 A0 03 C0 03 00 20 04 20 00 00 00 00 E0 9F 38 00 C0 9F 38 00 ......[..}....... . ......8...8.
00000020 A0 9F 38 00 80 9F 38 00 60 9F 38 00 40 9F 38 00 20 9F 38 00 00 9F 38 00 E0 9E 38 00 C0 9E 38 00 ..8...8.`.8.@.8. .8...8...8...8.

So it's indeed modified.
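
For readers who want to see exactly which bytes differ, here is a minimal Python sketch that compares the first 64 bytes of the two dumps above. The byte values are copied from the hexedit output in this message; the names `ORIGINAL`, `MODIFIED` and `differing_offsets` are made up for illustration and are not part of the test suite.

```python
# First 64 bytes of each file, copied from the hexedit output in this message.
# bytes.fromhex() ignores the spaces between byte values.
MODIFIED = bytes.fromhex(
    "00 00 00 00 C0 01 5B 01 16 7D 00 00 A0 03 C0 03"
    "00 20 04 20 00 00 00 00 00 00 00 00 00 00 00 00"
    "00 9F 38 00 80 9F 38 00 60 9F 38 00 40 9F 38 00"
    "20 9F 38 00 00 9F 38 00 E0 9E 38 00 C0 9E 38 00"
)
ORIGINAL = bytes.fromhex(
    "00 00 00 00 C0 01 5B 01 16 7D 00 00 A0 03 C0 03"
    "00 20 04 20 00 00 00 00 E0 9F 38 00 C0 9F 38 00"
    "A0 9F 38 00 80 9F 38 00 60 9F 38 00 40 9F 38 00"
    "20 9F 38 00 00 9F 38 00 E0 9E 38 00 C0 9E 38 00"
)


def differing_offsets(a, b):
    """Return the byte offsets at which two equal-length buffers differ."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


diff = differing_offsets(ORIGINAL, MODIFIED)
for off in diff:
    print(f"0x{off:02X}: {ORIGINAL[off]:02X} -> {MODIFIED[off]:02X}")
```

Within these 64 bytes, every difference falls in the range 0x18 to 0x20, and the two bytes at offset 0x08 (16 7D) are identical in both copies.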

The only thing I can really conclude here is that we apparently end up with
the same checksum for exactly the modifications we are doing? Just on those
two damn instances? Reliably?
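
As a cross-check on where the stored checksum lives, the first 24 bytes of the dump can be decoded as a page header. This is a sketch only: it assumes the standard `PageHeaderData` layout from `bufpage.h` and a little-endian host, and `decode_page_header` and `FIELDS` are illustrative names, not existing tooling.

```python
import struct

# First 24 bytes of the page, copied from the hexedit output in this message.
# They are identical in the modified file and in the unmodified copy.
HEADER = bytes.fromhex(
    "00 00 00 00 C0 01 5B 01 16 7D 00 00 A0 03 C0 03"
    "00 20 04 20 00 00 00 00"
)

# Field order assumed from PageHeaderData: pd_lsn (as xlogid, xrecoff),
# pd_checksum, pd_flags, pd_lower, pd_upper, pd_special,
# pd_pagesize_version, pd_prune_xid. The line pointer array follows at offset 24.
FIELDS = (
    "xlogid", "xrecoff", "pd_checksum", "pd_flags", "pd_lower",
    "pd_upper", "pd_special", "pd_pagesize_version", "pd_prune_xid",
)


def decode_page_header(raw):
    """Decode the fixed 24-byte page header, assuming little-endian layout."""
    values = struct.unpack("<IIHHHHHHI", raw[:24])
    return dict(zip(FIELDS, values))


hdr = decode_page_header(HEADER)
for name in FIELDS:
    print(f"{name:20s} 0x{hdr[name]:X}")
```

Under those assumptions the stored pd_checksum is 0x7D16 in both copies, and the zeroed bytes start at offset 24, that is, in the line pointer array rather than in the header. So pg_checksums would have to recompute 0x7D16 from the modified page contents for the check to pass.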

Gotta make some food. Suggestions for what exactly to look at are welcome.

Greetings,

Andres Freund

PS: I should really rename the hostname of that machine one of these days...
