Re: "PANIC: could not open critical system index 2662" - twice

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Evgeny Morozov <postgresql3(at)realityexists(dot)net>, PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: "PANIC: could not open critical system index 2662" - twice
Date: 2023-05-08 13:45:20
Message-ID: CAFiTN-uKc46MGgkCB9Dim14_Xq23NQznZXSLRdZ-hgjxVBGRYA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, May 8, 2023 at 7:55 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Sun, May 07, 2023 at 10:30:52PM +1200, Thomas Munro wrote:
> > Bug-in-PostgreSQL explanations could include that we forgot it was
> > dirty, or some backend wrote it out to the wrong file; but if we were
> > forgetting something like permanent or dirty, would there be a more
> > systematic failure? Oh, it could require special rare timing if it is
> > similar to 8a8661828's confusion about permanence level or otherwise
> > somehow not setting BM_PERMANENT, but in the target blocks, so I think
> > that'd require a checkpoint AND a crash. It doesn't reproduce for me,
> > but perhaps more unlucky ingredients are needed.
> >
> > Bug-in-OS/FS explanations could include that a whole lot of writes
> > were mysteriously lost in some time window, so all those files still
> > contain the zeroes we write first in smgrextend(). I guess this
> > previously rare (previously limited to hash indexes?) use of sparse
> > file hole-punching could be a factor in an it's-all-ZFS's-fault
> > explanation:
>
> Yes, you would need a bit of all that.
>
> I can reproduce the same backtrace here. That's just my usual laptop
> with ext4, so this would be a Postgres bug. First, here are the four
> things running in parallel so as I can get a failure in loading a
> critical index when connecting:
> 1) Create and drop a database with WAL_LOG as strategy and the
> regression database as template:
> while true; do
> createdb --template=regression --strategy=wal_log testdb;
> dropdb testdb;
> done
> 2) Feeding more data to pg_class in the middle, while testing the
> connection to the database created:
> while true;
> do psql -c 'create table popo as select 1 as a;' regression > /dev/null 2>&1 ;
> psql testdb -c "select 1" > /dev/null 2>&1 ;
> psql -c 'drop table popo' regression > /dev/null 2>&1 ;
> psql testdb -c "select 1" > /dev/null 2>&1 ;
> done;
> 3) Force some checkpoints:
> while true; do psql -c 'checkpoint' > /dev/null 2>&1; sleep 4; done
> 4) Force a few crashes and recoveries:
> while true ; do pg_ctl stop -m immediate ; pg_ctl start ; sleep 4 ; done
>

I am able to reproduce this using the steps given above, I am also
trying to analyze this further. I will send the update once I get
some clue.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Oscar Carlberg 2023-05-08 14:35:14 ICU, locale and collation question
Previous Message Tom Lane 2023-05-08 13:44:18 Re: huge discrepancy between EXPLAIN cost and actual time (but the table has just been ANALYZED)