Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Pawel Kudzia <kudzia(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Date: 2021-07-16 07:06:59
Message-ID: CAJYBUS9_VUUYfAen+_pWGa4xxRR-2uvX+_Ycg6W+NqBXVwLY2Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thu, Jul 15, 2021 at 8:49 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 15/07/2021 12:38, Pawel Kudzia wrote:
> >>> How can I identify relevant pages to make selective dump of them for
> >>> you, using method proposed by Peter?
> >>
> >> It's a bit hard without knowing which index pages are affected. But the
> >> attached script (dump-gin-page-info.sql) dumps some summary information
> >> about the kind of pages there are. That's a start.
> >
> > I'm attaching output of the proposed commands.
>
> Thanks!
>
> >> Let's try one thing first: I rebased a patch to add GIN support to the
> >> amcheck extension [1]. I kind of doubt that it will catch the corruption
> >> in this case, but you never know. So please apply the attached
> >> v2-0001-Amcheck-for-GIN-13stable.patch patch. It is the same patch I
> >> posted at [1], but the installation script is slightly different to make
> >> it work on PostgreSQL v13. Then compile install it:
> >>
> >> cd contrib/amcheck; make install
> >>
> >> Then in psql:
> >>
> >> CREATE EXTENSION amcheck;
> >> gin_index_parent_check('entity_attribute_name_ids_gin');
> >
> > Sadly i'm getting this when executing CREATE EXTENSION amcheck;
> >
> > ERROR: extension "amcheck" has no installation script nor update path
> > for version "1.2.1"
>
> Hmm, that's odd, it worked for me. Did "make install" copy the
> "amcheck--1.2--1.2.1.sql" file to the right place? It should be
> installed the same directory as "amcheck.control".
>
>

thank you for the hint. i've messed up patching. now i can get one step
further:

data=# CREATE EXTENSION amcheck;
CREATE EXTENSION
data=# gin_index_parent_check('entity_attribute_name_ids_gin');
ERROR: syntax error at or near "gin_index_parent_check"
LINE 1: gin_index_parent_check('entity_attribute_name_ids_gin');
^

i've also applied trace-gin-readbuffer-2.patch

> Anyway, let's try a different tack. Here's another debugging patch. It
> causes a LOG message to be printed whenever a GIN index page is
> accessed, like this:
>
> 2021-07-15 21:22:20.119 EEST [972708] LOG: ReadBuffer 1663/12678/27035
> read gin blk 0 (ginget.c:1832 scanPendingInsert)
> 2021-07-15 21:22:20.119 EEST [972708] STATEMENT: explain analyze select
> * from test_intarray where a @> '{2}';
> 2021-07-15 21:22:20.119 EEST [972708] LOG: ReadBuffer 1663/12678/27035
> read gin blk 1 (ginbtree.c:89 ginFindLeafPage)
> 2021-07-15 21:22:20.119 EEST [972708] STATEMENT: explain analyze select
> * from test_intarray where a @> '{2}';
>
> Please apply this patch, run the query again, and copy the resulting
> log.

that'll be the file called "log", send directly to your e-mail address.

> I'm also interested in the contents of the pages that are accessed,
> so if you could then run these commands to dump the contents of those
> index pages:
>
> # extract just the block numbers from the log
> cat <path to postgres log> | perl -ne '/.*read gin blk (\d+)/ && print
> "$1\n" ' | sort -n |uniq > /tmp/blocknums
>
> # dump contents of each of those index pages to '/tmp/block-contents'
> psql postgres -c "create temp table blocknums(blknum int); copy
> blocknums from '/tmp/blocknums'; copy (select blknum,
> get_raw_page('entity_attribute_name_ids_gin', blknum) from blocknums) to
> '/tmp/block-contents';"
>

done, side note - i had to run "create extension pageinspect" and provide
name of my database - "data" rather than "postgres".

Please send over the resulting PostgreSQL log, and
> '/tmp/block-contents', and I'll take a closer look into the contents to
> see if there's something funny. (A private email is OK if you don't feel
> comfortable sharing it with the world).
>
> One more thing, please also add the 'ctid' virtual column to the query,
> like this:
>
> SELECT ctid, entity_id FROM entity WHERE
> ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0)
> && '{1737}') LIMIT 10;
>

that's how the queries above were run. output:

ctid | entity_id
--------------+-----------
(4002784,1) | 38048120
(4002869,14) | 95333744
(2 rows)

> Also, did you happen to keep the WAL log from around the time that the
> query started producing wrong results? That would also be useful
> evidence for how it got into this situation.
>

sadly i don't. as noted earlier - we've already switched the production
setup so i have no way of leading to corruption again, i only have two
file-level backups of the database with two different inconsistencies in
the SELECTs.

greetings!

--
regards,
Pawel Kudzia

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Sergei Kornilov 2021-07-16 07:53:03 Re: BUG #17111: Database created, cannot be created, but reported as inexist
Previous Message Alvaro Herrera 2021-07-16 02:49:11 Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size