Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Pawel Kudzia <kudzia(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Date: 2024-06-13 18:49:48
Message-ID: CAJYBUS-9-8Op5o7M0Fgh4=MKR0DrSBoS2RYqHzAk-30=paSeFQ@mail.gmail.com
Lists: pgsql-bugs

On Mon, Sep 6, 2021 at 10:18 AM Pawel Kudzia <kudzia(at)gmail(dot)com> wrote:
>
> On Mon, Aug 2, 2021 at 10:07 AM Pawel Kudzia <kudzia(at)gmail(dot)com> wrote:
> >
> > On Thu, Jun 17, 2021 at 4:07 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >
> > > Pawel Kudzia <kudzia(at)gmail(dot)com> writes:
> > > > with help from IRC we've found that decreasing work_mem from 1MB to 256kB
> > > > or less makes the problem go away:
> > >
> > > Hmm. So that suggests that the index itself is *not* corrupt,
> > > but the problem is associated with a bug in the indexscan
> > > algorithms.
> > >
> > > Have you experimented with different index opclasses? Your
> > > original report said you were using gin__int_ops, but that's
> > > not the default, and we sort of deprecate intarray these days.
> >
> > Hello,
> >
> > Three weeks ago we changed from
> >
> > CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
> > (attribute_name_ids public.gin__int_ops);
> >
> > into
> >
> > CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
> > (attribute_name_ids);
> >
> > Sadly, the problems are back.
> >
> > I have:
> > * queries that return incorrect responses:
> > SET enable_seqscan = OFF;
> > SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1485}' )
> > AND NOT ( (attribute_name_ids||0) && '{1485}') LIMIT 10;
> > SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{144}' )
> > AND NOT ( (attribute_name_ids||0) && '{144}') LIMIT 10;
> > SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{2652}' )
> > AND NOT ( (attribute_name_ids||0) && '{2652}') LIMIT 10;
> >
> > all three return:
> > entity_id | primname
> > -----------+--------------------
> > 99311962 | msci_674591ltepsgt
> > 99312880 | msci_674667mgu
> >
> > data=# explain analyze SELECT entity_id FROM entity WHERE (
> > attribute_name_ids && '{2652}' ) AND NOT ( (attribute_name_ids||0) &&
> > '{2652}') LIMIT 10;
> > QUERY PLAN
> > -----------------------------------------------------------------------------------------------------------------------------------------------------
> > Limit (cost=30.59..31.71 rows=10 width=8) (actual
> > time=149.988..200.749 rows=2 loops=1)
> > -> Bitmap Heap Scan on entity (cost=30.59..3798.81 rows=33644
> > width=8) (actual time=149.985..200.744 rows=2 loops=1)
> > Recheck Cond: (attribute_name_ids && '{2652}'::integer[])
> > Rows Removed by Index Recheck: 79741
> > Filter: (NOT ((attribute_name_ids || 0) && '{2652}'::integer[]))
> > Rows Removed by Filter: 16723
> > Heap Blocks: exact=8277 lossy=9989
> > -> Bitmap Index Scan on entity_attribute_name_ids_gin
> > (cost=0.00..30.59 rows=33813 width=0) (actual time=9.712..9.713
> > rows=37367 loops=1)
> > Index Cond: (attribute_name_ids && '{2652}'::integer[])
> > Planning Time: 0.472 ms
> > Execution Time: 200.977 ms
> > (11 rows)
> >
> >
> > * the first time the hourly consistency check reported the
> > inconsistency [ 2021-08-01 19:04:15 for entity_id=99312880,
> > 2021-08-01 23:04:15 for entity_id=99311962 ]
> > * timestamps when the affected rows were updated [ entity_id=99312880:
> > 2021-08-01 16:51:59, 2021-08-01 19:04:06 [ most likely the update that
> > led to the inconsistency ], 2021-08-01 23:04:14, 2021-08-02 01:08:18,
> > 2021-08-02 05:12:15; entity_id=99311962: 2021-08-01 16:51:59,
> > 2021-08-01 19:04:05, 2021-08-01 23:04:13 [ likely culprit ],
> > 2021-08-02 01:08:16 ]
> > * WAL files from the **master server** from the time when the
> > inconsistency started to occur [ from 2021-08-01 14:50:37 till
> > 2021-08-02 04:31:00; sadly I don't have enough to cover the whole
> > period between the snapshots mentioned below ]
> > * file-level snapshots of the database taken from a streaming
> > **replication slave** before the problem started to occur [ 2021-08-01
> > 12:00:01 ] and after it occurred [ 2021-08-02 00:00:01 ]
> >
> > I'll be happy to run any diagnostics needed or to provide access to
> > any particular WAL file.
> >
>
> Hello,
>
> We continue to run into this issue and we still do not have any
> reliable method to reproduce it. Is there any diagnostic information
> that we could provide to make it easier to resolve?
>
> Thank you!
>
>

Hello,

I'd like to revive this thread. In the meantime we've upgraded to
PostgreSQL 16.3-1.pgdg110+1, moved to new x64 hardware, and upgraded
to Debian Bookworm.

And we still observe occasional GIN index corruption, resulting in
'impossible' queries like the one below that return rows which clearly
do not match the WHERE criteria:

data=# SELECT entity_id FROM entity WHERE ( attribute_name_ids &&
'{49}' ) AND NOT ( (attribute_name_ids||0) && '{49}');
entity_id
-----------
88036166
(1 row)

data=# explain analyze SELECT entity_id FROM entity WHERE (
attribute_name_ids && '{49}' ) AND NOT ( (attribute_name_ids||0) &&
'{49}');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on entity (cost=59.10..361.74 rows=3011 width=8)
(actual time=41.614..46.918 rows=1 loops=1)
Recheck Cond: (attribute_name_ids && '{49}'::integer[])
Filter: (NOT ((attribute_name_ids || 0) && '{49}'::integer[]))
Rows Removed by Filter: 2798
Heap Blocks: exact=4715
-> Bitmap Index Scan on entity_attribute_name_ids_gin
(cost=0.00..59.10 rows=3026 width=0) (actual time=6.563..6.564
rows=17842 loops=1)
Index Cond: (attribute_name_ids && '{49}'::integer[])
Planning Time: 0.115 ms
Execution Time: 47.032 ms
(9 rows)
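
For completeness, a minimal sketch of one way such a row can be
cross-checked against the heap (it assumes entity_id is the table's
primary key, so the lookup below never touches the GIN index and the
&& predicate is evaluated on the heap tuple itself):

SELECT entity_id,
       attribute_name_ids && '{49}'::integer[] AS heap_says_match
FROM entity
WHERE entity_id = 88036166;

For the row above heap_says_match should come back false, which is
exactly the inconsistency: the bitmap scan on
entity_attribute_name_ids_gin returned a tuple whose own array does
not contain 49.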

The only step needed to get into this state is to leave our update
processes running for a few weeks.

Our workaround, which has been working well for us for the past 4
years, is to run REINDEX INDEX CONCURRENTLY
entity_attribute_value_ids_gin every 24 hours.
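
For reference, a sketch of that daily maintenance job (the index name
is specific to our schema; REINDEX ... CONCURRENTLY requires
PostgreSQL 12 or newer, which we meet on 16.3):

REINDEX INDEX CONCURRENTLY entity_attribute_value_ids_gin;

After the rebuild the 'impossible' queries above return no rows again,
until the corruption reappears a few weeks later.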

If there's any additional diagnostic data that we can provide to help
find and fix the root cause, please let me know.
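
If raw index pages would help, we could for example provide output
from the pageinspect extension (a hypothetical sketch; block 0 is the
GIN metapage):

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT * FROM gin_metapage_info(
  get_raw_page('entity_attribute_name_ids_gin', 0));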

Thank you!

--
regards,
Pawel Kudzia
