Re: BUG #16760: Standby database missed records for at least 1 table

From: Andriy Bartash <abartash(at)xmatters(dot)com>
To: Euler Taveira <euler(dot)taveira(at)2ndquadrant(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16760: Standby database missed records for at least 1 table
Date: 2020-12-09 00:03:11
Message-ID: SA2PR02MB7547D9B3F15931FCDBDAFCBABBCD0@SA2PR02MB7547.namprd02.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi Euler
I inspected pages where I expected to have my data on broken standby.
What was done:

1. Identified the page where was my data (on primary cluster)

select ctid,audit_ev_id from audit_evs_all where audit_ev_id between 221535154 and 221535162;

and it returned
ctid,audit_ev_id
"(1979,3)",221535154
"(1980,13)",221535155
"(1980,14)",221535156
"(1980,28)",221535157
"(1980,29)",221535158
"(1982,6)",221535159
"(1982,7)",221535160
"(1982,32)",221535161
"(1982,33)",221535162

Found that page # 1982 had 4 records on primary (ev_id 221535159, 221535160, 221535161 and 221535162)

1. Same query on Broken secondary showed that page # 1982 had only 1 record

(1982,33) | 221535162

(1 row)

1. Managed to restore postgres cluster from physical backup as of 24hr after secondary became corrupted and got all records from page # 1982

SELECT when_created,(ctid::text::point)[0]::bigint AS page_number, audit_ev_id from audit_evs_all where (ctid::text::point)[0]::bigint = 1982 and when_created <'2020-12-02' order by 1 ;
when_created | page_number | audit_ev_id
-------------------------------+-------------+-------------
2020-11-04 05:39:40.524816+00 | 1982 | 218065111
2020-11-04 05:39:41.551833+00 | 1982 | 218065112
2020-11-09 08:57:29.458528+00 | 1982 | 218711001
2020-11-09 08:57:29.458528+00 | 1982 | 218711000
2020-11-15 00:40:10.2285+00 | 1982 | 219765000
2020-11-15 00:40:10.2285+00 | 1982 | 219765001
2020-11-15 00:40:10.2285+00 | 1982 | 219765003
2020-11-15 00:40:10.2285+00 | 1982 | 219765004
2020-11-15 00:40:10.2285+00 | 1982 | 219765005
2020-11-15 00:40:10.2285+00 | 1982 | 219765006
2020-11-15 00:40:10.2285+00 | 1982 | 219765007
2020-11-15 00:40:10.2285+00 | 1982 | 219765008
2020-11-15 00:40:10.2285+00 | 1982 | 219765009
2020-11-15 00:40:10.2285+00 | 1982 | 219765010
2020-11-15 00:40:10.2285+00 | 1982 | 219765011
2020-11-15 00:40:10.2285+00 | 1982 | 219765002
2020-11-19 23:16:08.365151+00 | 1982 | 220198480
2020-11-19 23:16:08.365151+00 | 1982 | 220198479
2020-11-27 00:33:36.568531+00 | 1982 | 220990024
2020-11-27 00:33:36.568531+00 | 1982 | 220990023
2020-11-27 00:33:36.568531+00 | 1982 | 220990025
2020-12-01 00:00:20.955348+00 | 1982 | 221535161
2020-12-01 00:00:20.955348+00 | 1982 | 221535162
2020-12-01 00:00:20.955348+00 | 1982 | 221535160
2020-12-01 00:00:20.955348+00 | 1982 | 221535159
(25 rows)

1. Same query on broken secondary returns 22 records (same missed records with ev_id 221535159, 221535160, 221535161)

when_created | page_number | audit_ev_id
-------------------------------+-------------+-------------
2020-11-04 05:39:40.524816+00 | 1982 | 218065111
2020-11-04 05:39:41.551833+00 | 1982 | 218065112
2020-11-09 08:57:29.458528+00 | 1982 | 218711001
2020-11-09 08:57:29.458528+00 | 1982 | 218711000
2020-11-15 00:40:10.2285+00 | 1982 | 219765000
2020-11-15 00:40:10.2285+00 | 1982 | 219765001
2020-11-15 00:40:10.2285+00 | 1982 | 219765006
2020-11-15 00:40:10.2285+00 | 1982 | 219765007
2020-11-15 00:40:10.2285+00 | 1982 | 219765008
2020-11-15 00:40:10.2285+00 | 1982 | 219765009
2020-11-15 00:40:10.2285+00 | 1982 | 219765010
2020-11-15 00:40:10.2285+00 | 1982 | 219765011
2020-11-15 00:40:10.2285+00 | 1982 | 219765002
2020-11-15 00:40:10.2285+00 | 1982 | 219765003
2020-11-15 00:40:10.2285+00 | 1982 | 219765004
2020-11-15 00:40:10.2285+00 | 1982 | 219765005
2020-11-19 23:16:08.365151+00 | 1982 | 220198479
2020-11-19 23:16:08.365151+00 | 1982 | 220198480
2020-11-27 00:33:36.568531+00 | 1982 | 220990025
2020-11-27 00:33:36.568531+00 | 1982 | 220990024
2020-11-27 00:33:36.568531+00 | 1982 | 220990023
2020-12-01 00:00:20.955348+00 | 1982 | 221535162
(22 rows)

Conclusion: Secondary postgres cluster is corrupted, 3 records missed.

Does it make sense?

From: Euler Taveira <euler(dot)taveira(at)2ndquadrant(dot)com>
Date: Thursday, December 3, 2020 at 10:20 AM
To: Andriy Bartash <abartash(at)xmatters(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16760: Standby database missed records for at least 1 table
On Wed, 2 Dec 2020 at 17:44, PG Bug reporting form <noreply(at)postgresql(dot)org<mailto:noreply(at)postgresql(dot)org>> wrote:
Below output from Primary
select audit_ev_id, when_created from audit_evs_all where audit_ev_id
between 221535154 and 221535162;
audit_ev_id | when_created
-------------+-------------------------------
221535154 | 2020-12-01 00:00:20.955348+00
221535155 | 2020-12-01 00:00:20.955348+00
221535156 | 2020-12-01 00:00:20.955348+00
221535157 | 2020-12-01 00:00:20.955348+00
221535158 | 2020-12-01 00:00:20.955348+00
221535159 | 2020-12-01 00:00:20.955348+00
221535160 | 2020-12-01 00:00:20.955348+00
221535161 | 2020-12-01 00:00:20.955348+00
221535162 | 2020-12-01 00:00:20.955348+00
(9 rows)
---------------------------------------------------------------
Same query's output from 1st Secondary
select audit_ev_id, when_created from audit_evs_all where audit_ev_id
between 221535154 and 221535162;
audit_ev_id | when_created
-------------+-------------------------------
221535162 | 2020-12-01 00:00:20.955348+00
(1 row)
I would start this investigation, disabling the index usage for this query.

-- check which index has been used
explain (analyze) select audit_ev_id, when_created from audit_evs_all where audit_ev_id
between 221535154 and 221535162;
-- disable index usage
begin;
set local enable_indexscan to 0;
set local enable_indexonlyscan to 0;
explain (analyze) select audit_ev_id, when_created from audit_evs_all where audit_ev_id
between 221535154 and 221535162;
select audit_ev_id, when_created from audit_evs_all where audit_ev_id
between 221535154 and 221535162;
commit;

If the second EXPLAIN isn't using the index shown in the first EXPLAIN and the
SELECT returns all rows, reindex the index that has been used. If the index is
not the culprit, you should inspect the page(s) that contain those rows using
pageinspect.

--
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2020-12-09 00:25:26 BUG #16768: PostgreSQL 12.5 - INSERT ON CONFLICT succeeds on data INSERT but does nothing on UPDATE
Previous Message Matthias van de Meent 2020-12-08 18:50:43 Re: BUG #16767: Silent dropping of CONSTRAINT... UNIQUE