From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dmitriy Sarafannikov <dsarafannikov(at)yandex(dot)ru> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Borodin Vladimir <root(at)simply(dot)name> |
Subject: | Re: Broken hint bits (freeze) |
Date: | 2017-05-26 18:39:48 |
Message-ID: | CAA4eK1L2n8stKZvOp3yrOCZJi4Z5eDuzmvhdPdbZUAc9qacRPg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, May 23, 2017 at 10:50 PM, Dmitriy Sarafannikov
<dsarafannikov(at)yandex(dot)ru> wrote:
> Hi hackers,
>
> We have some problems on our production with hint bits and frozen tuples.
> More and more following errors began to appear on master after switchover:
> ERROR: 58P01: could not access status of transaction 1952523525
> DETAIL: Could not open file "pg_clog/0746": No such file or directory.
> LOCATION: SlruReportIOError, slru.c:896
>
> We investigated the problem with pageinspect and found the tuples that are the cause:
>
> xdb311g(master)=# select * from mytable where ctid = '(4,21)';
> ERROR: 58P01: could not access status of transaction 1951521353
> DETAIL: Could not open file "pg_clog/0745": No such file or directory.
> LOCATION: SlruReportIOError, slru.c:896
>
> But the same query successfully executed on replica.
>
> We found some difference in hint bits between master and replica:
>
> xdb311g(master)=# SELECT t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) FROM heap_page_items(get_raw_page(‘mytable',4)) where lp=21;
> -[ RECORD 1 ]------------------------------
> t_xmin | 1951521353
> ?column? | 00000000000000000000000000000000
>
> old master, now replica:
> xdb311e(replica)=# SELECT t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) FROM heap_page_items(get_raw_page(‘mytable',4)) where lp=21;
> -[ RECORD 1 ]------------------------------
> t_xmin | 1951521353
> ?column? | 00000000000000000000001100000000
>
> X’0300’ = HEAP_XMIN_FROZEN according to
>
> #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
> #define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
> #define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
>
> xdb311g(master)=# select relfrozenxid from pg_class where relname = ‘mytable';
> relfrozenxid
> --------------
> 2266835605
> (1 row)
>
> This tuple must be frozen but there are no set bits HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID on master
>
> Another interesting thing that LSN of this page on master and replica are not the same:
> xdb311g(master)=# select lsn from page_header(get_raw_page(‘mytable',4));
> lsn
> ---------------
> 8092/6A26DD08
> (1 row)
>
> xdb311e(replica)=# select lsn from page_header(get_raw_page(‘mytable',4));
> lsn
> ---------------
> 838D/C4A0D280
> (1 row)
>
> And LSN on replica is greater that LSN on master (838D/C4A0D280 > 8092/6A26DD08)
> How can this be possible?
>
Yeah, I think this is quite suspicious. This seems to indicate that
not all WAL records are replicated before the switchover. What is the
value of "synchronous_commit" you are using? I think you somehow need
to ensure before switchover that all the WAL is replicated to ensure
this is not a setup problem.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Jeff Janes | 2017-05-26 19:07:50 | Re: logical replication - still unstable after all these months |
Previous Message | Thomas Munro | 2017-05-26 18:19:37 | Re: Extra Vietnamese unaccent rules |