Re: Issue on restore / recover

From: flumbador(at)virgilio(dot)it
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Issue on restore / recover
Date: 2018-01-10 16:28:32
Message-ID: 858243188.823822.1515601712820@mail.virgilio.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi All,

I reproduced the issue with the following test case:

1) Create a table for testing porpouse:

pdbabe=# CREATE TABLE test_corrupt (order_date timestamptz, item char(1024));
CREATE TABLE

2) Insert some data in the table

pdbabe=# INSERT INTO test_corrupt (order_date, item)
SELECT x, 'generic item'
FROM generate_series('2018-01-01 00:00:00'::timestamptz, '2018-01-01 00:10:00'::timestamptz,'60 seconds'::interval) a(x);
INSERT 0 11

pdbabe=# select order_date, substring(item, 1, 20) from test_corrupt ;
order_date | substring
------------------------+--------------
2018-01-01 00:00:00+01 | generic item
2018-01-01 00:01:00+01 | generic item
2018-01-01 00:02:00+01 | generic item
2018-01-01 00:03:00+01 | generic item
2018-01-01 00:04:00+01 | generic item
2018-01-01 00:05:00+01 | generic item
2018-01-01 00:06:00+01 | generic item
2018-01-01 00:07:00+01 | generic item
2018-01-01 00:08:00+01 | generic item
2018-01-01 00:09:00+01 | generic item
2018-01-01 00:10:00+01 | generic item

3) Check the file that contain the table:

pdbabe=# SELECT pg_relation_filepath('test_corrupt');
pg_relation_filepath
----------------------------------------------
pg_tblspc/16385/PG_9.4_201409291/16386/49391
(1 row)

pgpdbabe(at)elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 16384 Jan 9 14:13 49391
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm

4) At 2.40 A.M. Start the backup of the database (filesystem backup):

pg_start_backup
-----------------
17/30000028
(1 row)

5) At 2:57 A.M. new record is inserted in the table.

insert into test_corrupt values (current_timestamp, 'new item') ;

After the insert the content of the table is:

pdbabe=# select order_date, substring(item, 1, 20) from test_corrupt ;
order_date | substring
-------------------------------+--------------
2018-01-01 00:00:00+01 | generic item
2018-01-01 00:01:00+01 | generic item
2018-01-01 00:02:00+01 | generic item
2018-01-01 00:03:00+01 | generic item
2018-01-01 00:04:00+01 | generic item
2018-01-01 00:05:00+01 | generic item
2018-01-01 00:06:00+01 | generic item
2018-01-01 00:07:00+01 | generic item
2018-01-01 00:08:00+01 | generic item
2018-01-01 00:09:00+01 | generic item
2018-01-01 00:10:00+01 | generic item
2018-01-10 02:57:01.106676+01 | new item
(12 rows)

No other insert / update / delete are executed against the table until the restore / recover.

6) The backup of filesystem on which resides the Postgresql, ended at 4:01 A.M

pg_stop_backup | now
----------------+-------------------------------
17/7EFE3900 | 2018-01-10 04:01:53.496869+01
(1 row)

8) Restore of the postgresql database, by executung the restore of the previous filesystem backup.

9) Contets of backup_label restored:

pgpdbabe(at)elaawddbad00:/pgsql/pdbabe/data/pdbabe> more backup_label
START WAL LOCATION: 17/30000028 (file 000000070000001700000030)
CHECKPOINT LOCATION: 17/30000060
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2018-01-10 02:40:00 CET
LABEL: bckdb

10) Delete of the file containg the table "test_corrupt" to simulate the file missing in the backup:

pgpdbabe(at)elaawddbad00:/pgsql/pdbabe/data/pdbabe> cd $PGDATA/pg_tblspc/16385/PG_9.4_201409291/16386
pgpdbabe(at)elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 16384 Jan 10 02:57 49391
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm
pgpdbabe(at)elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> rm ./49391
pgpdbabe(at)elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm
pgpdbabe(at)elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386>

12) Create file recovery.conf:

restore_command = 'cp /pgsql/pdbabe/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /pgsql/pdbabe/archive %r'

13) Start Postgresql to recover the database ... Recover is successfull ... NO error in logfile:

2018-01-10 12:05:25 CET [30299]: [1-1] user=,db=,app=,client= LOG: database system was interrupted; last known up at 2018-01
-10 02:50:00 CET
2018-01-10 12:05:25 CET [30299]: [2-1] user=,db=,app=,client= LOG: creating missing WAL directory "pg_xlog/archive_status"
2018-01-10 12:05:25 CET [30299]: [3-1] user=,db=,app=,client= LOG: starting archive recovery
2018-01-10 12:05:25 CET [30299]: [4-1] user=,db=,app=,client= LOG: restored log file "00000007.history" from archive
2018-01-10 12:05:25 CET [30299]: [5-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000030" from archive
2018-01-10 12:05:25 CET [30299]: [6-1] user=,db=,app=,client= LOG: redo starts at 17/30000028
2018-01-10 12:05:25 CET [30299]: [7-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000031" from archive
2018-01-10 12:05:26 CET [30299]: [8-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000032" from archive
2018-01-10 12:05:26 CET [30299]: [9-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000033" from archive
......
2018-01-10 12:05:46 CET [30299]: [81-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007B" from archiv
e
2018-01-10 12:05:48 CET [30299]: [82-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007C" from archiv
e
2018-01-10 12:05:49 CET [30299]: [83-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007D" from archiv
e
2018-01-10 12:05:50 CET [30299]: [84-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007E" from archiv
e
2018-01-10 12:05:50 CET [30299]: [85-1] user=,db=,app=,client= LOG: consistent recovery state reached at 17/7EFE3900
2018-01-10 12:05:50 CET [30297]: [3-1] user=,db=,app=,client= LOG: database system is ready to accept read only connections
2018-01-10 12:05:50 CET [30299]: [86-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007F" from archiv
e
2018-01-10 12:05:50 CET [30299]: [87-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000080" from archiv
e
cp: cannot stat `/pgsql/pdbabe/archive/000000070000001700000081': No such file or directory
2018-01-10 12:05:50 CET [30299]: [88-1] user=,db=,app=,client= LOG: redo done at 17/800875F0
2018-01-10 12:05:50 CET [30299]: [89-1] user=,db=,app=,client= LOG: last completed transaction was at log time 2018-01-10 09
:17:01.870329+01
2018-01-10 12:05:53 CET [30299]: [90-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000080" from archiv
e
cp: cannot stat `/pgsql/pdbabe/archive/00000008.history': No such file or directory
2018-01-10 12:05:53 CET [30299]: [91-1] user=,db=,app=,client= LOG: selected new timeline ID: 8
2018-01-10 12:05:53 CET [30299]: [92-1] user=,db=,app=,client= LOG: restored log file "00000007.history" from archive
2018-01-10 12:05:53 CET [30299]: [93-1] user=,db=,app=,client= LOG: archive recovery complete
2018-01-10 12:05:53 CET [30302]: [1-1] user=,db=,app=,client= LOG: checkpoint starting: end-of-recovery immediate wait
2018-01-10 12:05:57 CET [30397]: [1-1] user=[unknown],db=[unknown],app=[unknown],client=10.151.201.1 LOG: connection receive
d: host=10.151.201.1 port=51057
2018-01-10 12:05:57 CET [30397]: [2-1] user=patrol,db=pdbabe,app=[unknown],client=10.151.201.1 FATAL: no pg_hba.conf entry f
or host "10.151.201.1", user "patrol", database "pdbabe", SSL off
2018-01-10 12:06:57 CET [30410]: [1-1] user=[unknown],db=[unknown],app=[unknown],client=10.151.201.1 LOG: connection receive
d: host=10.151.201.1 port=51058
2018-01-10 12:06:57 CET [30410]: [2-1] user=patrol,db=pdbabe,app=[unknown],client=10.151.201.1 FATAL: no pg_hba.conf entry f
or host "10.151.201.1", user "patrol", database "pdbabe", SSL off
2018-01-10 12:07:15 CET [30302]: [2-1] user=,db=,app=,client= LOG: checkpoint complete: wrote 254519 buffers (99.4%); 0 tran
saction log file(s) added, 0 removed, 0 recycled; write=71.245 s, sync=10.241 s, total=81.566 s; sync files=39, longest=10.21
4 s, average=0.262 s
2018-01-10 12:07:15 CET [30299]: [94-1] user=,db=,app=,client= LOG: MultiXact member wraparound protections are now enabled
2018-01-10 12:07:15 CET [30297]: [4-1] user=,db=,app=,client= LOG: database system is ready to accept connections
2018-01-10 12:07:15 CET [30421]: [1-1] user=,db=,app=,client= LOG: autovacuum launcher started

14) After the recover the file deleted is in place:

pgpdbabe(at)elaawddbad00:/pgsql/pdbabe/admin/log> cd $PGDATA/pg_tblspc/16385/PG_9.4_201409291/16386
pgpdbabe(at)elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 16384 Jan 10 12:10 49391 <-----------
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm

15) Let's check the content of the table ... no errors raised but the table has 5 records and not 12 records (as it should have).

pgpdbabe(at)elaawddbad00:/pgsql/pdbabe/admin/log> psql
psql (9.4.9)
Type "help" for help.

pdbabe=# select order_date, substring(item, 1, 20) from test_corrupt ;
order_date | substring
-------------------------------+--------------
2018-01-01 00:07:00+01 | generic item
2018-01-01 00:08:00+01 | generic item
2018-01-01 00:09:00+01 | generic item
2018-01-01 00:10:00+01 | generic item
2018-01-10 02:57:01.106676+01 | new item
(5 rows)

In conclusion, the recover operation was successfully but the database is corrupted. The point is: when the recover find a missing file it
should exit with an error instead of going ahead and terminating succesfully. What do you think about this opinion? Am I missing something?

Thanks & Regards
Gabriele

> Il 8 gennaio 2018 alle 14.53 flumbador(at)virgilio(dot)it ha scritto:
>
>
> Hi postgresql is 9.4.9 on SUSE 11sp4 ... The backup is a filesystem backup with Networker.
> I confirm that I handle WAL archiving and backup_label file was in place when I started the recovery. The table is Persistent (Relpersistence = 'p').
> The postgresql was succesfully restored from another backup and now is up and running; i'm not able to provide further datails.
> I try to reproduce the issue on a test environment.
> I'll update the thread with further details as soon as I reproduce the issue.
> Thanks to all
> Best Regards
> Gabriele
>
> > >
> > Il 2 gennaio 2018 alle 16.38 Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> ha scritto:
> >
> > flumbador(at)virgilio(dot)it wrote:
> >
> > > > >
> > > During the restore 3 files were missing; these three files belong to a table with high transaction workload, and for sure during the backup many transactions had modified this table and those missing files. What is surprising to me is that even if the files were missing the recover phase ended successfully. I expect an error (for example file not found) raised when postgresql try to apply the wal entries related to this table and those files. After the recover I find that these three file has been created during recover but when I try to query the table I get the error:
> > >
> > > db4=# select count(*) from pgbench_accounts ;
> > >
> > > ERROR: could not read block 1999996 in file "pg_tblspc/16471/PG_9.4_201409291/16474/16593.15": read only 0 of 8192 bytes
> > >
> > > > >
> > Can you please get a backtrace from this error? I suspect that the
> > problem here is that the planner is trying to estimate cardinality by
> > descending a btree index, and failing because the index points to a part
> > of the table that has been truncated. But if the table has been
> > truncated because it's unlogged, then the index should have been
> > truncated too; and if it's a permanent table, then it has been truncated
> > improperly (because no index pointer to the truncated part should
> > remain).
> >
> > What files do exist for pg_tblspc/16471/PG_9.4_201409291/16474/16593* ?
> > (In particular I'm curious about whether 16593.14 exists and what size
> > it is)
> >
> > --
> > Álvaro Herrera https://www.2ndQuadrant.com/
> > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> >
> > >

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Rui DeSousa 2018-01-10 17:04:02 Re: Issue on restore / recover
Previous Message David G. Johnston 2018-01-10 16:13:21 Re: set not null column using existing constraint