how to check that recovery is complete

From: Dmitry O Litvintsev <litvinse(at)fnal(dot)gov>
To: "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: how to check that recovery is complete
Date: 2020-11-05 19:03:54
Message-ID: SA0PR09MB71613DE760DBADC8712FDF17B9EE0@SA0PR09MB7161.namprd09.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

I have a workflow where I recover from PITR backup and run a query on it. The program that runs query
checks that it can connect to database in a loop, until it can, and then runs the query.
This has worked fine far. Recently I upgraded to 11 and I see that I can connect to DB while recovery is
not complete yet. See this:

< 2020-11-05 03:34:36.114 CST >LOG: starting archive recovery
< 2020-11-05 03:34:36.590 CST >LOG: restored log file "00000001000002EF000000F9" from archive
< 2020-11-05 03:34:36.641 CST >LOG: redo starts at 2EF/F9000028
...
< 2020-11-05 03:34:46.392 CST >LOG: restored log file "00000001000002F000000008" from archive
< 2020-11-05 03:34:46.658 CST 127.0.0.1 >FATAL: the database system is starting up
< 2020-11-05 03:34:47.028 CST >LOG: restored log file "00000001000002F000000009" from archive

You can see above fail to connect , but sometime into recover I see;

< 2020-11-05 04:07:51.987 CST >LOG: restored log file "00000001000002F200000029" from archive
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >ERROR: canceling statement due to conflict with recovery
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >DETAIL: User query might have needed to see row versions that must be removed.
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >STATEMENT: select count(*) from file
< 2020-11-05 04:08:23.195 CST >FATAL: terminating connection due to administrator command
< 2020-11-05 04:08:23.195 CST >STATEMENT: select count(*) from file
< 2020-11-05 04:08:23.195 CST >FATAL: terminating connection due to administrator command
< 2020-11-05 04:08:23.195 CST >STATEMENT: select count(*) from file
< 2020-11-05 04:08:23.232 CST >LOG: background worker "parallel worker" (PID 13577) exited with exit code 1
< 2020-11-05 04:08:23.244 CST >LOG: background worker "parallel worker" (PID 13578) exited with exit code 1
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >FATAL: terminating connection due to conflict with recovery
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >DETAIL: User query might have needed to see row versions that must be removed.
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >HINT: In a moment you should be able to reconnect to the database and repeat your command.
< 2020-11-05 04:08:25.354 CST >LOG: restored log file "00000001000002F20000002A" from archive
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >ERROR: canceling statement due to conflict with recovery
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >DETAIL: User query might have needed to see row versions that must be removed.
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >STATEMENT: select count(*) from file
< 2020-11-05 04:08:55.556 CST >FATAL: terminating connection due to administrator command
< 2020-11-05 04:08:55.556 CST >STATEMENT: select count(*) from file
< 2020-11-05 04:08:55.561 CST >FATAL: terminating connection due to administrator command
< 2020-11-05 04:08:55.561 CST >STATEMENT: select count(*) from file
< 2020-11-05 04:08:55.640 CST >LOG: background worker "parallel worker" (PID 13683) exited with exit code 1
< 2020-11-05 04:08:55.653 CST >LOG: background worker "parallel worker" (PID 13684) exited with exit code 1
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >FATAL: terminating connection due to conflict with recovery
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >DETAIL: User query might have needed to see row versions that must be removed.
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >HINT: In a moment you should be able to reconnect to the database and repeat your command.
< 2020-11-05 04:09:00.307 CST >LOG: restored log file "00000001000002F20000002B" from archive

As you can see a query "select count(*) from file" failed due to table not being restored yet. BUT connection was allowed
before DB was ready.... Only few hours after ;

< 2020-11-05 09:31:30.319 CST >LOG: archive recovery complete
< 2020-11-05 09:34:51.729 CST >LOG: database system is ready to accept connections

After which the query runs fine without errors. This is bad because I see that select count(*) takes progressively
longer to execute and the count(*) is not what I am interested in. I run the "real" query after that. As a result I add
hours to program execution time.

Is there a more robust method to it? Ideally I do not want to be able to connect to db until :

< 2020-11-05 09:31:30.319 CST >LOG: archive recovery complete
< 2020-11-05 09:34:51.729 CST >LOG: database system is ready to accept connections

And I believe this was the behavior before upgrade. If connection can't be disabled, how can I detect
condition "database system is ready to accept connections"

I believe "pg_isready" utility would succeed once it can connect. And as can see I could
connect way before DB is really ready.

Thanks!
Dmitry

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Gabi Draghici 2020-11-05 21:07:12 pgagent
Previous Message Adrian Klaver 2020-11-05 16:57:53 Re: Fwd: JSONB order?