Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)

From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, David Powers <dpowers(at)janestreet(dot)com>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-14 13:48:55
Message-ID: CADbMkNPP0xA2S+b625my3YznBcToZkCaVktTaaP4H5ZX_WLdSw@mail.gmail.com
Lists: pgsql-hackers

The file is present on the production database and the streaming replica, but
not on the snapshot.

production
-rw------- 1 postgres postgres 312778752 May 13 21:28
/database/postgres/base/16416/291498116.3

streaming replica
-rw------- 1 postgres postgres 312778752 May 13 23:50
/database/postgres/base/16416/291498116.3

Is there a way to find out what the file contains?
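(For reference, the suspect table mentioned below was identified by mapping
the relfilenode back to a relation, roughly as follows; this is a sketch and
the database name is illustrative:)

# 16416 in the path is the database OID, 291498116 the relfilenode:
psql -c "SELECT datname FROM pg_database WHERE oid = 16416;"
psql -d proddb -tc "SELECT oid::regclass FROM pg_class WHERE relfilenode = 291498116;"

# The '.3' suffix is the fourth 1 GB segment of that relation. If the
# pg_filedump tool happens to be installed, the raw pages of the segment
# can be inspected directly:
pg_filedump -i /database/postgres/base/16416/291498116.3 | head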

We just got some more information. All of the following was done on, and seen
in the logs of, the snapshot database.

After we saw this we ran a VACUUM FULL on the table we suspect is backed
by this file. This happened:

WARNING: concurrent insert in progress within table "js_equity_daily_diff"

That warning repeated over and over, so we killed the VACUUM FULL by hitting
Ctrl-C again and again. After that the logs contained:

fgrep ERROR postgresql-2013-05-14.csv | head
2013-05-14 00:23:16.028 EDT,"postgres_prod","proddb_testing",15698,"
172.24.65.103:54536",5191bc31.3d52,3,"SELECT",2013-05-14 00:23:13
EDT,2/3330,0,ERROR,58P01,"could not open file ""base/16416/291498116.3""
(target block 431006): No such file or directory",,,,,"SQL statement
""select max(equity_date) from js_equity_daily_diff""

The above was the original error that made us investigate.
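A quick sanity check on that block number, assuming the default 8 kB block
size and 1 GB segment files (131072 blocks per segment):

echo $(( 431006 / 131072 ))           # = 3, so the block lives in segment .3
echo $(( (431006 % 131072) * 8192 ))  # = 309575680, byte offset in the segment

That offset is within the 312778752 bytes the file has on production and on
the replica, so the block is not past the end of the file there; the segment
is simply missing on the snapshot.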

2013-05-14 09:21:47.121 EDT,"postgres_prod","proddb_testing",21002,"
172.27.41.24:36815",51923a67.520a,38756,"VACUUM",2013-05-14 09:21:43
EDT,2/21611,645995272,ERROR,57014,"canceling statement due to user
request",,,,,,"vacuum full js_equity_daily_diff",,,"psql"
2013-05-14 09:22:04.700 EDT,"postgres_prod","proddb_testing",21008,"
172.27.41.24:36831",51923a75.5210,115909,"VACUUM",2013-05-14 09:21:57
EDT,2/21618,645995273,ERROR,57014,"canceling statement due to user
request",,,,,,"vacuum full js_equity_daily_diff",,,"psql"

That was us cancelling the VACUUM FULL.

2013-05-14 09:22:13.947 EDT,,,30911,,51919d78.78bf,1,,2013-05-13 22:12:08
EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied ---
flushed only to 1CEE/31266090",,,,,"writing block 0 of relation
base/16416/291498116",,,,""
2013-05-14 09:22:14.964 EDT,,,30911,,51919d78.78bf,2,,2013-05-13 22:12:08
EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied ---
flushed only to 1CEE/31266090",,,,,"writing block 0 of relation
base/16416/291498116",,,,""

And after that these started appearing in the logs (they are repeated every
second now):

[root(at)nyc-dbc-001 pg_log]# fgrep ERROR postgresql-2013-05-14.csv | tail -n 2
2013-05-14 09:47:43.301 EDT,,,30911,,51919d78.78bf,3010,,2013-05-13
22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not
satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of
relation base/16416/291498116",,,,""
2013-05-14 09:47:44.317 EDT,,,30911,,51919d78.78bf,3012,,2013-05-13
22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not
satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of
relation base/16416/291498116",,,,""

There are no earlier ERRORs in the logs. The failed writes are also
accompanied by WARNINGs, for example:

2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1868,,2013-05-13
22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not
satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of
relation base/16416/291498116",,,,""
2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1869,,2013-05-13
22:12:08 EDT,,0,WARNING,58030,"could not write block 0 of
base/16416/291498116","Multiple failures --- write error might be
permanent.",,,,,,,,""

The disk is not full, nor are there any related messages in the kernel logs.
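(Verified along these lines; the commands are illustrative, with /database
being the mount point from the paths above:)

df -h /database
dmesg | tail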

Cheers,

Bene

On Tue, May 14, 2013 at 9:27 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> On 14.05.2013 14:57, Benedikt Grundmann wrote:
>
>> Today we have seen this on our testing database instance:
>>
>> ERROR: could not open file "base/16416/291498116.3" (target block
>> 431006):
>> No such file or directory
>>
>> That database gets created by rsyncing the LVM snapshot of the standby,
>> which is a read-only backup of proddb using streaming replication.
>>
>> We do not put the standby database into backup mode before rsyncing the
>> LVM snapshot, trusting Postgres's ability to recover from a crash. Now we
>> are not sure anymore... Thoughts? It's worth noting that we have been
>> using this method of taking a backup for a long, long time; only recently
>> (after 9.2 came out) did we switch to a streaming replica and start taking
>> the "frozen snapshot" backup from the replica.
>>
>> Let me know if there is more information you need, or if this is a known
>> problem.
>>
>
> That certainly should work. I'd suggest that you narrow down the problem a
> bit more. Is the file present in the original system? Anything else that
> might be related in the logs?
>
> - Heikki
>
