From: | bricklen <bricklen(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: ERROR: could not open relation base/2757655/6930168: No such file or directory -- during warm standby setup |
Date: | 2010-12-29 21:53:20 |
Message-ID: | AANLkTinLZ=9J4TV+gRo=vgnKVAsQ6gG5CpGqeVtJOa47@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Wed, Dec 29, 2010 at 12:11 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> bricklen <bricklen(at)gmail(dot)com> writes:
>> On Wed, Dec 29, 2010 at 11:35 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> What can you tell us about what was happening on the source DB while
>>> the backup was being taken?
>
>> The source db has between 1000 and 3000 transactions/s, so is
>> reasonably volatile. The two tables in question are not accessed very
>> heavily though.
>
>> Looking at the ctid and xmin between both databases, no, they don't
>> seem to match exactly. Pardon my ignorance, but would those have
>> changed due to vacuums, analyze, or any other forms of access?
>
> The difference in ctid, and the values of xmin and relfrozenxid,
> seems to confirm my suspicion that this wasn't just random cosmic rays.
> You did something on the source DB that rewrote the table with a new
> relfilenode (possibly CLUSTER or some form of ALTER TABLE; plain VACUUM
> or ANALYZE wouldn't do it). And for some reason the standby hasn't
> picked up that change in the pg_class row. I suspect the explanation
> is that your technique for setting up the standby is flawed. You can't
> just rsync and have a valid snapshot of the DB --- you need to be sure
> that enough WAL gets replayed to fix any inconsistencies arising from
> the time-extended nature of the rsync operation. But you didn't say
> exactly how you did that.
>
Definitely no CLUSTER commands were issued, and there should have been
no ALTER commands issued (yesterday was a holiday, no one was here).
Would a TRUNCATE have the same effect though? I grep'd through our
application, and it appears that at least 3 tables get truncated, one
of them several times per hour. The often-truncated table wasn't one
of the bad ones, but the others are the ones I've already identified
as non-existent.
The standby setup process was
On source server:
- start shipping WAL files to standby @ /var/lib/pgsql/WAL_Archive dir
- "select pg_start_backup('standby')"
- rsync -av -e "ssh -p 9001" --progress --partial -z
/var/lib/pgsql/data postgres(at)standby-tunnel:/var/lib/pgsql/
- "select pg_stop_backup()"
On standby
- recovery.conf file:
restore_command = 'pg_standby -d -s 2 -t /WAL_Archive/go_live
/WAL_Archive %f %p %r 2>>/var/lib/pgsql/data/pg_log/standby.log'
Note: /WAL_Archive is a symlink to /var/lib/pgsql/WAL_Archive, that
didn't get touched though, as it is on the same directory level as the
data dir.
pg_ctl -D /var/lib/pgsql/data start
touch /var/lib/pgsql/WAL_Archive/go_live
Looking over the log of the restore process, there are a couple
messages that might be helpful in pinpointing the problem area.
2010-12-29 08:07:34 PST [16589]: [2-1] (user=) (rhost=) LOG: starting
archive recovery
2010-12-29 08:07:34 PST [16589]: [3-1] (user=) (rhost=) LOG:
restore_command = 'pg_standby -d -s 2 -t /WAL_Archive/go_live
/WAL_Archive %f %p %r 2>>/var/lib/pgsql/data/pg_log/standby.log'
2010-12-29 08:07:34 PST [16589]: [4-1] (user=) (rhost=) LOG:
recovery_end_command = 'rm -f /WAL_Archive/go_live'
2010-12-29 08:07:46 PST [16589]: [5-1] (user=) (rhost=) LOG: restored
log file "0000000100001FB400000080" from archive
2010-12-29 08:07:46 PST [16589]: [6-1] (user=) (rhost=) LOG:
automatic recovery in progress
2010-12-29 08:07:46 PST [16589]: [7-1] (user=) (rhost=) LOG: restored
log file "0000000100001FB40000007B" from archive
2010-12-29 08:07:46 PST [16589]: [8-1] (user=) (rhost=) LOG: redo
starts at 1FB4/7BD68388
2010-12-29 08:07:46 PST [16589]: [9-1] (user=) (rhost=) LOG:
consistent recovery state reached
2010-12-29 08:07:46 PST [16589]: [10-1] (user=) (rhost=) LOG: file
"pg_clog/0494" doesn't exist, reading as zeroes
2010-12-29 08:07:46 PST [16589]: [11-1] (user=) (rhost=) CONTEXT:
xlog redo commit: 2010-12-28 22:35:18.152971-08
2010-12-29 08:08:21 PST [16589]: [12-1] (user=) (rhost=) LOG:
restored log file "0000000100001FB40000007C" from archive
2010-12-29 08:08:22 PST [16589]: [13-1] (user=) (rhost=) LOG:
restored log file "0000000100001FB40000007D" from archive
2010-12-29 08:08:22 PST [16589]: [14-1] (user=) (rhost=) LOG:
restored log file "0000000100001FB40000007E" from archive
... thousands of WAL logs later
2010-12-29 08:43:51 PST [16589]: [2529-1] (user=) (rhost=) LOG:
restored log file "0000000100001FBE0000005B" from archive
2010-12-29 08:43:59 PST [16589]: [2530-1] (user=) (rhost=) LOG:
restored log file "0000000100001FBE0000005C" from archive
2010-12-29 08:43:59 PST [16589]: [2531-1] (user=) (rhost=) LOG: could
not open file "pg_xlog/0000000100001FBE0000005D" (log file 8126,
segment 93): No such file or directory
2010-12-29 08:43:59 PST [16589]: [2532-1] (user=) (rhost=) LOG: redo
done at 1FBE/5CFFEA58
2010-12-29 08:43:59 PST [16589]: [2533-1] (user=) (rhost=) LOG: last
completed transaction was at log time 2010-12-29 08:44:14.784874-08
2010-12-29 08:43:59 PST [16589]: [2534-1] (user=) (rhost=) LOG:
restored log file "0000000100001FBE0000005C" from archive
2010-12-29 08:44:12 PST [16589]: [2535-1] (user=) (rhost=) LOG:
selected new timeline ID: 2
2010-12-29 08:44:24 PST [16589]: [2536-1] (user=) (rhost=) LOG:
archive recovery complete
2010-12-29 08:44:24 PST [25155]: [1-1] (user=) (rhost=) LOG:
autovacuum launcher started
2010-12-29 08:44:24 PST [16587]: [2-1] (user=) (rhost=) LOG: database
system is ready to accept connections
Do those log entries look suspicious?
From | Date | Subject | |
---|---|---|---|
Next Message | Gary Fu | 2010-12-29 22:19:40 | what's the message: CONTEXT: PL/pgSQL function "proc_sub_b" line 7 at assignment |
Previous Message | Tom Lane | 2010-12-29 20:11:01 | Re: ERROR: could not open relation base/2757655/6930168: No such file or directory -- during warm standby setup |