From: Guillaume Lelarge <guillaume(at)lelarge(dot)info>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Dennis Kögel <dk(at)neveragain(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
Subject: Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
Date: 2014-12-31 07:44:14
Message-ID: CAECtzeWhC2-2ppnR3W1dWawNrnizdMTgLOYCj6Yb6DajCbsk3A@mail.gmail.com
Lists: pgsql-hackers
2014-12-12 14:58 GMT+01:00 Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>:
> On 12/10/2014 04:32 PM, Dennis Kögel wrote:
>
>> Hi,
>>
>> On 04.09.2014 at 17:50, Jehan-Guillaume de Rorthais <
>> jgdr(at)dalibo(dot)com> wrote:
>>
>>> For a few months now, we have occasionally seen .ready files appearing
>>> on some slave
>>> instances in various contexts. The two I have in mind are under 9.2.x.
>>> […]
>>> So it seems that, for some reason, these old WALs were "forgotten" by the
>>> restartpoint mechanism when they should have been recycled/deleted.
>>>
>>
>> On 08.10.2014 at 11:54, Heikki Linnakangas <
>> hlinnakangas(at)vmware(dot)com> wrote:
>>
>>> 1. Where do the FF files come from? In 9.2, FF-segments are not supposed
>>> to be created, ever. […]
>>> 2. Why are the .done files sometimes not being created?
>>>
>>
>>
>>
>> We’ve encountered behaviour which seems to match what has been described
>> here: On Streaming Replication slaves, there is an odd piling up of old
>> WALs and .ready files in pg_xlog, going back several months.
>>
>> The fine people on IRC have pointed me to this thread, and have
>> encouraged me to revive it with our observations, so here we go:
>>
>> Environment:
>>
>> Master, 9.2.9
>> |- Slave S1, 9.2.9, on the same network as the master
>> '- Slave S2, 9.2.9, some 100 km away (occasional network hiccups; *not*
>> a cascading replication)
>>
>> wal_keep_segments M=100 S1=100 S2=30
>> checkpoint_segments M=100 S1=30 S2=30
>> wal_level hot_standby (all)
>> archive_mode on (all)
>> archive_command on both slaves: /bin/true
>> archive_timeout 600s (all)
>>
>>
>> - On both slaves, we have "ghost" WALs and corresponding .ready files
>> (currently >600 of each on S2, slowly becoming a disk space problem)
>>
>> - There are always gaps in the ghost WAL names, often of roughly 0x20,
>> but not always
>>
>> - The slave with the "bad" network link has significantly more of these
>> files, which suggests that disturbances of the Streaming Replication
>> increase chances of triggering this bug; OTOH, the presence of a name gap
>> pattern suggests the opposite
>>
>> - We observe files named *FF as well
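The naming gaps and the anomalous *FF files described above can be illustrated with a short sketch. (The segment names below are fabricated for illustration; the sketch assumes 16 MB segments and the pre-9.3 naming scheme, in which each 4 GB "log" file holds only segments 00..FE, so a *FF name should never exist under 9.2.)

```python
# Sketch of the pre-9.3 segment-name arithmetic (names fabricated).
SEGS_PER_LOG = 0xFF  # 255 usable segments (00..FE) per log file in 9.2

def ordinal(name):
    # A 9.2 segment name is TIMELINE(8 hex) + LOG(8 hex) + SEG(8 hex).
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return log * SEGS_PER_LOG + seg

names = sorted([
    "000000020000000600000089",
    "0000000200000006000000A9",
    "0000000200000006000000FF",  # anomalous: FF is skipped before 9.3
])
# Report the gap between consecutive segments; a healthy, gap-free
# sequence would print a gap of 1 for every pair.
for prev, cur in zip(names, names[1:]):
    print(cur, "gap:", ordinal(cur) - ordinal(prev))
```

With these fabricated names, the first gap comes out to 0x20 (32 segments), matching the rough spacing observed in the listings.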
>>
>>
>> As you can see in the directory listings below, this setup is *very* low
>> traffic, which may explain the pattern in WAL name gaps (?).
>>
>> I’ve listed the entries by time, expecting to easily match WALs to their
>> .ready files.
>> There is sometimes an interesting delay between a WAL's mtime and its
>> .ready file, especially for *FF, where several days can pass between the
>> WAL and the .ready file.
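Matching segments to their archive markers can be scripted. Below is a minimal, self-contained sketch; the directory layout mimics pg_xlog/archive_status, but the paths and file names are fabricated for illustration (on a real 9.2 server, archive_status lives inside pg_xlog under the data directory).

```python
import os
import tempfile

# Build a fake pg_xlog with one archived and one stuck segment.
tmp = tempfile.mkdtemp()
status = os.path.join(tmp, "pg_xlog", "archive_status")
os.makedirs(status)
for seg in ("0000000200000000000000FE", "0000000200000000000000FF"):
    open(os.path.join(tmp, "pg_xlog", seg), "w").close()
# FE was archived (has a .done marker); FF is stuck with a .ready marker.
open(os.path.join(status, "0000000200000000000000FE.done"), "w").close()
open(os.path.join(status, "0000000200000000000000FF.ready"), "w").close()

# Segments the archiver still considers pending are exactly those
# with a lingering .ready marker.
pending = sorted(f[:-len(".ready")] for f in os.listdir(status)
                 if f.endswith(".ready"))
print(pending)  # ['0000000200000000000000FF']
```

Run against a real pg_xlog directory, the same scan would enumerate the "ghost" segments whose .ready files are piling up.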
>>
>> - Master: http://pgsql.privatepaste.com/52ad612dfb
>> - Slave S1: http://pgsql.privatepaste.com/58b4f3bb10
>> - Slave S2: http://pgsql.privatepaste.com/a693a8d7f4
>>
>>
>> I’ve only skimmed through the thread; my understanding is that there were
>> several patches floating around, but nothing was committed.
>> If there’s any way I can help, please let me know.
>>
>
> Yeah. It wasn't totally clear how all this should work, so I got
> distracted with other stuff and dropped the ball; sorry.
>
> I'm thinking that we should change the behaviour on master so that the
> standby never archives any files from older timelines, only the new one
> that it generates itself. That will solve the immediate problem of old WAL
> files accumulating, and bogus .ready files appearing in the standby.
> However, it will not solve the bigger problem of how to ensure that all
> WAL files are archived when you promote a standby server. There is no
> guarantee of that today anyway, but this will make it even less reliable,
> because it will increase the chances that you miss a file from the old
> timeline in the archive after promoting. I'd argue that that's a good
> thing; it makes the issue more obvious, so you are more likely to encounter
> it in testing, and you won't be surprised in an emergency. But I've started
> a new thread on that bigger issue, hopefully we'll come up with a solution (
> http://www.postgresql.org/message-id/548AF1CB.80702@vmware.com)
>
> Now, what do we do with the back-branches? I'm not sure. Changing the
> behaviour in back-branches could cause nasty surprises. Perhaps it's best
> to just leave it as it is, even though it's buggy.
>
>
As long as master is fixed, I don't actually care. But I agree with Dennis
that it's hard to see what has been committed for all the different issues
found and, if any commits were made, in which branch. I'd like to be able
to tell my customers: update to this minor release to see if it's fixed,
but I can't even do that.
--
Guillaume.
http://blog.guillaume.lelarge.info
http://www.dalibo.com