Re: hidden junk files in ...data/base/oid/

From: Andrej Vanek <andrej(dot)vanek(dot)sk(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: hidden junk files in ...data/base/oid/
Date: 2014-05-28 13:04:41
Message-ID: CAFNFRyGu=dEVoyHWsjVKxbR8xbO6ussOMzPafR1dd=h2fGyBbw@mail.gmail.com
Lists: pgsql-general

Hello,

thanks for your answer.

I've identified the problems in my cluster agent script. It is a custom-written
script with built-in automated recovery of a failed slave, written back when the
PostgreSQL 9.1 streaming replication feature was still in beta and no resource
agent supporting streaming replication was available yet.
The first problem was that recovery of a failed slave was hardcoded into the
start operation. Pacemaker aborted this start operation when it hit the startup
timeout, which happened before the backup from the master to the failed slave
had finished (with a bigger database). At that point rsync could be killed,
leaving its temporary files behind. Because there was no cleanup before
re-running the backup from the master (using rsync), those leftover rsync
temporary files accumulated.
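A minimal sketch of the missing cleanup step (the file pattern matches rsync's hidden temporary names; the function name and paths are my own, not the agent's actual code):

```shell
# Hypothetical cleanup sketch: an aborted rsync leaves hidden temp files
# named .<name>.<6 random chars>; a second aborted rsync pass over those
# leaves second-order files ..<name>.<rand>.<rand>. Both match this glob.
cleanup_rsync_junk() {
    datadir="$1"
    # delete only hidden files ending in a 6-character random suffix
    find "$datadir" -type f -name '.*.??????' -print -delete
}

# usage (path is an assumption, adjust to your PGDATA):
# cleanup_rsync_junk /var/lib/pgsql/9.1/data/base
```

Running this before each re-attempted base backup would have prevented the junk files from piling up.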
The second problem is exactly what you describe: copying stuff in one direction
first, then failing over, then copying in the opposite direction. This happened
because my agent was missing the lock file that the standard clusterlabs pgsql
agent uses to avoid starting a failed master after a double failure followed by
a reboot.
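The lock-file idea can be sketched like this (function names and the lock path are assumptions for illustration, not the actual clusterlabs agent code):

```shell
# Sketch of the lock-file safeguard: a node promoted to master drops a
# lock file; after a crash the file survives, so the node refuses to
# start automatically with possibly-diverged data until an operator
# inspects it and removes the file.
LOCKFILE="${LOCKFILE:-/var/lib/pgsql/tmp/PGSQL.lock}"

pgsql_pre_start_check() {
    if [ -f "$LOCKFILE" ]; then
        echo "lock file $LOCKFILE exists: node was master, refusing to start" >&2
        return 1
    fi
    return 0
}

pgsql_mark_master() {
    # called on promotion to master
    touch "$LOCKFILE"
}
```

Without this guard, a rebooted old master can come back up and be resynced in the wrong direction, which matches the second-order temp files observed.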

Now I'm migrating to the standard Pacemaker pgsql resource agent provided by
clusterlabs.org to avoid such issues. It is surely much better tested, with
plenty of installations worldwide and community feedback.

In addition, I need to automate recovery from a single failure (master or
slave) as much as possible. For this purpose I plan to introduce a new resource
on top of the pgsql resource that would recover a failed pgsql slave (or
master) as long as a master is active on the other node (I use a two-node
cluster only). Manual recovery by an operator would still be required when
postgres is down on both nodes, to avoid accidental data loss.
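The decision rule I have in mind could be condensed to something like this (a sketch of the policy only; the function and state names are hypothetical, not an existing agent's API):

```shell
# Recovery policy sketch: auto-recover a down node only when the peer
# is a confirmed live master to resync from; in every other situation
# (notably both nodes down) defer to a human operator.
auto_recover_decision() {
    local_state="$1"   # postgres on this node: up|down
    peer_state="$2"    # postgres on the peer: master|slave|down
    if [ "$local_state" = "down" ] && [ "$peer_state" = "master" ]; then
        echo "recover"   # safe: rebuild this node from the live master
    else
        echo "manual"    # ambiguous or double failure: operator decides
    fi
}
```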
Do you know whether such a cluster agent is already available?

Best Regards, Andrej

2014-05-27 16:09 GMT+02:00 Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>:

> Andrej Vanek wrote:
> > Hello,
> >
> > solved.
> > This is not a postgres issue.
> >
> > The system was used in HA-cluster with streaming replications.
> > The hidden files I asked for were created probably by broken (killed)
> > rsync. It uses such file-format for temporary files used during copying.
> >
> > This rsync is used by master to slave database synchronization (full
> > on-line backup of master database to slave node) before starting postgres
> > in hot-standby mode on slave the node...
>
> You not only have leftover first-order rsync temp files (.NNNNN.uvwxyz)
> -- but also when those temp files were being copied over by another
> rsync run, which created temp files for the first-order temp files,
> leaving you with second-order temp files (..NNNNN.uvwxyz.opqrst). Not
> nice. I wonder if this is anywhere near sanity -- it looks like you're
> copying stuff from one direction first, then failed over, then copied in
> the opposite direction. I would have your setup reviewed real closely,
> to avoid data-corrupting configuration mistakes. I have seen people
> make subtle mistakes in their configuration, causing their whole HA
> setups to be completely broken.
>
> --
> Álvaro Herrera http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
