Re: Issue with WAL logs temporary not replaying

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Eugene Pirogov <iamexile(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Issue with WAL logs temporary not replaying
Date: 2018-01-31 19:29:20
Message-ID: e35c3336-d0c4-c11a-079d-84a22100dbcd@aklaver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 01/31/2018 09:56 AM, Eugene Pirogov wrote:
> Hi,
>
> We've experienced a DB incident related to WAL. The issue is now
> somewhat resolved, and I'm trying to understand what could have caused
> the problem to occur in the first place. Below is a full chronicle of an
> incident we've ran into.
>
> At some point we realized almost no new data was coming in the DB during
> the day. We quickly looked and realized the following:
>
> * our application was still writing data, unaware that there any
> issues on the DB end, like if nothing happened at all,
> * the database was accepting connections (we were able to connect and
> query),
> * the database was writing regular text logs,
> * the database was not writing new data to the tables,
> * the logical DB replica actually had the expected records replicated
> and present in the replica tables,
> * so we made assumption original source WAL data existed and was
> correct (given the fact logical DB is relying on WAL).
>
> We tried to restart the DB, but it would not come up. Instead, a few
> core dumps in the postgres data folder were generated during our attempts.
>
> This is what we saw in the postgres logs:
>
> LOG:  invalid resource manager ID 112 at 3/89005B10
> LOG:  invalid primary checkpoint record
> LOG:  invalid resource manager ID 138 at 3/89005208
> LOG:  invalid secondary checkpoint record
> PANIC:  could not locate a valid checkpoint record
>
>
> To aid this we used pg_resetxlog. After running the pg_resetxlog
> command, the DB successfully started and slowly began recovering the
> data from WAL, automatically. During the recovery process, we ran VACUUM
> ANALYSE couple of time, which seems to have somehow aided the speed of
> recovery process (though I could have been a confirmation bias).
>
> The DB was fully recovered in 30 minutes.
>
> Upon closer inspection, however, we realized we've lost two records from
> a particular table. The records were replicated to logical replica
> though, and further inspection showed their insertion timestamp sometime
> during the day, long after the database supposedly stopped applying new
> data from WAL to tables.
>
> We don't know the reason why these two records were gone (no DELETE
> statements ever happen in the table in question). The records in
> question existed in replica table, luckly.
>
> The logs for the entire day when the incident happened are available
> here
> <https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/5ab86c10332f9aef01abf16cb4989334bc7b02a2/postgresql.log> (1.6
> Mb). Each log line was prepended with a timestamp. Also, we have the
> core dumps that were generated during attempts to restart the database.

What the role of the user 'db'?

Any idea what is causing this?:

WARNING: concurrent insert in progress within table "audit_log"
...repeated... many times

WARNING: concurrent insert in progress within table
"declarations_status_hstr"
...repeated many times

Are either of these tables the one that had the two missing records?

>
> My questions are:
>
> * what would be some typical reasons for primary/secondary WAL
> checkpoints to fail? Disk issues? File system issues? OS issues? I'm
> running out of ideas here.
>
> Is there any way I could separate primary and secondary checkpoints
> in space, so to increase postgres's chances to restore one of them?
>
> * how come postgresql would stop applying WAL logs to tables, and yet
> keep silence? There was no indication WAL is not being applied,
> postgres log would be silent (even though it kept writing other
> commodity messages, such as "incomplete startup packet" and
> malformed query attempts).
>
> Due to this, we noticed the issue only hours after postgres
> supposedly stopped applying the WAL logs to tables.
>
> * Is there any other log (like some lower level system log?) where I
> could see the status of writing / reading / applying WAL segments?
>
> * What measures could be done to prevent this in future, or at least
> make the situation easier to recover from (e.g. without data loss)?
>
> Thank you in advance.
>
> --
> Eugene
>
> http://www.gmile.me

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Jan Wieck 2018-01-31 20:51:50 Re: ERROR: invalid memory alloc request size 1073741824
Previous Message David G. Johnston 2018-01-31 18:58:25 Re: Information on savepoint requirement within transctions