Re: Mysterious DB reset

From: Israel Brewster <israel(at)eraalaska(dot)net>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Mysterious DB reset
Date: 2014-03-06 18:43:32
Message-ID: 84AF018A-5F37-4314-877A-BA0CE114FEE9@eraalaska.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mar 6, 2014, at 9:03 AM, Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> wrote:

> On 03/06/2014 09:33 AM, Israel Brewster wrote:
>> For starters, this happened again this morning (no data prior to 4:45 am and sequence reset), so whatever is going on appears to be reoccurring. Also, I forgot to mention if it is significant: this is running on slackware liunux 14.0
>>
>> On Mar 5, 2014, at 1:00 PM, Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> wrote:
>>
>>> On 03/05/2014 10:22 AM, Israel Brewster wrote:
>>>> My first thought was "Oh, I must have a typo in my
>>>> cleanup routine, such that it is deleting all records rather than only
>>>> those a week old, and it's just that no one has noticed until now". So I
>>>> looked at that, but changing the delete to a select appeared to produce
>>>> the proper results, in that no records were selected:
>>>
>>> Well it would, if the records only go back to 4 AM this morning. In other words if no records exist before 4 AM today, no records exist before 7 days ago also or am I missing something?
>>
>> If the delete is correct, you are absolutely right. My first theory, however, was that I made a typo, and the delete was deleting ALL records, in which case changing it to a select would select all records. As it did not, that seems to confirm the delete is correct, and therefore not the problem.
>>
>>> A sequence is just a special table.
>>>
>>> So what does SELECT * from the sequence show?
>>
>> tracking=> SELECT * FROM data_id_seq;
>> sequence_name | last_value | start_value | increment_by | max_value | min_value | cache_value | log_cnt | is_cycled | is_called
>> ---------------+------------+-------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
>> data_id_seq | 1184 | 1 | 1 | 9223372036854775807 | 1 | 1 | 16 | f | t
>>
>>
>>>
>>>>
>>>> Also odd is that my cleanup script runs at 1am. I have records of there
>>>> being new data in the database up to 3:51am, but the oldest record
>>>> currently in the DB is from 4:45am (as specified by the default of now()
>>>> on the column). So I know records were added after my delete command
>>>> ran, but before this reset occurred.
>>>
>>> I am not sure what you are calling the 'reset'?
>>> Did something happen between 3:51 AM and 4:45 AM?
>>
>> Yes: All my data was deleted and the sequence reset to 1.
>>
>>> Also not sure why you call the 4:45 AM record the oldest, when you say you can identify records from 3:51 AM?
>>
>> As I mentioned, I archive the records to permanent storage. This archive process happens every hour (for various reasons). That is how I know we had records for 3:51 am: they exist in the permanent archive. However, they don't exist in the local database any more.
>
> Well something is happening. See my notes on logging below to help track down the cause.

Yep.

>
>>
>>>
>>>>
>>>> So my question is, aside from someone going in and mucking about in the
>>>> wee hours of the morning, what could possibly cause this behavior? What
>>>> sort of event could cause all data to be deleted from the table, and the
>>>> sequence to be reset? Especially while there is an active connection?
>>>> Thanks for any ideas, however wild or off the wall :-)
>>>
>>> What is in the Postgres/system logs for the time period(s) you mention?
>>
>> The postgres log has a lot of errors in it, some of which MAY explain the issue. For example:
>>
>> cp: cannot create regular file '/mnt/pglogs/000000010000000400000094': Permission denied
>> LOG: archive command failed with exit code 1
>> DETAIL: The failed archive command was: test ! -f /mnt/pglogs/000000010000000400000094 && cp pg_xlog/000000010000000400000094 /mnt/pglogs/000000010000000400000094
>> WARNING: transaction log file "000000010000000400000094" could not be archived: too many failures
>> LOG: received smart shutdown request
>> LOG: autovacuum launcher shutting down
>> LOG: shutting down
>> LOG: database system is shut down
>
> Might be good to explain your archive setup.

Ok, here goes: We have the primary system which receives the data and handles all requests for said data. There is also a hot standby server keep in sync with streaming replication. The WALs are archived to a NFS share on this machine.

Once an hour a python script runs that a) Selects all unsynced records from the postgresql db, b) stores a subset of them in our permanent archive, and c) marks the previously selected records as synced (UPDATE data SET syncd=true WHERE id in (...) )

Additionally, I have a) a script that runs at 8:00pm every evening that uses pg_dump to dump the contents of the database to a backup file, and b) a script that runs at 8:00 each morning that rsync's various config files and scripts (such as my data retrieval scripts) from the primary machine to a backup location on the secondary machine.

None of the scripts run anywhere near the apparent 4:40ish cutoff time for my data

Make sense? Probably not the best setup, but then that's what happens when you figure out stuff for yourself rather than having formal training :-) I'm DEFINITELY open to suggestions :-)

>
>>
>> However, there are no timestamps on any of the entries (can I fix that?), so I don't know if those are current entries, or from back before I got the mount for the logs working. At this time, the mount point IS working correctly, and from what I can tell so is the archive command. The latest entry is from yesterday (modify date on the log shows Mar 5, 9:21, when I was messing with it yesterday), however, so there are no entries from this morning when it happened again. I don't see anything of interest in the syslog or messages log.
>
> Yes you can, timestamps and a lot more. For all the details go here:
>
> http://www.postgresql.org/docs/9.2/interactive/runtime-config-logging.html
>
> At the least I would:
>
> log_statement = 'mod'
>
> log_connections = on
> log_disconnections = on
>
> log_line_prefix = '%u-%m-%x'

I'll get those in the config, and we'll see what happens tomorrow morning. Hopefully that will give more information. Thanks for the link and information!

>
>>
>>>
>>>>
>>>> -----------------------------------------------
>>>> Israel Brewster
>>>> Computer Support Technician II
>>>> Era Alaska
>>>> 5245 Airport Industrial Rd
>>>> Fairbanks, AK 99709
>>>> (907) 450-7250 x7293
>>>> -----------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Adrian Klaver
>>> adrian(dot)klaver(at)aklaver(dot)com

-----------------------------------------------
Israel Brewster
Computer Support Technician II
Era Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7250 x7293
-----------------------------------------------

Attachment Content-Type Size
Israel Brewster.vcf text/directory 417 bytes
unknown_filename text/plain 2 bytes

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2014-03-06 19:03:20 Re: Mysterious DB reset
Previous Message Adrian Klaver 2014-03-06 18:03:06 Re: Mysterious DB reset