From: | Daniele Posenato <daniele(dot)posenato(at)smartec(dot)ch> |
---|---|
To: | Francisco Olarte <folarte(at)peoplecall(dot)com> |
Cc: | "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org> |
Subject: | Re: BUG #12785: server process (PID 2872) was terminated by exception 0xC0000005 |
Date: | 2015-02-27 12:27:11 |
Message-ID: | 0456786ECC6C234BBBC1465DC197F37B753D2DF8@VHC-EX01.roctest.lan |
Lists: | pgsql-bugs |
Hi Francisco,
This email is just to inform you that I think we have discovered the source of the issue. The problem seems to be related to the hard disk: even though the server was new, the disk started to show some anomalies. For example, we now have a lot of log entries saying that the disk “needs to be checked for consistency” and “the file system structure on the disk is corrupted”. Also, PostgreSQL can no longer be started.
So you were right, it was a hardware problem.
Thanks again for the support and the time you spent on it; I really appreciate it.
Best regards
Daniele
From: Francisco Olarte [mailto:folarte(at)peoplecall(dot)com]
Sent: Monday, 23 February, 2015 6:56 PM
To: Daniele Posenato
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #12785: server process (PID 2872) was terminated by exception 0xC0000005
Hi Daniele:
On Mon, Feb 23, 2015 at 6:09 PM, Daniele Posenato <daniele(dot)posenato(at)smartec(dot)ch> wrote:
Thank you a lot for the answer, I really appreciate it. I will try to do what you have suggested and then I will let you know.
That's ok, but I doubt I can help you more ( I abandoned Windows more than a dozen years ago, haven't looked back, although I still remember how that code appeared when I did something wrong in my programs ).
Just for information, the problem has occurred again since the last email, and always on the same query. I could understand the service crashing while performing an update or a delete, but I have some difficulty understanding it on a select. If it were a hardware problem I would expect the service to crash on other actions too, and not randomly (about once per week) only on a specific select (which is executed every 10 seconds).
Is that query consuming a lot of your resources? ( It may be due to it being lengthy or just frequent ) because in that case it makes sense.
In many applications I have, 99.9% of the work / RAM usage is selects, so a random crash is normally going to hit me in one of these.
On the crashing-on-select question: suppose you have a faulty sector or RAM location. When you write to it ( update or delete ) nothing happens, it just stores the bad value; when you read it ( select, part 1, reading from disk/RAM ) nothing happens, you just get bad data, say a null pointer; then when you use it ( select, part 2 ) you get the fault. In fact, if a RAM location loses the data written to it, you do not notice on writing it, or on reading it ( unless you get a parity error ), but on using what you read from it.
This is a normal pattern in programming bugs too. You have an error in some code that stores something in a random ( or not so random ) RAM location. That code seems to work OK. But then an unrelated piece of code reads the corrupted data and crashes ( this is one of the ways buffer overflows work: the guilty code overflows a buffer, but works, and another chunk of code gets its data overwritten and crashes ).
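The write / read / use pattern described above can be illustrated with a small simulation. This is only a sketch in Python, not anything PostgreSQL does internally; the names `FaultyStore`, `bad_slot` and `use` are hypothetical and exist only for this example. One slot silently loses what is written to it, and the error only appears when the value read back is actually used.

```python
# Illustrative simulation only; FaultyStore, bad_slot and use() are
# hypothetical names, not part of PostgreSQL or any library.


class FaultyStore:
    """A tiny 'memory' in which one slot silently loses what is written."""

    def __init__(self, size, bad_slot):
        self.cells = [0] * size
        self.bad_slot = bad_slot

    def write(self, slot, value):
        # The write never fails; the bad slot just keeps garbage (None here).
        self.cells[slot] = None if slot == self.bad_slot else value

    def read(self, slot):
        # The read never fails either; it returns whatever is stored.
        return self.cells[slot]


def use(rows, row_index):
    # Only using the value (here, as a list index) exposes the corruption.
    return rows[row_index]


rows = ["alpha", "beta", "gamma"]
store = FaultyStore(size=4, bad_slot=2)

store.write(1, 0)    # healthy slot
store.write(2, 1)    # faulty slot: no error is reported
good = store.read(1)
bad = store.read(2)  # still no error, just bad data

print(use(rows, good))
try:
    use(rows, bad)
except TypeError as exc:
    print("fault surfaces on use:", exc)
```

In a C program the equivalent of the `TypeError` would be an access violation such as 0xC0000005, raised far away from the code or hardware that produced the bad value.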
Is there a way to write a select that is able to crash the service?
With a good database, on good hardware, with adequate ( infinite, as you can crash any service by just joining enough copies of a table to exhaust what is available ) memory and disk there shouldn't be, but if you read corrupted data or get hit by a bit flip in the middle of processing, it may.
Are you able to do a full database dump ( pg_dump, not base backup ) of your database? If you are, then you are able to read all the tables, and I would suggest trying to reindex every table if you have quiescent periods ( pg_dump does not touch indexes, so if you have good data but corrupted indexes that should fix it ).
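As a rough sketch of the dump-then-reindex check suggested above, the two commands could be assembled like this. The database name `mydb` and output file `mydb.dump` are placeholders, not values from this thread, and connection options (host, user, port) are left out. `pg_dump` and `reindexdb` are the standard PostgreSQL client programs; the helper functions are hypothetical.

```python
import shlex
import subprocess


def dump_command(dbname, outfile):
    # Logical dump: reads every table row, so unreadable table data
    # should show up as an error here.
    return ["pg_dump", "--format=custom", f"--file={outfile}", dbname]


def reindex_command(dbname):
    # Rebuilds every index in the database from the table data.
    return ["reindexdb", dbname]


def run_checks(dbname, outfile):
    # Stops at the first failing command; reindex only after a clean dump.
    for cmd in (dump_command(dbname, outfile), reindex_command(dbname)):
        subprocess.run(cmd, check=True)


# Placeholder names; print the commands rather than running them.
for cmd in (dump_command("mydb", "mydb.dump"), reindex_command("mydb")):
    print(shlex.join(cmd))
```

Calling `run_checks("mydb", "mydb.dump")` would execute both steps. Reindexing takes locks on the tables involved, which is why a quiescent period is suggested.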
I will let you know the results of the hardware check after the planned restart.
I do not know ( or remember ) what your DB sizes and uptime requirements are. But I've had that kind of problem caused by corrupted disk structures, and have been able to recover by rewriting the database, that is: dump, drop, restore. This depends on the system, so I cannot recommend doing it outright, but as I said before, if I had the same application on 4 machines and it crashed randomly on only one of them, I would triple-test that machine and dump / restore it.
Best regards.
Francisco Olarte.