Re: Segmentation Fault in Postgres server when using psqlODBC

From: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To: Joshua Berry <yoberi(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, pgsql-odbc(at)postgresql(dot)org
Subject: Re: Segmentation Fault in Postgres server when using psqlODBC
Date: 2013-05-24 21:25:07
Message-ID: 519FDAB3.3010009@tpf.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-odbc

Hi,

Psqlodbc drivers send Execite requests for cursors instead of
issueing FETCH commands.

regards,
Hiroshi Inoue

(2013/05/25 1:55), Joshua Berry wrote:
> Hi Groups,
>
> I'm dealing with periodic backend process segmentation faults. I'm
> posting to both the bugs and odbc lists as it seems that my
> application's use of pgsqlODBC triggers a bug in the postgres backend.
>
> The environment is:
> Clients: win32 clients using various version of the psqlODBC driver
> (mostly using 8.04.0200). The connection string contains
> "UseDeclareFetch=1" which causes long idle transactions, heavy cursor
> and savepoint use.
> Server: Dell dual Xeon x86 48GB RAM (12GB PG shared mem) RHEL system
> with Postgresql 9.2.4. Note that these same issues occurred on PG9.1 and
> PG8.4.
>
> I've experienced these issues for over a year now. However, during that
> time several things have changed which may or may not be related:
> * The schema has been modified with heavier use of triggers to update
> manually created materialized views (ie not using the 9.3 CREATE
> MATERIALIZED VIEW).
> * The number of concurrent users has increased (from perhaps 15
> concurrent users two years ago, to 30 concurrent users now).
> * The PG version used has changed from 8.4, to 9.1, and finally to 9.2
> * I've done recent tuning to the planner cost constants in order to
> favor index scans over table scans in more situations.
>
> I've looked through past error logs and I found that I had a segfault in
> the server process while using PG 8.4 about a dozen times over a 12
> month period. Each time one client's postgres process crashes, the
> backend forcefully closes all active connections due to possiby
> corrupted memory and then restarts. This leaves all active clients
> stranded as the connection is closed, and all cursors and savepoint info
> is lost. My app doesn't recover gracefully, and users are forced to
> click through some cryptic error messages from the application framework
> used for the app (Clarion) and then restart the app.
>
> A few months ago, I upgraded to another server with PG 9.1, assuming
> that the issue with the previous server with 8.4 was due to bad RAM, as
> I did observe a high ECC error count on the previous system as logged by
> the IPMI controller. However, I very quickly had another segfault on the
> new system with PG9.1. The default settings of the OS and PG init script
> disabled core dumps, so I only started collecting core files for the
> past few months.
>
> I posted details in pgsql-general April 10, 2013; here is a link to the
> thread:
> http://www.postgresql.org/message-id/CAPmZXM03MEDEn6nqqf_Phs3M1DK-EaXP5_K-LmirneOJMAQ-Hg@mail.gmail.com
>
> The crash always is some variation of the following stack, as observed
> in both PG91 and PG92 crashes:
> (gdb) bt
> #0 ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:603
> #1 0x000000000070f372 in SearchCatCache (cache=0x27fab90, v1=<value
> optimized out>, v2=<value optimized out>, v3=<value optimized out>,
> v4=<value optimized out>) at catcache.c:1136
> #2 0x000000000071b1ae in getTypeOutputInfo (type=20,
> typOutput=0x2b3db80, typIsVarlena=0x2b3db88 "") at lsyscache.c:2482
> #3 0x000000000045d127 in printtup_prepare_info (myState=0x2810290,
> typeinfo=0x29ad7b0, numAttrs=42) at printtup.c:263
> #4 0x000000000045d4c4 in printtup (slot=0x3469650, self=0x2810290) at
> printtup.c:297
> #5 0x000000000065a76a in RunFromStore (portal=0x285fed0,
> direction=<value optimized out>, count=10, dest=0x2810290) at pquery.c:1122
> #6 0x000000000065a852 in PortalRunSelect (portal=0x285fed0,
> forward=<value optimized out>, count=10, dest=0x2810290) at pquery.c:940
> #7 0x000000000065bcf8 in PortalRun (portal=0x285fed0, count=10,
> isTopLevel=1 '\001', dest=0x2810290, altdest=0x2810290,
> completionTag=0x7fffee67f7c0 "") at pquery.c:788
> #8 0x0000000000659552 in exec_execute_message (argc=<value optimized
> out>, argv=<value optimized out>, dbname=0x2768370 "DBNAME",
> username=<value optimized out>) at postgres.c:1929
> #9 PostgresMain (argc=<value optimized out>, argv=<value optimized
> out>, dbname=0x2768370 "DBNAME", username=<value optimized out>) at
> postgres.c:4016
> #10 0x0000000000615161 in BackendRun () at postmaster.c:3614
> #11 BackendStartup () at postmaster.c:3304
> #12 ServerLoop () at postmaster.c:1367
> #13 0x0000000000617dcc in PostmasterMain (argc=<value optimized out>,
> argv=<value optimized out>) at postmaster.c:1127
> #14 0x00000000005b6830 in main (argc=5, argv=0x2766480) at main.c:199
> (gdb)
>
> Andres Freund and Tom Lane spent some time tracking down possible root
> causes with some of the following summaries:
>
> Tom Lane writes:
> |Andres Freund <andres(at)2ndquadrant(dot)com <mailto:andres(at)2ndquadrant(dot)com>>
> writes:
> |> Tom: It looks to me like printtup_prepare_info won't normally be called
> |> in an held cursor. But if some concurrent DDL changed the number of
> |> columns in typeinfo vs thaose in the the receiver that could explain the
> |> issue and why its not seen all the time, right?
> |
> |It looks to me like there are probably two triggering conditions:
> |
> |1. Client is doing a direct protocol Execute on a held-cursor portal.
> |
> |2. Cache flush occurs to drop the syscache entries needed by
> |getTypeOutputInfo. (Otherwise, they'd still be around from when the
> |held cursor was created.)
> |
> |The first of these explains why we don't see it in the
> |CLOBBER_CACHE_ALWAYS buildfarm runs, and the second one explains why
> |Joshua is only seeing it intermittently and not every darn time his
> |application does that.
>
> Andres Freund writes:
> |On 2013-04-10 19:06:12 -0400, Tom Lane wrote:
> |> I wrote:
> |> > (Wanders away wondering just how much the regression tests exercise
> |> > holdable cursors.)
> |>
> |> And the answer is they're not testing this code path at all, because if
> |> you do
> |> DECLARE c CURSOR WITH HOLD FOR ...
> |> FETCH ALL FROM c;
> |> then the second query executes with a portal (and resource owner)
> |> created to execute the FETCH command, not directly on the held portal.
> |>
> |> After a little bit of thought I'm not sure it's even possible to
> |> reproduce this problem with libpq, because it doesn't expose any way to
> |> issue a bare protocol Execute command against a pre-existing portal.
> |> (I had thought psqlOBC went through libpq, but maybe it's playing some
> |> games here.)
> |
> |Hm. PQexecPrepared() looks like it can do that.
>
> I didn't find that any bugs were officially reported/documented, so I
> wanted to followup with more detail and follow the guidelines here:
> http://www.postgresql.org/docs/current/static/bug-reporting.html
> Specifically, I decided to update to the latest stable release,
> currently PG9.2.4.
>
> After updating to PG92 three days ago, I've had two of the same
> segfaults. Either I'm very (un)lucky or the intermittant issue is
> occurring more frequently with PG92.
>
> On a test machine running a snapshort of the production database, and
> with the backend configured to log all queries, I have an example of the
> type of traffic that client apps will perform:
> https://dl.dropboxusercontent.com/u/4715250/pg92_psqlODBC_08.04.0200.txt
> In the above logfile, the logfile was cleared, the client app was
> started, A window was opened, it was closed, and the app was closed. I
> installed pgODBC version 09.01.0200, and aside from cursor names and
> timestamps differing, all other lines were identical.
>
> I do not currently have a testing environment which can emulate the
> production environment enough to trigger this bug. I've never had this
> crash outside of the production environment.
>
> Any thoughts/suggestions?
>
> Regards,
> -Joshua

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Simon Riggs 2013-05-26 08:31:56 COPY .... (FORMAT binary) syntax doesn't work
Previous Message Joshua Berry 2013-05-24 16:55:19 Segmentation Fault in Postgres server when using psqlODBC

Browse pgsql-odbc by date

  From Date Subject
Next Message ter062424 2013-05-24 21:42:54 Re: ODBC SELECT Timeout
Previous Message Hiroshi Inoue 2013-05-24 21:13:23 Re: ODBC SELECT Timeout