Re: signal 11 on AIX: 7.4.2

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: signal 11 on AIX: 7.4.2
Date: 2004-09-17 23:26:55
Message-ID: 414B72BF.1000402@Yahoo.com
Lists: pgsql-hackers

On 4/19/2004 1:18 PM, Jan Wieck wrote:

> Tom Lane wrote:
>
>> Andrew Sullivan <ajs(at)crankycanuck(dot)ca> writes:
>>> On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote:
>>>> I can see from your trace that you are using the getaddrinfo code from
>>>> libc, but where is configure finding a header that declares struct
>>>> addrinfo?
>>
>>> Hrm, I can't seem to tell. I see this in config.log, but it isn't
>>> telling me where it found it. Am I looking in the wrong place?
>>
>> What you'd need to do is determine which system headers are being
>> #include'd by that config test, and then look through them to find
>> struct addrinfo.
>
> Judging by gdb's structure printing, the crashed postgres instance used
> the non-43 compatible 64-bit version of the structure. What I don't
> really get is that the whole exercise seems to have scribbled over the
> stack. The hints pointer originating from the on-stack structure in
> parse_hba is somehow pointing into the blue.

This issue is still open and is hitting us more and more often. So I
would like to add what we have done since, in the hope of getting some
more ideas.

The "scribbled over the stack" part turned out not to be true. The stack
dump is fine when the code is compiled with -O0. The problem persists in
7.4.5.

I have tried to isolate the getaddrinfo() calls by writing a program
that first makes the getaddrinfo() calls done during postmaster startup
and then keeps 100-200 child processes going in a fork()/wait() loop.
Every child process makes the same getaddrinfo() calls a starting
backend would perform while parsing pg_hba.conf. This program does not
crash.
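
For illustration, a test program of that shape could be sketched as
follows. This is not the original program. The names lookup(),
child_lookups() and run_rounds(), the port "5432", the two sample
addresses and the use of SOCK_STREAM in the hints are all assumptions
made for the sketch.

```c
/* Sketch only: a getaddrinfo() stress harness, not the original test program. */
#define _POSIX_C_SOURCE 200112L	/* expose getaddrinfo()/fork() under strict compilers */

#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One lookup; returns 0 on success or a nonzero EAI_* code. */
static int
lookup(const char *host, const char *service, int flags)
{
	struct addrinfo hints;
	struct addrinfo *res = NULL;
	int			rc;

	memset(&hints, 0, sizeof(hints));
	hints.ai_flags = flags;
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;	/* assumption, not taken from hba.c */

	rc = getaddrinfo(host, service, &hints, &res);
	if (rc == 0)
		freeaddrinfo(res);
	return rc;
}

/* Numeric lookups standing in for pg_hba.conf parsing (sample values assumed). */
static int
child_lookups(void)
{
	if (lookup("127.0.0.1", NULL, AI_NUMERICHOST) != 0)
		return 1;
	if (lookup("255.255.255.255", NULL, AI_NUMERICHOST) != 0)
		return 1;
	return 0;
}

/*
 * Do one postmaster-style lookup in the parent, then fork nchildren children
 * per round and wait for them. Returns the number of children that reported
 * a failed lookup or died on a signal, or -1 if the parent lookup failed.
 */
static int
run_rounds(int rounds, int nchildren)
{
	int			failed = 0;

	assert(rounds > 0 && nchildren > 0);
	if (lookup(NULL, "5432", AI_PASSIVE) != 0)
		return -1;

	for (int r = 0; r < rounds; r++)
	{
		int			started = 0;

		for (int i = 0; i < nchildren; i++)
		{
			pid_t		pid = fork();

			if (pid == 0)
				_exit(child_lookups());
			if (pid < 0)
			{
				perror("fork");
				break;			/* reap what was started so far */
			}
			started++;
		}
		for (int i = 0; i < started; i++)
		{
			int			status;

			if (wait(&status) < 0 ||
				!WIFEXITED(status) || WEXITSTATUS(status) != 0)
				failed++;
		}
	}
	return failed;
}
```

Calling run_rounds() from main() with, say, 100-200 children per round
and checking for a nonzero return gives the same kind of loop. A child
killed by SIGSEGV shows up in the count because WIFEXITED() is false for
it.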

So far we have not received a libc with debug symbols from IBM. So I
only know that getaddrinfo() calls getaddrinfo2(), which calls memmove(),
and that memmove() crashes with a SIGSEGV. All the call arguments to
getaddrinfo() look absolutely fine. I hope to get that libc soon, to see
what exactly that memmove() tries to access.

The problem comes and goes. Either I can cause a core dump at will by
running a shell script that makes 100 psql -c "select version()" calls,
or it is next to impossible to crash it at all.

There are numerous reports on the net about getaddrinfo() causing grief
on AIX, and it seems to be IPv6 related. For the moment we intend to
replace the call in getaddrinfo_all() with a slightly limited
implementation based on inet_aton() whenever AI_NUMERICHOST is set. This
will lose us the IPv6 support, as hba.c can no longer parse IPv6
pg_hba.conf lines. So it is not a satisfactory workaround for PostgreSQL.
But I will make that patch available tomorrow night in case someone else
finds it useful.
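
To make the idea concrete, here is a minimal sketch of an
inet_aton()-based replacement for the numeric-host case. It is not the
patch itself. The helper names numeric_ipv4_addrinfo() and
free_numeric_ipv4() are hypothetical, and the sketch leaves out the
service/port handling a real replacement would need.

```c
/* Sketch only: IPv4-only stand-in for getaddrinfo() with AI_NUMERICHOST. */
#define _DEFAULT_SOURCE 1		/* glibc: expose inet_aton() under strict compilers */

#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: build a single AF_INET addrinfo from a numeric IPv4
 * string without calling the system getaddrinfo(). Returns 0 on success or
 * an EAI_* code. The result must be released with free_numeric_ipv4(), not
 * with freeaddrinfo().
 */
static int
numeric_ipv4_addrinfo(const char *node, const struct addrinfo *hints,
					  struct addrinfo **result)
{
	struct in_addr addr;
	struct addrinfo *ai;
	struct sockaddr_in *sin;

	*result = NULL;

	/* inet_aton() knows nothing about IPv6, so "::1" is rejected here. */
	if (node == NULL || inet_aton(node, &addr) == 0)
		return EAI_NONAME;

	ai = calloc(1, sizeof(*ai));
	sin = calloc(1, sizeof(*sin));
	if (ai == NULL || sin == NULL)
	{
		free(ai);
		free(sin);
		return EAI_MEMORY;
	}

	sin->sin_family = AF_INET;
	sin->sin_addr = addr;		/* port stays 0: service lookup is omitted */

	ai->ai_family = AF_INET;
	ai->ai_socktype = hints ? hints->ai_socktype : 0;
	ai->ai_protocol = hints ? hints->ai_protocol : 0;
	ai->ai_addrlen = sizeof(*sin);
	ai->ai_addr = (struct sockaddr *) sin;

	*result = ai;
	return 0;
}

/* Release a result built by numeric_ipv4_addrinfo(). */
static void
free_numeric_ipv4(struct addrinfo *ai)
{
	if (ai != NULL)
	{
		free(ai->ai_addr);
		free(ai);
	}
}
```

The rejection of "::1" in this sketch is exactly the IPv6 loss described
above. Note also that inet_aton() accepts shorthand forms such as
"127.1", which a stricter parser would refuse.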

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #
