[Fwd: [HACKERS] PostgreSQL ES3.0 problems?]

From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: [Fwd: [HACKERS] PostgreSQL ES3.0 problems?]
Date: 2004-03-26 00:31:54
Message-ID: 406379FA.7040503@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

-------- Original Message --------
Subject: [HACKERS] PostgreSQL ES3.0 problems?
Date: Wed, 24 Mar 2004 17:06:23 -0800
From: Joshua D. Drake <jd(at)commandprompt(dot)com>
Organization: Command Prompt, Inc.
To: pgsql-hackers(at)postgresql(dot)org

Hello,

About a week ago we had a report of replicator not working correctly on
ES 3.0 when running with more than one CPU. Replicator worked perfectly
when utilizing only one CPU.

The machine was a Quad Xeon with HyperThreading, SCSI RAID array and 4
Gig of RAM.

During our testing we found issues with - nocachegetattr.

Our tests showed the following output using gdb:

================== GDB LOG ======================
[postgres(at)es30 postgresql-7.3.6]$ gdb /usr/local/pgsql/bin/postmaster
/home/postgres/data_master/base/16976/core.858
GNU gdb Red Hat Linux (5.3.90-0.20030710.40rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for
details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host
libthread_db library "/lib/tls/libthread_db.so.1".

Core was generated by `postgres: postgres test [local]
SELECT '.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libreadline.so.4...done.
Loaded symbols for /usr/lib/libreadline.so.4
Reading symbols from /lib/libtermcap.so.2...done.
Loaded symbols for /lib/libtermcap.so.2
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0 0x0806b729 in nocachegetattr (tuple=0x82992d8, attnum=17,
tupleDesc=0x0, isnull=0xbfffae2f "") at heaptuple.c:406
406 usecache = false;
(gdb) l
401 }
402
403 off = att_addlength(off, att[i]->attlen, tp + off);
404
405 if (usecache && att[i]->attlen <= 0)
406 usecache = false;
407 }
408
409 off = att_align(off, att[attnum]->attalign);
410
(gdb)

Our testing was performed on:

Dual Athlon MP 2800 (Barton)
2 Gig of Registered ECC ram
3Ware 0+1 IDE Array
LVM

The log output was thus:

[postgres(at)es30 postgresql-7.3.6]$ /usr/local/pgsql/bin/postmaster -D
~/data_master
DEBUG: /usr/local/pgsql/bin/postmaster: PostmasterMain: initial environ
dump:
DEBUG: -----------------------------------------
DEBUG: HOSTNAME=es30
DEBUG: TERM=xterm
DEBUG: SHELL=/bin/bash
DEBUG: HISTSIZE=1000
DEBUG: SSH_CLIENT=192.168.1.65 1745 22
DEBUG: SSH_TTY=/dev/pts/2
DEBUG: USER=postgres
DEBUG:
LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
DEBUG: MAIL=/var/spool/mail/postgres
DEBUG:
PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/postgres/bin
DEBUG: INPUTRC=/etc/inputrc
DEBUG: PWD=/home/postgres/postgresql-7.3.6
DEBUG: LANG=en_US.UTF-8
DEBUG: SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
DEBUG: SHLVL=1
DEBUG: HOME=/home/postgres
DEBUG: LOGNAME=postgres
DEBUG: SSH_CONNECTION=192.168.1.65 1745 192.168.1.199 22
DEBUG: LESSOPEN=|/usr/bin/lesspipe.sh %s
DEBUG: G_BROKEN_FILENAMES=1
DEBUG: _=/usr/local/pgsql/bin/postmaster
DEBUG: OLDPWD=/home/postgres/src/SOURCES/postgresql-7.3.6
DEBUG: -----------------------------------------
DEBUG: FindExec: found "/usr/local/pgsql/bin/postgres" using argv[0]
DEBUG: invoking IpcMemoryCreate(size=36421632)
DEBUG: FindExec: found "/usr/local/pgsql/bin/postmaster" using argv[0]
LOG: database system was shut down at 2004-03-19 06:48:17 PST
LOG: checkpoint record is at 0/58E2B4F8
LOG: redo record is at 0/58E2B4F8; undo record is at 0/0; shutdown TRUE
LOG: next transaction id: 630; next oid: 217646
LOG: database system is ready
DEBUG: proc_exit(0)
DEBUG: shmem_exit(0)
DEBUG: exit(0)
DEBUG: reaping dead processes
DEBUG: /usr/local/pgsql/bin/postmaster child[858]: starting with (
DEBUG: postgres
DEBUG: -v131072
DEBUG: -p
DEBUG: test
DEBUG: )
DEBUG: InitPostgres
DEBUG: BackendStartup: forked pid=858 socket=8
DEBUG: StartTransactionCommand
DEBUG: ProcessUtility
DEBUG: CommitTransactionCommand
DEBUG: StartTransactionCommand
DEBUG: ProcessQuery
DEBUG: ProcessUtility
DEBUG: CommitTransactionCommand
DEBUG: StartTransactionCommand
DEBUG: ProcessUtility
DEBUG: CommitTransactionCommand
DEBUG: StartTransactionCommand
DEBUG: ProcessQuery
DEBUG: ProcessUtility
DEBUG: CommitTransactionCommand
DEBUG: StartTransactionCommand
DEBUG: ProcessQuery
DEBUG: reaping dead processes
DEBUG: child process (pid 858) was terminated by signal 11
LOG: server process (pid 858) was terminated by signal 11
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing shared memory and
semaphores
DEBUG: shmem_exit(0)
DEBUG: invoking IpcMemoryCreate(size=36421632)
LOG: database system was interrupted at 2004-03-19 06:52:03 PST
LOG: checkpoint record is at 0/58E2B4F8
LOG: redo record is at 0/58E2B4F8; undo record is at 0/0; shutdown TRUE
LOG: next transaction id: 630; next oid: 217646
LOG: database system was not properly shut down; automatic recovery in
progress
FATAL: The database system is starting up
DEBUG: proc_exit(0)
DEBUG: shmem_exit(0)
DEBUG: exit(0)
DEBUG: BackendStartup: forked pid=860 socket=8
DEBUG: reaping dead processes
DEBUG: child process (pid 860) exited with exit code 0
LOG: ReadRecord: record with zero length at 0/58E2B538
LOG: redo is not required
LOG: database system is ready
DEBUG: proc_exit(0)
DEBUG: shmem_exit(0)
DEBUG: exit(0)
DEBUG: reaping dead processes

After chasing our tails a little bit we decided to test the community
PostgreSQL 7.3.6. We were able to produce similar results with
PostgreSQL 7.3.6. Again only in SMP mode, only on ES 3.0.

Similar problems occured when trying to do a select * from a large
table (about 2Gig in size with about 1.5 million rows). Here is a
strings from a core dump that is associated:

CORE
CORE
postmaster
postgres: postgres test [local] SELECT
CORE
postmaster
CORE
FLINUX
hat exists.
out of memory in PortalHashTable
trying to delete portal name that does not exist.
ltsWriteBlock: failed to write block %ld of temporary file
Perhaps out of disk space?
ltsReadBlock: failed to read block %ld of temporary file
LogicalTapeWrite: impossible state
LogicalTapeBackspace: unexpected end of tape
LogicalTapeSeek: unexpected end of tape
tuplesort_puttuple: invalid state
tuplesort_performsort: invalid state
tuplesort_gettuple: invalid state
tuplesort_gettuple: bogus tuple len in backward scan
tuplesort_rescan: invalid state
tuplesort_markpos: invalid state
tuplesort_restorepos: invalid state
tuplesort: unexpected end of tape
tuplesort: unexpected end of data
SelectSortFunction: cache lookup failed for operator %u
Cannot create unique index. Table contains non-unique values
copytup_datum() should not be called
tuplestore_puttuple: invalid state
tuplestore_donestoring: invalid state
tuplestore_donestoring: seek(0) failed
tuplestore_gettuple: invalid state
tuplestore_gettuple: bogus tuple len in backward scan
tuplestore_rescan: invalid state
tuplestore_rescan: seek(0) failed
tuplestore_markpos: invalid state
tuplestore_restorepos: invalid state
tuplestore: unexpected end of tape
tuplestore: unexpected end of data
CopyQuerySnapshot: no snapshot has been set
pg_char_to_encname_struct(): encoding name too long
UtfToLocal: could not convert UTF-8 (0x%04x). Ignored
LocalToUtf: could not convert (0x%04x) %s to UTF-8. Ignored
default conversion proc for %s to %s not found
default conversion proc %u for %s to %s not found in pg_proc
Invalid destination encoding name %s
Invalid source encoding name %s
SetDatabaseEncoding(): invalid database encoding
Invalid %s character sequence found (0x
Unicode >= 0x10000 is not supported
MbP?
@UUUUUU
A. K.
bool
bytea
char
name
int2
int2vector
int4
regproc
regclass
regtype
text
oidvector
smgr
_int4
_aclitem
ctid
xmin
cmin
xmax
cmax
tableoid

MbP?{
-infinity
abstime
acsst
*acst
aesst

I tried running the pg_dump in single proc mode and the machine ran just
fine. Once we put it in SMP mode and ran the dump again it trashed the
machine we had to reboot. Once we rebooted we forced an FSCK and then we
kernel panicked. We now have fairly extensive FS corruption ;)

Any thoughts?

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd(at)commandprompt(dot)com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd(at)commandprompt(dot)com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL

Attachment Content-Type Size
jd.13211DEFANGED-vcf application/defanged 641 bytes
file:///tmp/nsmail-1.asc text/plain 196 bytes
jd.vcf text/x-vcard 640 bytes

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2004-03-26 00:36:33 Re: Problem with Serializable transactions
Previous Message Tom Lane 2004-03-25 22:32:59 Re: Fwd: Infinite recursion in function causes DoS