I had been running PBS Pro 17.1.0 on our CentOS 6.6 server for several weeks without problems.
Today I wrote my own hook (unfortunately it included an error), installed it, and restarted pbs.
After that, pbs_server no longer starts, complaining:
Because pbs_server is not running, I was unable to uninstall my hook using qmgr.
Then I tried the following:
stop pbs
remove /var/spool/pbs
download the newest source code from GitHub
compile and install it
configure it as before, without installing my hook
start pbs
But the situation did not change: pbs_server still aborts with a segmentation fault.
My questions are threefold:
Q1. How can I uninstall a hook without qmgr?
Q2. Does the installation of a hook change something outside /var/spool/pbs?
Q3. How can I investigate the segmentation fault issue?
server_logs looks like this:
11/14/2017 21:13:19;0002;Server@host01;Svr;Log;Log opened
11/14/2017 21:13:19;0002;Server@host01;Svr;Server@host01;pbs_version=17.1.0
11/14/2017 21:13:19;0002;Server@host01;Svr;Server@host01;pbs_build=mach=N/A:security=N/A:configure_args=N/A
11/14/2017 21:13:19;0002;Server@host01;Svr;Server@host01;hostname=host01;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
11/14/2017 21:13:19;0002;Server@host01;Svr;Server@host01;ipv4 interface lo: localhost localhost.localdomain localhost4 localhost4.localdomain4
11/14/2017 21:13:19;0002;Server@host01;Svr;Server@host01;ipv4 interface bond0: host01
11/14/2017 21:13:19;0002;Server@host01;Svr;Server@host01;ipv6 interface lo: localhost localhost.localdomain localhost6 localhost6.localdomain6
11/14/2017 21:13:19;0006;Server@host01;Fil;Server@host01;Version 17.1.0, started, initialization type = 1
11/14/2017 21:13:19;0002;Server@host01;Svr;Server@host01;pbs_status_db exit code 1
11/14/2017 21:13:19;0002;Server@host01;Svr;Server@host01;Starting PBS dataservice
11/14/2017 21:13:22;0002;Server@host01;Svr;Server@host01;connected to PBS dataservice@host01
11/14/2017 21:13:23;0086;Server@host01;Svr;pbs_python_ext_quick_start_interpreter;--> Python Interpreter quick started, compiled with version:'2.6.6' <--
11/14/2017 21:13:23;0086;Server@host01;Svr;pbs_python_ext_quick_start_interpreter;--> Inserted Altair PBS Python modules dir '/opt/pbs/lib/python/altair' <--
11/14/2017 21:13:23;0002;Server@host01;Fil;Server@host01;PBS Server hostname is host01, Server-id is 1
11/14/2017 21:13:23;0002;Server@host01;n/a;setup_env;read environment from /var/spool/pbs/pbs_environment
11/14/2017 21:13:23;0004;Server@host01;Svr;Server@host01;node_fail_requeue value changed to 310
I am ready to provide other logs etc. if needed.
I would greatly appreciate your help.
I would suggest you look for any file that starts with “core” under PBS_HOME (/var/spool/pbs). You may use the core file in conjunction with the binary you compiled and gdb to print a stack trace which should indicate the exact location where the failure occurred. Providing a stack trace would be very useful. If you built and installed RPMs, you may need to install the debuginfo RPM in order to print a stack trace with symbol names.
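The steps above could be sketched in shell roughly as follows. This is only an illustration, assuming the default PBS_HOME of /var/spool/pbs and the server binary at /opt/pbs/sbin/pbs_server.bin; the function names find_cores and print_backtrace are made up for this example, and the actual core file name depends on the kernel's core_pattern setting.

```shell
#!/bin/sh
# List candidate core files under PBS_HOME.
# $1 = directory to search (defaults to /var/spool/pbs, an assumption).
find_cores() {
    find "${1:-/var/spool/pbs}" -type f -name 'core*' 2>/dev/null
}

# Print a full backtrace without entering the interactive gdb prompt.
# $1 = binary that crashed, $2 = core file.
print_backtrace() {
    gdb -batch -ex 'bt full' "$1" "$2"
}

# Example (not run here):
#   for core in $(find_cores); do
#       print_backtrace /opt/pbs/sbin/pbs_server.bin "$core"
#   done
```

Running gdb in batch mode makes it easy to redirect the whole trace into a file and attach it to a post.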
There is no “supported” method to uninstall a hook when the server is down. Deleting the appropriate files in /var/spool/pbs/server_priv/hooks and /var/spool/pbs/mom_priv/hooks should accomplish what you want, but you didn’t hear it from me.
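If you go down that unsupported route, moving the files aside is safer than deleting them outright. A minimal sketch, assuming PBS is stopped and that a hook's files are named after the hook plus a suffix (that naming is an assumption, so check the directory contents first); quarantine_hook, "myhook", and the backup path are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Move every file belonging to one hook out of a hooks directory.
# $1 = hooks directory, $2 = hook name, $3 = backup directory.
# Assumption: the hook's files are named "<hook name>.<suffix>".
quarantine_hook() {
    hooks_dir=$1 hook=$2 backup=$3
    mkdir -p "$backup" || return 1
    for f in "$hooks_dir/$hook".*; do
        [ -e "$f" ] || continue   # the glob matched nothing
        mv -- "$f" "$backup"/ || return 1
    done
}

# Example (not run here; stop PBS first, "myhook" is a placeholder):
#   quarantine_hook /var/spool/pbs/server_priv/hooks myhook /root/pbs-hooks-backup
#   quarantine_hook /var/spool/pbs/mom_priv/hooks    myhook /root/pbs-hooks-backup
```

Keeping a backup means the files can be put back if removing them turns out not to be the fix.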
The installation of a hook should not change anything outside of PBS_HOME.
$ gdb /opt/pbs/sbin/pbs_server.bin coredump
<snip>
Reading symbols from /opt/pbs/sbin/pbs_server.bin...done.
warning: core file may not match specified executable file.
[New Thread 20902]
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib64/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `/usr/bin/postgres -D /var/spool/pbs/datastore -p 15007'.
Program terminated with signal 6, Aborted.
#0 0x000000360dc32625 in ?? ()
(gdb) bt full
#0 0x000000360dc32625 in ?? ()
No symbol table info available.
#1 0x000000360dc33e05 in ?? ()
No symbol table info available.
#2 0x00007fff51139290 in ?? ()
No symbol table info available.
#3 0x00000000025b6a30 in ?? ()
No symbol table info available.
#4 0x0000000000000000 in ?? ()
No symbol table info available.
I’m sorry but I’m not very familiar with gdb. Any suggestion would be appreciated.
Thank you,
It appears as though the core file was generated by Postgres and not by PBS Pro. The Postgres folks may find this very useful. Try specifying /usr/bin/postgres as your binary and printing another backtrace. You may need to install the postgres-debuginfo package in order to print the symbol names, which would be very useful to a developer.
Please do follow up on this forum, but it looks like the current problem is with Postgres.
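To avoid pairing a core file with the wrong binary, the program name can be pulled out of gdb's "Core was generated by" line before printing the real backtrace. A small sketch, assuming gdb output in the format shown above; the function name core_generator is made up for this example.

```shell
#!/bin/sh
# Read gdb output on stdin and print the path of the program that dumped core,
# taken from gdb's "Core was generated by `...'." line.
# Prints nothing if no such line is present.
core_generator() {
    sed -n "s/^Core was generated by \`\\([^ ']*\\).*/\\1/p" | head -n 1
}

# Example (not run here):
#   bin=$(gdb -batch /opt/pbs/sbin/pbs_server.bin coredump 2>&1 | core_generator)
#   gdb -batch -ex 'bt full' "$bin" coredump
```

For the core file in this thread, that would select /usr/bin/postgres rather than pbs_server.bin.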
Thank you for your kind suggestions.
I agree that the problem is with PostgreSQL 8.4.20-8.
I’m afraid our Postgres is somewhat outdated, but I cannot update it immediately.
I’m not even sure whether the Postgres folks would accept my bug report…
Meanwhile I’ll try to clean up (i.e. rebuild from scratch) our PBS database used by Postgres.
I’ll report the results later.
Below is the backtrace (just for reference):
$ gdb /usr/bin/postgres coredump
<snip>
Core was generated by `/usr/bin/postgres -D /var/spool/pbs/datastore -p 15007'.
Program terminated with signal 6, Aborted.
#0 0x000000360dc32625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 cyrus-sasl-lib-2.1.23-15.el6_6.1.x86_64 keyutils-libs-1.4-5.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 nspr-4.10.6-1.el6_5.x86_64 nss-3.16.2.3-3.el6_6.x86_64 nss-util-3.16.2.3-2.el6_6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) set logging on
Copying output to gdb.txt.
(gdb) bt full
#0 0x000000360dc32625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
resultvar = 0
pid = <value optimized out>
selftid = 20902
#1 0x000000360dc33e05 in abort () at abort.c:92
save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x7fff51139290, sa_sigaction = 0x7fff51139290},
sa_mask = {__val = {39545392, 0, 11296384, 140734553625472, 8257192, 39554368,
5406593250411399504, 140734553625584, 11296384, 140734553625584, 11296384,
140734553625584, 0, 39545120, 6930394, 139911567057000}}, sa_flags = 8128155,
sa_restorer = 0xac5e80}
sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x000000000069c3f0 in errfinish (dummy=<value optimized out>) at elog.c:526
edata = 0xac5e80
elevel = 22
oldcontext = 0x25b6a30
econtext = 0x0
__func__ = "errfinish"
#3 0x0000000000481a97 in ReadControlFile () at xlog.c:4338
crc = <value optimized out>
fd = -1
__func__ = "ReadControlFile"
#4 0x0000000000484917 in XLOGShmemInit () at xlog.c:4629
foundCFile = 0 '\000'
foundXLog = 0 '\000'
allocptr = <value optimized out>
#5 0x00000000005d9ad2 in CreateSharedMemoryAndSemaphores (makePrivate=0 '\000', port=15007)
at ipci.c:182
__func__ = "CreateSharedMemoryAndSemaphores"
#6 0x00000000005bf00b in reset_shared () at postmaster.c:2032
No locals.
#7 PostmasterStateMachine () at postmaster.c:2936
__func__ = "PostmasterStateMachine"
#8 0x00000000005bf3d5 in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2478
save_errno = 4
pid = 0
exitstatus = <value optimized out>
status = 512
__func__ = "reaper"
#9 <signal handler called>
No symbol table info available.
#10 0x000000360dce1353 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
No locals.
#11 0x00000000005bdef2 in ServerLoop () at postmaster.c:1348
timeout = {tv_sec = 59, tv_usec = 996913}
rmask = {fds_bits = {224, 0 <repeats 15 times>}}
selres = <value optimized out>
readmask = {fds_bits = {224, 0 <repeats 15 times>}}
nSockets = 8
now = <value optimized out>
last_touch_time = 1510660478
__func__ = "ServerLoop"
#12 0x00000000005c0ac7 in PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:1041
opt = <value optimized out>
status = <value optimized out>
userDoption = <value optimized out>
i = <value optimized out>
__func__ = "PostmasterMain"
#13 0x000000000056b6a0 in main (argc=5, argv=0x25b6720) at main.c:198
No locals.
The real culprit is not my hook but default_qsub_arguments!
Just doing “set server default_qsub_arguments = ‘-Wsandbox=PRIVATE’” causes pbs_server to crash with a segmentation fault.
This is 100% reproducible in my environment.
Once pbs_server starts crashing, there is no way to unset default_qsub_arguments other than “rm -rf $PBS_HOME”… that’s a big problem.
The complete procedure for reproducing the problem is as follows:
$ cat /etc/redhat-release
CentOS release 6.6 (Final)
$ uname -a
Linux host01 3.14.26-1.el6.x86_64 #1 SMP Tue Dec 16 22:52:26 JST 2014 x86_64 x86_64 x86_64 GNU/Linux
$ sudo service pbs stop
$ sudo rm -rf /opt/pbs/ /var/spool/pbs/
$ git clone https://github.com/PBSPro/pbspro.git
$ cd pbspro
$ git show
commit b45690493a14402b6b7e10be89c5df093f1256d7
$ ./autogen.sh && ./configure --prefix=/opt/pbs
$ make && sudo make install
$ sudo service pbs start
Starting PBS
PBS Home directory /var/spool/pbs does not exist.
Running /opt/pbs/libexec/pbs_habitat to create it.
***
*** WARNING: PBS_HOME not found in /var/spool/pbs
*** PBS Installation Summary
***
*** Postinstall script called as follows:
*** /opt/pbs/libexec/pbs_postinstall server 17.1.0 /opt/pbs /var/spool/pbs '' sameconf
***
*** PBS_HOME is /var/spool/pbs
*** Setting TZ from /etc/sysconfig/clock
*** Creating new file /var/spool/pbs/pbs_environment
***
*** The PBS Pro server has been installed in /opt/pbs/sbin.
*** The PBS Pro scheduler has been installed in /opt/pbs/sbin.
***
*** The PBS Pro communication agent has been installed in /opt/pbs/sbin.
***
*** The PBS Pro MOM has been installed in /opt/pbs/sbin.
***
*** The PBS commands have been installed in /opt/pbs/bin.
***
*** End of /opt/pbs/libexec/pbs_postinstall
*** Setting default queue and resource limits.
***
Connecting to PBS dataservice....connected to PBS dataservice@nlp-iax-00
*** End of /opt/pbs/libexec/pbs_habitat
Home directory /var/spool/pbs created.
PBS comm
/opt/pbs/sbin/pbs_comm ready (pid=11083), Proxy Name:nlp-iax-00:17001, Threads:4
PBS mom
Creating usage database for fairshare.
PBS sched
Connecting to PBS dataservice....connected to PBS dataservice@nlp-iax-00
Licenses valid for 10000000 Floating hosts
PBS server
$ sudo /opt/pbs/bin/qmgr -c "set server default_qsub_arguments = '-Wsandbox=PRIVATE'"
$ sudo service pbs restart
Restarting PBS
Stopping PBS
Shutting server down with qterm.
PBS server - was pid: 11412
PBS mom - was pid: 11106
PBS sched - was pid: 11109
PBS comm - was pid: 11083
Waiting for shutdown to complete
Starting PBS
PBS comm
/opt/pbs/sbin/pbs_comm ready (pid=12143), Proxy Name:nlp-iax-00:17001, Threads:4
PBS mom
Creating usage database for fairshare.
PBS sched
Connecting to PBS dataservice....connected to PBS dataservice@nlp-iax-00
/etc/init.d/pbs: line 282: 12190 Segmentation fault (core dumped) ${PBS_EXEC}/sbin/pbs_server
pbs_server startup failed, exit 139 aborting.
I’m happy to provide any logs etc. that would help with debugging.
Thank you,