Hello,
We are encountering “Exit_status = -2” on a particular node and we are not sure why…
There are no mom_logs, and the server_logs don’t help either.
We did notice that submitting an interactive job after a node reboot just hangs for the user:
$ qsub -I -l walltime=08:00:00 -l host=node0047 -q def-devel
qsub: waiting for job 38178.bright01-thx to start
<…it just HANGS…so I quit…>
^CDo you wish to terminate the job and exit (y|[n])? y
Job 38178.bright01-thx is being deleted
Even though qstat -f 38178 says:
comment = Job run at Tue Aug 14 at 11:01 on (node0047:ncpus=1:mem=1048576kb
:mic_cores=0:ngpus=0)
And then the next job submission attempt just quits immediately with the ‘Exit_status = -2’ error:
$ qsub -I -l walltime=08:00:00 -l host=node0047 -q def-devel
qsub: waiting for job 38179.bright01-thx to start
qsub: job 38179.bright01-thx ready
qsub: job 38179.bright01-thx completed
Any ideas how we can uncover the root cause of this issue?
Thanks,
Siji
adarsh
August 15, 2018, 7:48am
Please increase the mom logging by adding the line below to $PBS_HOME/mom_priv/config, restart the mom service on node0047, and then submit a new job:
$logevent 0xfffffff
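For example, a minimal way to do that (a sketch only; substitute the actual PBS_HOME value from /etc/pbs.conf on node0047, since PBS_HOME is not an exported shell variable):
echo '$logevent 0xfffffff' >> <PBS_HOME>/mom_priv/config
systemctl restart pbs   # on an execution-only node this restarts pbs_mom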
Exit_status -2 means “Job execution failed, after files, no retry” (JOB_EXEC_FAIL2).
The tracejob output and the mom logs would probably help.
It might also be related to ports / firewall between the server and the mom.
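If you want to rule that out, something like the following on node0047 (and on the server host) should show whether the PBS daemons are listening and whether the firewall is blocking them; the port numbers below assume the stock /etc/services entries (15001 server, 15002/15003 mom, 15004 scheduler, 17001 comm), so adjust if yours differ:
ss -tlnp | grep -E ':(1500[1-4]|17001)'   # which PBS daemons are listening locally
firewall-cmd --list-all                   # if firewalld is in use
iptables -L -n | grep 1500                # if plain iptables is in use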
Thank you
@adarsh
After increasing the logging, there still isn’t any mom log to review; in fact, the $PBS_MOM_HOME/mom_logs directory is empty.
tracejob 38822
Job: 38822.bright01-thx
08/16/2018 09:37:40 L Considering job to run
08/16/2018 09:37:40 S Job Queued at request of saula@login-0002, owner = saula@login-0002, job name = STDIN, queue = def-devel
08/16/2018 09:37:40 S Job Run at request of Scheduler@bright01-thx.thunder.ccast on exec_vnode (node0047:ncpus=1:mem=1048576kb:mic_cores=0:ngpus=0)
08/16/2018 09:37:40 S Job Modified at request of Scheduler@bright01-thx.thunder.ccast
08/16/2018 09:37:40 L Job run
08/16/2018 09:37:40 S enqueuing into def-devel, state 1 hop 1
08/16/2018 09:37:40 A queue=def-devel
08/16/2018 09:37:40 A user=saula group=saula_g project=_pbs_project_default jobname=STDIN queue=def-devel ctime=1534430260 qtime=1534430260 etime=1534430260
start=1534430260 exec_host=node0047/0 exec_vnode=(node0047:ncpus=1:mem=1048576kb:mic_cores=0:ngpus=0) Resource_List.host=node0047
Resource_List.mem=1gb Resource_List.mic_cores=0 Resource_List.ncpus=1 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=pack
Resource_List.select=1:host=node0047:ncpus=1 Resource_List.walltime=08:00:00 resource_assigned.mem=1048576kb resource_assigned.ncpus=1
resource_assigned.ngpus=0 resource_assigned.mic_cores=0
08/16/2018 09:39:02 S Obit received momhop:1 serverhop:1 state:4 substate:41
08/16/2018 09:39:06 S Exit_status=-2 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb
resources_used.walltime=00:00:00
08/16/2018 09:39:06 A user=saula group=saula_g project=_pbs_project_default jobname=STDIN queue=def-devel ctime=1534430260 qtime=1534430260 etime=1534430260
start=1534430260 exec_host=node0047/0 exec_vnode=(node0047:ncpus=1:mem=1048576kb:mic_cores=0:ngpus=0) Resource_List.host=node0047
Resource_List.mem=1gb Resource_List.mic_cores=0 Resource_List.ncpus=1 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=pack
Resource_List.select=1:host=node0047:ncpus=1 Resource_List.walltime=08:00:00 session=0 end=1534430346 Exit_status=-2 resources_used.cpupercent=0
resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb resources_used.walltime=00:00:00 run_count=1
Also, what does the “after files” part mean in your interpretation of Exit_status -2?
adarsh
August 16, 2018, 3:21pm
sijisaula:
after files
It means after the staging of the job’s data files (stage-in).
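For context only (not something specific to this failure), file staging is what the qsub -W stagein / -W stageout options request; the hostname and paths in this sketch are just illustrative:
qsub -W stagein=input.dat@bright01-thx:/home/saula/input.dat job.sh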
adarsh
August 16, 2018, 3:23pm
Please check that the pbs_mom service is up and running, and please share your /etc/pbs.conf contents.
If the mom service is up and running, there should be a log file at $PBS_HOME/mom_logs/YYYYMMDD or $PBS_MOM_HOME/mom_logs/YYYYMMDD.
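For example (a sketch; substitute the PBS_HOME / PBS_MOM_HOME values from /etc/pbs.conf, since those are not exported shell variables):
ps -ef | grep pbs_mom
ls -l <PBS_HOME>/mom_logs/ <PBS_MOM_HOME>/mom_logs/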
The mom is up per systemctl, and our pbs.conf is shown below:
[node0047 ~]# systemctl status pbs
● pbs.service - LSB: The Portable Batch System (PBS) is a flexible workload
Loaded: loaded (/etc/rc.d/init.d/pbs; bad; vendor preset: disabled)
Active: active (running) since Wed 2018-08-15 09:33:44 CDT; 1 day 1h ago
Docs: man:systemd-sysv-generator(8)
Process: 72288 ExecStop=/etc/rc.d/init.d/pbs stop (code=exited, status=0/SUCCESS)
Process: 72335 ExecStart=/etc/rc.d/init.d/pbs start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/pbs.service
└─72405 /cm/shared/apps/pbspro-ce/current/sbin/pbs_mom
Aug 15 09:33:44 node0047 systemd[1]: Starting LSB: The Portable Batch System (PBS) is a flexible workload…
Aug 15 09:33:44 node0047 pbs[72335]: Starting PBS
Aug 15 09:33:44 node0047 pbs[72335]: PBS mom
Aug 15 09:33:44 node0047 systemd[1]: Started LSB: The Portable Batch System (PBS) is a flexible workload.
[node0047 ~]# cat /etc/pbs.conf
PBS_EXEC=/cm/shared/apps/pbspro-ce/current
PBS_HOME=/cm/local/apps/pbspro-ce/var/spool
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_MOM_HOME=/cm/local/apps/pbspro-ce/var/spool
PBS_START_MOM=1
PBS_START_COMM=0
PBS_SERVER=bright01-thx
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh
PBS_CORE_LIMIT=unlimited
adarsh
August 16, 2018, 4:43pm
Thank you. Could you please search for the 20180816 log file on the mom node?
find / -name "20180816" -print
Still nothing…
[node0047 ~]# find / -name "20180816" -print
[node0047 ~]#
I’m doing some other digging as well, but let me know if you have other thoughts.
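For reference, these are the sorts of checks I’m trying next on node0047, using the paths from our pbs.conf above (just a sketch of commands, no output yet):
ls -ld /cm/local/apps/pbspro-ce/var/spool/mom_logs   # does the log directory exist and is it writable?
df -h /cm/local/apps/pbspro-ce/var/spool             # is the filesystem full or read-only?
journalctl -u pbs --since 2018-08-15                 # anything from pbs_mom in the system journal?
grep -i pbs /var/log/messages | tail                 # or check syslog directly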