Normally I can run a program (hostname, for example) on two hosts with a command like this:
mpirun -N 1 -machinefile ./nodes2 hostname
Without PBS, I get both hostnames back.
Recently I started using PBS Pro, which I installed through yum. I wrote a script (script.sh) as follows:
I ran “qsub script.sh” but it didn’t work. The mail I received said “Job cannot be executed”.
Then I changed “place=scatter” to “place=free” and it worked, but both processes ran on one host.
I have checked with “pbsnodes -a” and confirmed that the 7 slave nodes exist.
How can I solve this problem?
Enable job history so that you can inspect finished jobs (history is recorded only from the moment you enable it):
qmgr -c "set server job_history_enable=true" # history is kept for 14 days by default
qmgr -c "set server job_history_duration=24:00:00" # optional: keep history for only 24 hours
The job ran on ohpc3-cn1 and ohpc3-cn2.
You have some hostname resolution issues.
Please make sure the hostnames are resolvable (and reverse resolvable) to their static IP addresses.
(Using dynamic IP addresses is not recommended.)
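A minimal sketch of that forward and reverse check, assuming `getent` is available on the nodes. `check_resolution` is a hypothetical helper name for illustration, not a PBS command.

```shell
#!/bin/bash
# check_resolution HOST
# Prints "HOST -> IP -> NAME" and returns 0 only if both lookups succeed.
# Hypothetical helper; it relies only on the standard getent and awk tools.
check_resolution() {
    local host="$1" ip back
    # Forward lookup: take the first address the resolver returns.
    ip=$(getent hosts "$host" | awk '{print $1; exit}')
    if [ -z "$ip" ]; then
        echo "$host: no forward resolution"
        return 1
    fi
    # Reverse lookup: getent accepts an address as the key.
    back=$(getent hosts "$ip" | awk '{print $2; exit}')
    echo "$host -> $ip -> ${back:-none}"
    [ -n "$back" ]
}
```

Running `check_resolution ohpc3-cn1` and `check_resolution ohpc3-cn2` on the head node and on each compute node should print the same address and name everywhere; any difference points at an inconsistent hosts file or DNS entry.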
I tried running it again and got somewhat different output from tracejob. But I think it is still the same problem, because the message “Discard running job, A sister Mom failed to delete job” is still there.
09/26/2018 18:49:40 S enqueuing into workq, state 1 hop 1
09/26/2018 18:49:40 S Job Queued at request of hpc@sms3, owner = hpc@sms3, job name = JOB1, queue = workq
09/26/2018 18:49:40 S Job Modified at request of Scheduler@sms3
09/26/2018 18:49:40 S Obit received momhop:1 serverhop:1 state:4 substate:41
09/26/2018 18:49:41 S Obit received momhop:2 serverhop:2 state:4 substate:41
09/26/2018 18:49:41 S Obit received momhop:3 serverhop:3 state:4 substate:41
09/26/2018 18:49:41 S Obit received momhop:4 serverhop:4 state:4 substate:41
09/26/2018 18:49:42 S Obit received momhop:5 serverhop:5 state:4 substate:41
09/26/2018 18:49:42 S Obit received momhop:6 serverhop:6 state:4 substate:41
09/26/2018 18:49:42 S Obit received momhop:7 serverhop:7 state:4 substate:41
09/26/2018 18:49:42 S Obit received momhop:8 serverhop:8 state:4 substate:41
09/26/2018 18:49:43 S Obit received momhop:9 serverhop:9 state:4 substate:41
09/26/2018 18:49:43 S Obit received momhop:10 serverhop:10 state:4 substate:41
09/26/2018 18:49:43 S Obit received momhop:11 serverhop:11 state:4 substate:41
09/26/2018 18:49:44 S Obit received momhop:12 serverhop:12 state:4 substate:41
09/26/2018 18:49:44 S Obit received momhop:13 serverhop:13 state:4 substate:41
09/26/2018 18:49:44 S Obit received momhop:14 serverhop:14 state:4 substate:41
09/26/2018 18:49:45 S Obit received momhop:15 serverhop:15 state:4 substate:41
09/26/2018 18:49:45 S Obit received momhop:16 serverhop:16 state:4 substate:41
09/26/2018 18:49:45 S Obit received momhop:17 serverhop:17 state:4 substate:41
09/26/2018 18:49:46 L Considering job to run
09/26/2018 18:49:46 S Job Run at request of Scheduler@sms3 on exec_vnode (ohpc3-cn1:ncpus=1)+(ohpc3-cn2:ncpus=1)
09/26/2018 18:49:46 S Discard running job, A sister Mom failed to delete job
09/26/2018 18:49:46 S Job requeued, execution node down
09/26/2018 18:49:46 L Job run
09/26/2018 18:49:46 S Obit received momhop:18 serverhop:18 state:4 substate:41
09/26/2018 18:49:46 S Obit received momhop:19 serverhop:19 state:4 substate:41
09/26/2018 18:49:46 S Obit received momhop:20 serverhop:20 state:4 substate:41
09/26/2018 18:49:46 S Obit received momhop:21 serverhop:21 state:4 substate:41
Do you have the same /etc/hosts file populated on all the compute nodes?
Please note:
09/26/2018 16:24:38 S Job Run at request of Scheduler@sms3 on exec_vnode (ohpc3-cn1:ncpus=1)+(ohpc3-cn2:ncpus=1)
ohpc3-cn1 == mother superior node
ohpc3-cn2 == sister node
your qstat -fx 33 output
Please check the mom logs on both of these nodes, ohpc3-cn1 and ohpc3-cn2,
as the root user on each node:
source /etc/pbs.conf
cd $PBS_HOME/mom_logs
vi 20180926 # check the job id 33 and logs associated with it
The log should show the issue clearly (or you can share those two files).
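If the log is long, something like the following sketch can pull out one job's records and tally its messages. It assumes the `;`-separated mom log layout (timestamp;code;daemon;class;object;message); `job_lines` and `job_summary` are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: filter a mom log by job id and summarise the messages.
# Assumes records look like: timestamp;code;daemon;class;object;message

# job_lines LOGFILE JOBID -> records whose object field is exactly JOBID
job_lines() {
    awk -F';' -v id="$2" '$5 == id' "$1"
}

# job_summary LOGFILE JOBID -> distinct messages with counts, most frequent first
job_summary() {
    job_lines "$1" "$2" | cut -d';' -f6- | sort | uniq -c | sort -rn
}
```

For example, `job_summary 20180926 33.sms3` run inside `$PBS_HOME/mom_logs` would list each distinct message for that job once, with a count, which makes a repeating error easy to spot.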
I wasn’t using the same /etc/hosts on all the compute nodes. I’ve fixed that and run the script again.
It failed again.
I took mom_logs/20180926 for job 35 (the newest job run above; there is only one file because I use OpenHPC and the root filesystem is shared) and found that the log repeats the following output:
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;kill_job
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;no active tasks
09/26/2018 20:21:33;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;Obit sent
09/26/2018 20:21:33;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0080;pbs_mom;Job;35.sms3;delete job request received
09/26/2018 20:21:33;0001;pbs_mom;Job;35.sms3;Unable to send delete job request to one or more sisters
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;kill_job
09/26/2018 20:21:33;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;kill_job
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;no active tasks
09/26/2018 20:21:33;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/26/2018 20:21:33;0100;pbs_mom;Job;35.sms3;Obit sent
09/26/2018 20:21:33;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/26/2018 20:21:33;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/26/2018 20:21:33;0080;pbs_mom;Job;35.sms3;delete job request received
09/26/2018 20:21:33;0001;pbs_mom;Job;35.sms3;Unable to send delete job request to one or more sisters
09/26/2018 20:21:33;0008;pbs_mom;Job;35.sms3;kill_job
09/26/2018 20:21:33;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001
I’ve reinstalled PBS on all nodes. It didn’t help.
I’ve checked name resolution with the ping and pbs_hostn commands. The results are correct. By the way, the firewalld service on the master node is disabled, and the compute nodes have no firewalld service. Does this matter?
Now I wonder whether I’m missing some network setting.
I changed “select=2” to “select=4” in the script and checked the mom logs; all compute nodes report the same errors.
I’ve collected the mom log from ohpc3-cn2 (126.26.136.123) as follows:
[hpc@ohpc3-cn2 mom_logs]$ cat /var/spool/pbs/mom_logs/20180927
09/27/2018 11:01:06;0002;pbs_mom;Svr;Log;Log opened
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;pbs_version=14.1.2
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
09/27/2018 11:01:06;0100;pbs_mom;Svr;parse_config;file config
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.121 as authorized
09/27/2018 11:01:06;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
09/27/2018 11:01:06;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
09/27/2018 11:01:06;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
09/27/2018 11:01:06;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP set to use reserved port authentication
09/27/2018 11:01:06;0c06;pbs_mom;TPP;pbs_mom(Main Thread);TPP leaf node names = 127.0.0.1:15003,126.26.136.123:15003
09/27/2018 11:01:06;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
09/27/2018 11:01:06;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 1024
09/27/2018 11:01:06;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Max files too low - you may want to increase it.
09/27/2018 11:01:06;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
09/27/2018 11:01:06;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Single pbs_comm configured, TPP Fault tolerant mode disabled
09/27/2018 11:01:06;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm sms3:17001
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.123 as authorized
09/27/2018 11:01:06;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
09/27/2018 11:01:06;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
09/27/2018 11:01:06;0002;pbs_mom;n/a;initialize;pcpus=64, OS reports 64 cpu(s)
09/27/2018 11:01:06;0006;pbs_mom;Fil;pbs_mom;Version 14.1.2, started, initialization type = 0
09/27/2018 11:01:06;0002;pbs_mom;Svr;pbs_mom;Mom pid = 14889 ready, using ports Server:15001 MOM:15002 RM:15003
09/27/2018 11:01:36;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 126.26.136.123:15003 to pbs_comm
09/27/2018 11:01:36;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm sms3:17001
09/27/2018 11:01:36;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
09/27/2018 11:01:36;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at sms3:15001
09/27/2018 11:01:36;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 0, Received noroute to dest 126.26.136.121:15001, msg="tfd=21, pbs_comm:126.26.136.121:17001: Dest not found"
09/27/2018 11:02:43;0002;pbs_mom;Svr;pbs_mom;Hello from server at 126.26.136.121:15001
09/27/2018 11:02:43;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.122 as authorized
09/27/2018 12:04:25;0002;pbs_mom;Svr;pbs_mom;Hello from server at 126.26.136.121:15001
09/27/2018 12:04:25;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.124 as authorized
09/27/2018 12:04:27;0002;pbs_mom;Svr;pbs_mom;Hello from server at 126.26.136.121:15001
09/27/2018 12:04:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.125 as authorized
09/27/2018 12:05:11;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:12;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:13;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:14;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:15;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:15;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:15;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:16;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:16;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:16;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:17;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:17;0021;pbs_mom;Job;1.sms3;rename in job_save failed
Type 1 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.124:15003 could not JOIN_JOB successfully
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn3 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn4 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.125:15003 could not JOIN_JOB successfully
09/27/2018 12:05:18;0021;pbs_mom;Job;1.sms3;rename in job_save failed
09/27/2018 12:05:18;0021;pbs_mom;Job;1.sms3;rename in job_save failed
b_save, error on open
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;Obit sent
09/27/2018 12:05:28;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0080;pbs_mom;Job;1.sms3;delete job request received
09/27/2018 12:05:28;0001;pbs_mom;Job;1.sms3;Unable to send delete job request to one or more sisters
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:28;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.124:15003 could not JOIN_JOB successfully
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn3 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;ohpc3-cn4 cput= 0:00:00 mem=0kb
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.125:15003 could not JOIN_JOB successfully
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;no active tasks
09/27/2018 12:05:28;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:28;0100;pbs_mom;Job;1.sms3;Obit sent
09/27/2018 12:05:28;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:28;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:28;0080;pbs_mom;Job;1.sms3;delete job request received
09/27/2018 12:05:28;0001;pbs_mom;Job;1.sms3;Unable to send delete job request to one or more sisters
09/27/2018 12:05:28;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:28;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn3 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn4 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.124:15003 could not JOIN_JOB successfully
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.125:15003 could not JOIN_JOB successfully
09/27/2018 12:05:22;0021;pbs_mom;Job;1.sms3;rename in job_save failed
12:05:29;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;Obit sent
09/27/2018 12:05:29;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0080;pbs_mom;Job;1.sms3;delete job request received
09/27/2018 12:05:29;0001;pbs_mom;Job;1.sms3;Unable to send delete job request to one or more sisters
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:29;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 3 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.125:15003 could not JOIN_JOB successfully
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn1 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn3 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;ohpc3-cn4 cput= 0:00:00 mem=0kb
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.124:15003 could not JOIN_JOB successfully
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;no active tasks
09/27/2018 12:05:29;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:29;0100;pbs_mom;Job;1.sms3;Obit sent
09/27/2018 12:05:29;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in job_save, error on open
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;job_start_error 15010 from node 126.26.136.123:15003 could not JOIN_JOB successfully
09/27/2018 12:05:29;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/27/2018 12:05:29;0080;pbs_mom;Job;1.sms3;delete job request received
09/27/2018 12:05:29;0001;pbs_mom;Job;1.sms3;Unable to send delete job request to one or more sisters
09/27/2018 12:05:29;0008;pbs_mom;Job;1.sms3;kill_job
09/27/2018 12:05:29;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@126.26.136.121:15001
Please check the message below again. (DNS seems to be working fine, resolving both FQDNs and short names on the compute nodes; /etc/hosts might have been cached.) We will get there; something basic is hindering the process.
Please make sure:
SELinux is disabled (if you disable it now, the nodes must be rebooted)
Ports 15001 to 15007 and 17001 are open for communication within the cluster (head node to compute nodes and vice versa, and between the compute nodes)
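One way to probe those ports from any node is sketched below, assuming bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command. `port_open` and `check_pbs_ports` are hypothetical helper names. Note that "closed or filtered" can also simply mean that no daemon listens on that port on that particular host.

```shell
#!/bin/bash
# port_open HOST PORT -> exit 0 if a TCP connection succeeds within 3 seconds
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# check_pbs_ports HOST -> one status line per port in the range listed above
check_pbs_ports() {
    local host="$1" port
    for port in 15001 15002 15003 15004 15005 15006 15007 17001; do
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port closed or filtered"
        fi
    done
}
```

Running `check_pbs_ports sms3` from a compute node, and `check_pbs_ports ohpc3-cn2` from the head node and from another compute node, covers the three directions mentioned above.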
Let’s check from the basics:
for i in {1..7}; do qsub -N HOSTNAME -l select=1:ncpus=1 -l place=excl -- /bin/hostname; done
cat HOSTNAME.o* # should display all the compute node hostnames
cat pbs.sh
#!/bin/bash
env
hostname
sleep 10
chmod +x pbs.sh
for i in {1..7}; do qsub -l select=1:ncpus=1 -l place=excl -N PBS pbs.sh; done
cat PBS.o*
Let us know whether the above submissions worked without any issues and whether you were able to see the stdout and stderr files.
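To see at a glance which nodes the /bin/hostname jobs landed on, a small sketch like this may help. It assumes the output files contain only the hostname line; `hosts_seen` and `distinct_hosts` are hypothetical helper names.

```shell
#!/bin/bash
# hosts_seen PREFIX -> number of jobs per hostname found in PREFIX.o* files
hosts_seen() {
    cat "$1".o* 2>/dev/null | sort | uniq -c
}

# distinct_hosts PREFIX -> how many different hostnames were reported
distinct_hosts() {
    cat "$1".o* 2>/dev/null | sort -u | wc -l | tr -d ' '
}
```

For instance, `hosts_seen HOSTNAME` in the submission directory lists each node with the number of jobs that ran there, and `distinct_hosts HOSTNAME` gives the number of different nodes that produced output.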
Thank you for your patience, adarsh.
I’ve done what you suggested and pasted some information below. The results are still depressing…
Running ‘getenforce’ on all nodes (the master node and the slave nodes) returns ‘Disabled’.
I stopped all firewalld and iptables services. The port information for all nodes follows; I believe it matches “Ports Used by PBS in TPP Mode” in the PBS Pro Installation Guide:
Request type 54 is PBS_BATCH_CopyFiles; is this an rcp operation? And is “staged 2 items out over 0:00:02” an error message or not?
I’ve deleted ohpc3-cn1, so now ohpc3-cn[2-7] are in the cluster. I ran “for i in {1..6}; do qsub -N HOSTNAME -l select=1:ncpus=1 -l place=excl -- /bin/hostname; done”. The mom log on ohpc3-cn2 is the same as the log I got on ohpc3-cn1 yesterday. I think the problem lies in the communication between the head compute node and the sister nodes, not in a specific node such as ohpc3-cn2.
I downloaded the source code and found the message “staged 2 items out over 0:00:02” in request.c, in void req_cpyfile(struct batch_request *preq). The function comment says “process the Copy Files request from the server to dispose of output from the job. This is done by a child of MOM since it might take time”. Are there some issues with copyfile?
There are no issues with copyfile. I have used 14.1.2 and recently upgraded to 18.1.2.
It is worth upgrading to 18.1.2 (if you have any maintenance scheduled).
By default PBS Pro uses RCP for file copy,
unless your configuration contains the setting below (which makes it use SCP).
If you have common mounts across the whole PBS Pro complex, you can use the “cp” command instead.
This is configured in $PBS_HOME/mom_priv/config with the $usecp directive, for example:
$usecp admin.default.domain:/home /home
$usecp /home admin.default.domain:/home
$usecp admin:/home /home
$usecp /home admin:/home
$usecp admin:/stage /stage
$usecp /stage admin:/stage
Thanks for your advice. I will try 18.1.2 later.
But for now I should focus on solving this issue, because I think it will still exist after I change to 18.1.2.
Do you have any idea about these issues? Can I raise the log level to debug to get more information? Or can I use tcpdump to capture the packets between compute nodes?
I’ve increased the mom log level and executed the commands below (I’ve deleted ohpc3-cn1, so the list begins with ohpc3-cn2)
and then I got the following mom log:
09/28/2018 16:31:30;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at sms3:15001
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Hello from server at 126.26.136.121:15001
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.123 as authorized
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.124 as authorized
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.125 as authorized
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.126 as authorized
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.127 as authorized
09/28/2018 16:31:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 126.26.136.128 as authorized
09/28/2018 16:32:17;0100;pbs_mom;Req;;Type 1 request received from root@126.26.136.121:15001, sock=1
09/28/2018 16:32:17;0100;pbs_mom;Req;;Type 5 request received from root@126.26.136.121:15001, sock=1
09/28/2018 16:32:17;0400;pbs_mom;Node;ohpc3-cn2;implicitly added host to vmap
09/28/2018 16:32:17;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:17;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:17;0008;pbs_mom;Job;19.sms3;Started, pid = 3111
09/28/2018 16:32:19;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:19;0004;pbs_mom;Act;get_wm;libmemacct.so.1 not found
09/28/2018 16:32:19;0080;pbs_mom;Job;19.sms3;task 00000001 terminated
09/28/2018 16:32:19;0800;pbs_mom;n/a;mom_get_sample;nprocs: 666, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:19;0008;pbs_mom;Job;19.sms3;Terminated
09/28/2018 16:32:19;0100;pbs_mom;Job;19.sms3;task 00000001 cput= 0:00:00
09/28/2018 16:32:19;0008;pbs_mom;Job;19.sms3;kill_job
09/28/2018 16:32:19;0100;pbs_mom;Job;19.sms3;ohpc3-cn2 cput= 0:00:00 mem=0kb
09/28/2018 16:32:20;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:20;0008;pbs_mom;Job;19.sms3;no active tasks
09/28/2018 16:32:20;0100;pbs_mom;Job;19.sms3;Obit sent
09/28/2018 16:32:20;0100;pbs_mom;Req;;Type 54 request received from root@126.26.136.121:15001, sock=1
09/28/2018 16:32:20;0080;pbs_mom;Job;19.sms3;copy file request received
09/28/2018 16:32:23;0100;pbs_mom;Job;19.sms3;staged 2 items out over 0:00:03
09/28/2018 16:32:23;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
09/28/2018 16:32:23;0008;pbs_mom;Job;19.sms3;no active tasks
09/28/2018 16:32:23;0100;pbs_mom;Req;;Type 6 request received from root@126.26.136.121:15001, sock=1
09/28/2018 16:32:23;0080;pbs_mom;Job;19.sms3;delete job request received
09/28/2018 16:32:23;0008;pbs_mom;Job;19.sms3;kill_job
09/28/2018 16:32:23;0800;pbs_mom;n/a;mom_get_sample;nprocs: 667, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
What is the impact of the missing libmemacct.so.1?
I’ve checked the configuration on the compute nodes and on the server node (I mount the ‘/’ of all compute nodes from /opt/ohpc/admin/images/centos7.4 on the master node 126.26.136.121, so the configuration is the same on all compute nodes). My configuration doesn’t set PBS_RSHCOMMAND. Does that matter?
This can be ignored; it has no effect here (the mom is looking for SGI’s libmemacct library, which is provided by a package called memacct).
It does not.
Note: PBS_RSHCOMMAND is needed for pbs_mpirun to function correctly for users who require the use of ssh instead of rsh.