Thanks again @dtalcott for your prompt reply. I set up NFS. It seems to be working now, but I want to make sure. Could you please take a look at the following outputs and confirm that everything works as expected?
This is the output of the command qstat -n1:
393.hep-node0 ali_0 batch submit.psb 28062 6 12 12gb 10:00 R 00:20 hep-node0/0*2+hep-node0/1*2+hep-node5/0*2+hep-node5/1*2+hep-node2/0*2+hep-node2/1*2
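To read the node list in that last column more easily, here is a small Python sketch. It assumes each chunk of the exec_host string has the form host/index*ncpus, which is my understanding of the format rather than something I have confirmed. The function name `hosts_from_exec_host` is just a name I made up for illustration.

```python
import re

def hosts_from_exec_host(exec_host):
    """Sum the CPUs per host in an exec_host string.

    Assumption: every chunk looks like host/index or host/index*ncpus.
    """
    totals = {}
    for chunk in exec_host.split("+"):
        m = re.fullmatch(r"([^/]+)/(\d+)(?:\*(\d+))?", chunk.strip())
        if m is None:
            raise ValueError(f"unexpected chunk: {chunk!r}")
        host, _index, ncpus = m.groups()
        # A chunk without "*n" is counted as one CPU.
        totals[host] = totals.get(host, 0) + int(ncpus or 1)
    return totals

# The exec_host column copied from the qstat -n1 output above.
exec_host = ("hep-node0/0*2+hep-node0/1*2+hep-node5/0*2+hep-node5/1*2"
             "+hep-node2/0*2+hep-node2/1*2")
print(hosts_from_exec_host(exec_host))
```

If that reading of the format is right, the job holds 4 CPUs on each of hep-node0, hep-node5 and hep-node2.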
The head node is called hep-node0. It is also used as a MoM.
The MoM log file on hep-node0 is as follows:
03/20/2022 05:58:11;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/20/2022 05:58:11;0100;pbs_mom;Req;;Type 3 request received from root@192.168.1.1:15001, sock=1
03/20/2022 05:58:11;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/20/2022 05:58:11;0008;pbs_mom;Job;393.hep-node0;nprocs: 230, cantstat: 0, nomem: 0, skipped: 146, cached: 0
03/20/2022 05:58:11;0008;pbs_mom;Job;393.hep-node0;Started, pid = 28062
One of the execution nodes is called hep-node2. Its MoM log file is below:
03/20/2022 06:47:21;0008;pbs_mom;Job;392.hep-node0;JOIN_JOB as node 2
03/20/2022 06:50:46;0008;pbs_mom;Job;392.hep-node0;KILL_JOB received
03/20/2022 06:50:46;0008;pbs_mom;Job;392.hep-node0;kill_job
03/20/2022 06:50:46;0008;pbs_mom;Job;392.hep-node0;DELETE_JOB received
03/20/2022 06:50:46;0008;pbs_mom;Job;392.hep-node0;kill_job
03/20/2022 06:58:11;0008;pbs_mom;Job;393.hep-node0;JOIN_JOB as node 2
The other slave node is hep-node5. Its MoM log file is as follows:
03/20/2022 06:47:22;0008;pbs_mom;Job;392.hep-node0;JOIN_JOB as node 1
03/20/2022 06:50:47;0008;pbs_mom;Job;392.hep-node0;KILL_JOB received
03/20/2022 06:50:47;0008;pbs_mom;Job;392.hep-node0;kill_job
03/20/2022 06:50:47;0008;pbs_mom;Job;392.hep-node0;DELETE_JOB received
03/20/2022 06:50:47;0008;pbs_mom;Job;392.hep-node0;kill_job
03/20/2022 06:58:13;0008;pbs_mom;Job;393.hep-node0;JOIN_JOB as node 1
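To compare these logs across nodes, I used a short Python sketch that pulls the JOIN_JOB entries out of MoM log lines. It assumes the fields are separated by semicolons in the order shown above (timestamp, code, daemon, type, job ID, message). The function name `join_job_roles` is hypothetical.

```python
def join_job_roles(log_lines):
    """Map job ID -> node number from 'JOIN_JOB as node N' log entries.

    Assumption: each line has six semicolon-separated fields, with the
    job ID in the fifth and the message in the sixth.
    """
    roles = {}
    for line in log_lines:
        fields = line.strip().split(";", 5)
        if len(fields) != 6:
            continue  # not a log line in the expected layout
        job_id, message = fields[4], fields[5]
        if message.startswith("JOIN_JOB as node "):
            roles[job_id] = int(message.rsplit(" ", 1)[1])
    return roles

# Three lines copied from the hep-node5 log above.
node5_log = [
    "03/20/2022 06:47:22;0008;pbs_mom;Job;392.hep-node0;JOIN_JOB as node 1",
    "03/20/2022 06:50:47;0008;pbs_mom;Job;392.hep-node0;KILL_JOB received",
    "03/20/2022 06:58:13;0008;pbs_mom;Job;393.hep-node0;JOIN_JOB as node 1",
]
print(join_job_roles(node5_log))
```

The numbers it reports (1 for hep-node5, 2 for hep-node2) would be consistent with the order of hosts in the qstat output, though I have not confirmed that this is what the number means.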
Below is the output of the command pbsnodes -av:
[root@hep-node2 ~]# pbsnodes -av
hep-node0
Mom = hep-node0
ntype = PBS
state = job-busy
pcpus = 4
jobs = 393.hep-node0/0, 393.hep-node0/1, 393.hep-node0/2, 393.hep-node0/3
resources_available.arch = linux
resources_available.host = hep-node0
resources_available.mem = 16265032kb
resources_available.ncpus = 4
resources_available.vnode = hep-node0
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 4194304kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 4
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Mar 20 06:58:11 2022
last_used_time = Sun Mar 20 06:50:46 2022
hep-node5
Mom = hep-node5
ntype = PBS
state = job-busy
pcpus = 4
jobs = 393.hep-node0/0, 393.hep-node0/1, 393.hep-node0/2, 393.hep-node0/3
resources_available.arch = linux
resources_available.host = hep-node5
resources_available.mem = 8007616kb
resources_available.ncpus = 4
resources_available.vnode = hep-node5
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 4194304kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 4
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Mar 20 06:58:11 2022
last_used_time = Sun Mar 20 06:50:46 2022
hep-node2
Mom = hep-node2
ntype = PBS
state = job-busy
pcpus = 4
jobs = 393.hep-node0/0, 393.hep-node0/1, 393.hep-node0/2, 393.hep-node0/3
resources_available.arch = linux
resources_available.host = hep-node2
resources_available.mem = 8007520kb
resources_available.ncpus = 4
resources_available.vnode = hep-node2
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 4194304kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 4
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Mar 20 06:58:11 2022
last_used_time = Sun Mar 20 06:50:46 2022
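For now, the only way I have found to see which nodes list a given job is to read the jobs attribute in this output. Here is a minimal Python sketch of that check. It assumes a node name is any line without " = " and every other line is an attribute of that node, so the shell prompt line must be stripped first and wrapped continuation lines are not handled. The function name `nodes_running_job` is hypothetical.

```python
def nodes_running_job(pbsnodes_text, job_id):
    """Return the node names whose 'jobs' attribute mentions job_id.

    Assumption: a line without ' = ' starts a new node; every other
    line is 'key = value' for the most recent node.
    """
    nodes = {}
    current = None
    for raw in pbsnodes_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " not in line:
            current = line
            nodes[current] = {}
        elif current is not None:
            key, value = line.split(" = ", 1)
            nodes[current][key] = value
    matches = []
    for name, attrs in nodes.items():
        # Entries look like 393.hep-node0/0; drop the trailing /index.
        jobs = {j.strip().rsplit("/", 1)[0]
                for j in attrs.get("jobs", "").split(",") if j.strip()}
        if job_id in jobs:
            matches.append(name)
    return matches

# Trimmed copy of the pbsnodes -av output above (prompt line removed).
sample = """hep-node0
state = job-busy
jobs = 393.hep-node0/0, 393.hep-node0/1, 393.hep-node0/2, 393.hep-node0/3
hep-node5
state = job-busy
jobs = 393.hep-node0/0, 393.hep-node0/1, 393.hep-node0/2, 393.hep-node0/3
hep-node2
state = job-busy
jobs = 393.hep-node0/0, 393.hep-node0/1, 393.hep-node0/2, 393.hep-node0/3
"""
print(nodes_running_job(sample, "393.hep-node0"))
```

This only shows what the server has assigned to each node, not whether processes are actually running there.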
One thing I noticed in the MoM logs of node2 and node5: the log of hep-node2 reads ....JOIN_JOB as node 2 and the log of hep-node5 reads ....JOIN_JOB as node 1, but the MoM log of node0 does not contain any similar entry. The outputs are different.
Is there also a dedicated command to check if the job is running on a specific slave node?
Thanks in advance,