Can not find file (No such file or directory)

Hi all,
I am trying to submit jobs with qsub, but I get the following error.

/var/spool/pbs/mom_priv/jobs/198.hep-node0.SC: line 8: cd: /home/ali_0/Madgraph/MG5_aMC_v2_6_7/ttbar: No such file or directory
/var/spool/pbs/mom_priv/jobs/198.hep-node0.SC: line 9: ./run.sh: No such file or directory

The run.sh file is executable and located in the same directory from which the qsub submission is made.

Below are the contents of my submit.pbs file.
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=2gb
#PBS -l walltime=10:0:0
#PBS -S /bin/bash
cd $PBS_O_WORKDIR
./run.sh 1000 37

Thanks for the help!

Please use the absolute path to run.sh in the script and try again.
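For example, assuming run.sh lives in the directory shown in your error message, the script would look something like this:

#!/bin/bash
#PBS -l select=1:ncpus=1:mem=2gb
#PBS -l walltime=10:0:0
#PBS -S /bin/bash
# Use the absolute path instead of relying on $PBS_O_WORKDIR, which only helps
# if the submission directory is visible at the same path on the execution node.
cd /home/ali_0/Madgraph/MG5_aMC_v2_6_7/ttbar || exit 1
/home/ali_0/Madgraph/MG5_aMC_v2_6_7/ttbar/run.sh 1000 37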

Also, notice that your cd $PBS_O_WORKDIR failed. This suggests that the directory you submitted the job from is not available at the same path on the node where the job ran. Is the /home filesystem shared across your submission host and all nodes?

Alternatively, did you rename or remove the directory between the time you submitted the job and the time it ran?

Using the absolute path solved the problem for the time being. Thanks!

Dear @dtalcott, this is the first time I am trying to set up a small cluster system by myself, so I had no prior experience. However, I believe I have made great progress so far, thanks to people like you.

When I execute the following commands multiple times in sequence, I see that the jobs are distributed and running across all the nodes. So I assumed the nodes can communicate with each other without any issues, as there are no warning/error messages in the log files in the mom_logs directory.

qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000
qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000

However, when I try to run the program I plan to use for my research, I see that it runs only on the master node; no jobs are submitted to the slave nodes. So do you think the /home directory on the master node, where I submit jobs, has to be shared across all the nodes as well? I was not sure whether NFS should be set up or not.
Note: passwordless ssh login is set up, and the same user is created on all the nodes as well as the master node.

Yes. Usually, all nodes need access to the same user file systems, and NFS is a typical way to arrange that.
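A rough sketch of the usual arrangement, assuming hep-node0 serves /home to the execute nodes and your cluster subnet is 192.168.1.0/24 (adjust both for your setup):

# On the head node (hep-node0): export /home by adding a line to /etc/exports
/home   192.168.1.0/24(rw,sync,no_subtree_check)
# then re-export and make sure the NFS server is running
exportfs -ra

# On each execute node (hep-node2, hep-node5, ...): mount the export
mount -t nfs hep-node0:/home /home
# or make it permanent with an /etc/fstab entry:
# hep-node0:/home   /home   nfs   defaults   0 0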

The reason your /bin/sleep test jobs worked is that every node has its own copy of /bin/sleep as part of the O/S.

(If you don't have a shared file system, the job must copy all the files it needs to each node and copy back any result files. It's much easier to use a shared file system.)
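If you do go the copy route, qsub has -W stagein/-W stageout options for that; a rough sketch with made-up output file names (check the qsub man page for the exact syntax in your PBS version):

# stage run.sh from the head node to the execution node before the job starts,
# and copy a (hypothetical) results.tar back when it finishes
qsub -l select=1:ncpus=1 \
     -W stagein=run.sh@hep-node0:/home/ali_0/Madgraph/MG5_aMC_v2_6_7/ttbar/run.sh \
     -W stageout=results.tar@hep-node0:/home/ali_0/results.tar \
     submit.pbs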

Thanks again @dtalcott for your prompt reply. I set up NFS. It seems to be working now, but I just want to make sure. Could you please take a look at the following outputs and confirm that everything works as expected?

This is the output of the command qstat -n1:
393.hep-node0 ali_0 batch submit.psb 28062 6 12 12gb 10:00 R 00:20 hep-node0/0*2+hep-node0/1*2+hep-node5/0*2+hep-node5/1*2+hep-node2/0*2+hep-node2/1*2

The head node is called hep-node0; it also runs a MoM. Its MoM log file is as follows.

03/20/2022 05:58:11;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/20/2022 05:58:11;0100;pbs_mom;Req;;Type 3 request received from root@192.168.1.1:15001, sock=1
03/20/2022 05:58:11;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/20/2022 05:58:11;0008;pbs_mom;Job;393.hep-node0;nprocs:  230, cantstat:  0, nomem:  0, skipped:  146, cached:  0
03/20/2022 05:58:11;0008;pbs_mom;Job;393.hep-node0;Started, pid = 28062

One of the execute nodes is called hep-node2, and its log file is below.

03/20/2022 06:47:21;0008;pbs_mom;Job;392.hep-node0;JOIN_JOB as node 2
03/20/2022 06:50:46;0008;pbs_mom;Job;392.hep-node0;KILL_JOB received
03/20/2022 06:50:46;0008;pbs_mom;Job;392.hep-node0;kill_job
03/20/2022 06:50:46;0008;pbs_mom;Job;392.hep-node0;DELETE_JOB received
03/20/2022 06:50:46;0008;pbs_mom;Job;392.hep-node0;kill_job
03/20/2022 06:58:11;0008;pbs_mom;Job;393.hep-node0;JOIN_JOB as node 2

The other slave node is hep-node5. Its log file is as follows.

03/20/2022 06:47:22;0008;pbs_mom;Job;392.hep-node0;JOIN_JOB as node 1
03/20/2022 06:50:47;0008;pbs_mom;Job;392.hep-node0;KILL_JOB received
03/20/2022 06:50:47;0008;pbs_mom;Job;392.hep-node0;kill_job
03/20/2022 06:50:47;0008;pbs_mom;Job;392.hep-node0;DELETE_JOB received
03/20/2022 06:50:47;0008;pbs_mom;Job;392.hep-node0;kill_job
03/20/2022 06:58:13;0008;pbs_mom;Job;393.hep-node0;JOIN_JOB as node 1

Below is the output of the command pbsnodes -av
[root@hep-node2 ~]# pbsnodes -av

hep-node0
     Mom = hep-node0
     ntype = PBS
     state = job-busy
     pcpus = 4
     jobs = 393.hep-node0/0, 393.hep-node0/1, 393.hep-node0/2, 393.hep-node0/3
     resources_available.arch = linux
     resources_available.host = hep-node0
     resources_available.mem = 16265032kb
     resources_available.ncpus = 4
     resources_available.vnode = hep-node0
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 4194304kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 4
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Sun Mar 20 06:58:11 2022
     last_used_time = Sun Mar 20 06:50:46 2022

hep-node5
     Mom = hep-node5
     ntype = PBS
     state = job-busy
     pcpus = 4
     jobs = 393.hep-node0/0, 393.hep-node0/1, 393.hep-node0/2, 393.hep-node0/3
     resources_available.arch = linux
     resources_available.host = hep-node5
     resources_available.mem = 8007616kb
     resources_available.ncpus = 4
     resources_available.vnode = hep-node5
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 4194304kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 4
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Sun Mar 20 06:58:11 2022
     last_used_time = Sun Mar 20 06:50:46 2022

hep-node2
     Mom = hep-node2
     ntype = PBS
     state = job-busy
     pcpus = 4
     jobs = 393.hep-node0/0, 393.hep-node0/1, 393.hep-node0/2, 393.hep-node0/3
     resources_available.arch = linux
     resources_available.host = hep-node2
     resources_available.mem = 8007520kb
     resources_available.ncpus = 4
     resources_available.vnode = hep-node2
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 4194304kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 4
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Sun Mar 20 06:58:11 2022
     last_used_time = Sun Mar 20 06:50:46 2022

One thing I noticed in the log files of node2 and node5: the node2 log reads "...JOIN_JOB as node 2", while the hep-node5 log says "...JOIN_JOB as node 1", and the MoM log on node0 does not say anything similar. The outputs are different.

Is there also a dedicated command to check if the job is running on a specific slave node?

Thanks in advance,

Looking at the qstat output, you see that the job was assigned to hep-node0, hep-node5, and hep-node2 in that order. So, the job’s first node (node 0) is hep-node0. The second node (node 1) is hep-node5. The third node (node 2) is hep-node2. The job starts on hep-node0, which then tells the other two nodes to join the job. This is why you get JOIN_JOB messages only from the later nodes.

No dedicated command that I remember, but pbsnodes node_name | grep jobs will tell you which jobs are assigned to a node. However, it doesn’t tell you if there are any processes for the job currently active on the node.
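For example, using one of the node names from your pbsnodes -av output above:

pbsnodes hep-node2 | grep jobs
     jobs = 393.hep-node0/0, 393.hep-node0/1, 393.hep-node0/2, 393.hep-node0/3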

If you are running the pbs_cgroups hook, you can use it to get the PIDs of all processes associated with the job:

dtalcott@server2:~> jobid=$(echo /bin/sleep 1000 | qsub -l select=1)
dtalcott@server2:~> echo $jobid
14152.server2
dtalcott@server2:~> ssh node3 "ps -f -p \$(< /sys/fs/cgroup/systemd/pbs_jobs.service/jobid/$jobid/tasks )"
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
dtalcott 17491  2347  0 15:22 ?        Ss     0:00 -bash
dtalcott 17557 17491  0 15:22 ?        S      0:00 -bash
dtalcott 17558 17557  0 15:22 ?        S      0:00 /bin/sleep 1000