Hello All,
I have run into a very strange problem. When I qsub an MPI job, it shows as running, but with a time of 00:00:00. The prologue script runs, and pbsnodes shows that the job has been assigned to the nodes; however, none of the processes are running on any of the compute nodes. I know that the MPI program is good, as I can launch the MPI job on the compute nodes without going through the job scheduler. I can qsub serial jobs with no problems at all, although the serial jobs only want to run on one node. Has anybody ever run into this problem?
Thanks
adarsh
July 28, 2022, 7:14am
2
Could you please share the MPI script that is being submitted as a job?
The script is listed below. I also did a little more research: the processes are on the compute nodes, but they are all in the sleep state.
#PBS -N MyTest
#PBS -q medium
#PBS -l nodes=10:ppn=1
#PBS -j oe
#PBS -l cput=00:05:00
source /opt/intel/mpivars.sh
mpirun uname -n
Hi Adarsh,
I did some more testing. If I remove #PBS -l nodes=10:ppn=1 and change the mpirun line to the following:
mpirun -np 10 --host n01,n02 uname -n
it runs.
If I still keep the #PBS -l nodes=10:ppn=1 line out of the file and run it as follows:
mpirun -np 10 -machine myhosts uname -n
it fails: the processes go to sleep. The myhosts file just consists of
n01
n02
I even added slots:
n01:5
n02:5
Still the same result: the MPI processes sleep.
Dave
adarsh
July 29, 2022, 6:56am
5
Please check whether this works fine on its own:
mpirun -np 10 -machine hosts.txt uname -n
hosts.txt should contain 10 lines, each with the hostname of a compute node,
for example:
cat hosts.txt
node01
node01
node01
node01
node01
node01
node01
node01
node01
node01
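A hosts file like the one above does not have to be typed by hand. The following is a minimal sketch, assuming a POSIX shell; `make_hostfile` is a hypothetical helper name (not a PBS or MPI command), and `node01` is just the example hostname used above.

```shell
#!/bin/sh
# make_hostfile HOST COUNT
# Hypothetical helper (not part of PBS or MPI): prints HOST once per
# MPI rank, giving the one-hostname-per-line layout shown above.
make_hostfile() {
    host=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "$host"
        i=$((i + 1))
    done
}

# Example: ten ranks on node01, written to standard output.
make_hostfile node01 10
```

Redirecting the output (`make_hostfile node01 10 > hosts.txt`) would give a file suitable for the mpirun test.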
Please update your script as below
#PBS -N MyTest
#PBS -q medium
#PBS -l select=10:ncpus=1:mpiprocs=1
#PBS -j oe
#PBS -l walltime=00:05:00
cd $PBS_O_WORKDIR
source /opt/intel/mpivars.sh
mpirun -np 10 -machine $PBS_NODEFILE uname -n
Please note that you requested cput. If the job uses no CPU, that value will not update.
Request walltime instead, as in the script above.
Hello adarsh,
The above PBS script worked. After running it, I changed my hosts.txt file to include node01 … node10, and that worked as well. So then I moved on to the real problem and submitted a large matrix problem using all 10 nodes. My select line looked like this:
#PBS -l select=10:ncpus=20:mpiprocs=4
I am guessing the above means 10 nodes, 20 cores and 4 mpi processes?
That ran to completion. However, when I checked qstat, the time always showed 00:00:00, even though the processes were running and should have been accumulating CPU time.
Thanks,
Dave
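On the question of what the select line means: in PBS Professional, `select=N:...` requests N chunks, and the resources after the first colon apply to each chunk. So `10:ncpus=20:mpiprocs=4` is 10 chunks, each with 20 cores and 4 MPI ranks. Chunks are not guaranteed to be separate nodes unless placement says so. The sketch below only does the arithmetic, assuming a POSIX shell and a single chunk specification (no `+`-joined lists); `select_totals` is a hypothetical helper, and the defaults of 1 for missing `ncpus` and `mpiprocs` are assumptions about a typical site configuration.

```shell
#!/bin/sh
# select_totals SPEC
# Hypothetical helper: SPEC is a single-chunk select value such as
# "10:ncpus=20:mpiprocs=4". Prints chunk count, total cores, and
# total MPI ranks across all chunks.
select_totals() {
    spec=$1
    chunks=${spec%%:*}      # text before the first ":" is the chunk count
    ncpus=1                 # assumed default when ncpus is omitted
    mpiprocs=1              # assumed default when mpiprocs is omitted
    old_ifs=$IFS
    IFS=:
    for field in $spec; do  # split the spec on ":" into key=value fields
        case $field in
            ncpus=*)    ncpus=${field#ncpus=} ;;
            mpiprocs=*) mpiprocs=${field#mpiprocs=} ;;
        esac
    done
    IFS=$old_ifs
    echo "chunks=$chunks total_cores=$((chunks * ncpus)) total_mpi_ranks=$((chunks * mpiprocs))"
}

select_totals "10:ncpus=20:mpiprocs=4"
```

For the line above this reports 10 chunks, 200 cores, and 40 MPI ranks, so `$PBS_NODEFILE` would be expected to hold 40 lines.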
adarsh
August 2, 2022, 5:26pm
7
Could you please try the below:
yum install stress -y
As a standard user run the below:
qsub -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cpu 2 --timeout 10
qstat -f <jobid> | grep cput
Please try this and find out whether you get some cput utilisation (with -l cput=01:00:00 )
See also: "Job not getting distributed among nodes", post #17 by adarsh.
Hi adarsh,
Please see the output of the command below:
qstat -f 22 | grep cput
resources_used.cput = 00:00:00
Resource_List.cput = 00:10:00
adarsh
August 3, 2022, 7:49pm
9
Here is my test on one node:
[pbsdata@pbspro~]$ date;qsub -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cpu 2 --timeout 10
Wed 3 Aug 20:12:30 BST 2022
5111.uklm-pbstest
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed 3 Aug 20:12:34 BST 2022
resources_used.cput = 00:00:00
Resource_List.cput = 00:01:00
Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed 3 Aug 20:12:35 BST 2022
resources_used.cput = 00:00:00
Resource_List.cput = 00:01:00
Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed 3 Aug 20:12:36 BST 2022
resources_used.cput = 00:00:00
Resource_List.cput = 00:01:00
Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed 3 Aug 20:12:40 BST 2022
resources_used.cput = 00:00:00
Resource_List.cput = 00:01:00
Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed 3 Aug 20:12:42 BST 2022
resources_used.cput = 00:00:20
Resource_List.cput = 00:01:00
Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed 3 Aug 20:12:43 BST 2022
resources_used.cput = 00:00:20
Resource_List.cput = 00:01:00
Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ qstat
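The single-node run above is consistent with how cput is accounted: it is CPU time summed over the job's processes, so two busy stress workers running for 10 seconds should come to roughly 20 CPU-seconds, which is what resources_used.cput reports at 20:12:42. A minimal sketch of that arithmetic follows, assuming a POSIX shell; `hms_to_seconds` is a hypothetical helper, not a PBS command.

```shell
#!/bin/sh
# hms_to_seconds HH:MM:SS
# Hypothetical helper: converts a PBS-style duration to seconds.
hms_to_seconds() {
    old_ifs=$IFS
    IFS=:
    set -- $1               # split HH:MM:SS into $1 $2 $3
    IFS=$old_ifs
    # Drop one leading zero so fields like "08" are not read as octal.
    h=${1#0}; m=${2#0}; s=${3#0}
    echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
}

# Figures taken from the single-node test above.
workers=2                                # stress --cpu 2
timeout=10                               # stress --timeout 10
expected=$((workers * timeout))
observed=$(hms_to_seconds "00:00:20")    # resources_used.cput at 20:12:42
echo "expected=${expected}s observed=${observed}s"
```

The earlier samples showing 00:00:00 do not contradict this: the transcript shows the value changing only in occasional jumps rather than every second.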
Here is my test on multiple nodes:
[pbsdata@demo ~]$ cat stresscput.sh
#PBS -N stress
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -l cput=00:10:00
cd $PBS_O_WORKDIR
total_cores=$(wc -l < "$PBS_NODEFILE")
echo "total_cores=$total_cores"
total_hosts=$(sort -u "$PBS_NODEFILE" | wc -l)
echo "total_hosts=$total_hosts"
cores_per_host=$((total_cores / total_hosts))
echo "cores_per_host=$cores_per_host"
echo "running stress"
echo "/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host --timeout 60s"
/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host --timeout 60s
echo "ending stress"
[pbsdata@demo ~]$ date; qstat -answ1 ; qsub stresscput.sh
Wed 3 Aug 20:46:43 BST 2022
23420.demo
[pbsdata@demo ~]$ date ; qstat -answ1
Wed 3 Aug 20:46:47 BST 2022
demo:
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
23420.demo                     pbsdata         workq           stress          6988        2     8     -- 00:10 R 00:00 demo/0*4+cnode1/0*4
   Job run at Wed Aug 03 at 20:46 on (demo:ncpus=4)+(cnode1:ncpus=4)
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:05 BST 2022
resources_used.cput = 00:00:42
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:07 BST 2022
resources_used.cput = 00:00:42
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:08 BST 2022
resources_used.cput = 00:00:42
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:10 BST 2022
resources_used.cput = 00:00:42
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:11 BST 2022
resources_used.cput = 00:00:42
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:12 BST 2022
resources_used.cput = 00:00:42
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:13 BST 2022
resources_used.cput = 00:01:44
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:16 BST 2022
resources_used.cput = 00:01:44
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:17 BST 2022
resources_used.cput = 00:01:44
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:18 BST 2022
resources_used.cput = 00:01:44
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
23420.demo       stress           pbsdata          00:01:44 R workq
[pbsdata@demo ~]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
23420.demo       stress           pbsdata          00:01:44 R workq
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:25 BST 2022
resources_used.cput = 00:01:44
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:27 BST 2022
resources_used.cput = 00:01:44
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:29 BST 2022
resources_used.cput = 00:01:44
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:30 BST 2022
resources_used.cput = 00:01:44
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:32 BST 2022
resources_used.cput = 00:01:44
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:36 BST 2022
resources_used.cput = 00:03:12
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:37 BST 2022
resources_used.cput = 00:03:12
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:39 BST 2022
resources_used.cput = 00:03:12
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
23420.demo       stress           pbsdata          00:03:12 R workq
[pbsdata@demo ~]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
23420.demo       stress           pbsdata          00:03:12 R workq
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:47 BST 2022
resources_used.cput = 00:07:51
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:48 BST 2022
resources_used.cput = 00:07:51
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:50 BST 2022
resources_used.cput = 00:07:51
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:52 BST 2022
resources_used.cput = 00:07:51
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:53 BST 2022
resources_used.cput = 00:07:51
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:47:54 BST 2022
resources_used.cput = 00:07:51
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ qstat
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed 3 Aug 20:48:05 BST 2022
resources_used.cput = 00:07:51
Resource_List.cput = 00:10:00
Submit_arguments = stresscput.sh
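The node-file arithmetic in stresscput.sh can be checked outside a job with a mock file. The sketch below assumes a POSIX shell; `summarize_nodefile` is a hypothetical helper, and the mock contents are an assumption about what `$PBS_NODEFILE` holds for `select=2:ncpus=4:mpiprocs=4` on hosts demo and cnode1 (one line per MPI rank). It counts unique hosts with `sort -u`, which also handles node files where a host's entries are not adjacent.

```shell
#!/bin/sh
# summarize_nodefile FILE
# Hypothetical helper: prints the three values that stresscput.sh
# derives from the node file (one hostname per MPI rank).
summarize_nodefile() {
    total_cores=$(wc -l < "$1" | tr -d ' ')
    total_hosts=$(sort -u "$1" | wc -l | tr -d ' ')
    echo "total_cores=$total_cores total_hosts=$total_hosts cores_per_host=$((total_cores / total_hosts))"
}

# Mock node file standing in for $PBS_NODEFILE (assumed contents).
# Inside a real job, pass "$PBS_NODEFILE" instead.
mock=$(mktemp)
printf '%s\n' demo demo demo demo cnode1 cnode1 cnode1 cnode1 > "$mock"
summarize_nodefile "$mock"
rm -f "$mock"
```

With the mock file this reports 8 cores over 2 hosts, 4 per host, matching the `demo/0*4+cnode1/0*4` placement shown by qstat.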
Please see the PBS Professional User’s Guide, section 4.3, “Requesting Resources”, p. UG-53, especially section 4.3.3, “Requesting Resources in Chunks”.