MPI job shows running but with 00:00:00 time

Hello All,

I have run into a very strange problem. When I qsub an MPI job, it shows as running, but with a time of 00:00:00. The prologue script runs and pbsnodes shows that the job has been assigned to the nodes; however, none of the processes are running on any compute nodes. I know that the MPI program is good, as I can run the MPI job on the compute nodes without going through the job scheduler. I can qsub serial jobs with no problems at all, although the serial jobs only want to run on one node. Has anybody ever run into this problem?
Thanks

Could you please share the MPI script that is being submitted as a job?

Listed below is the script. I also did a little more research: the processes are on the compute nodes, and they are all in the sleep state.

#PBS -N MyTest
#PBS -q medium
#PBS -l nodes=10;ppn=1
#PBS -j oe
#PBS -l cput=00:05:00

source /opt/intel/mpivars.sh

mpirun uname -n

Hi Adarsh,

I did some more testing. If I remove #PBS -l nodes=10;ppn=1 and change the mpirun line to the following
mpirun -np 10 --host n01,n02 uname -n
it runs.
If I still keep the #PBS -l nodes=10;ppn=1 line out of the file and run it like this
mpirun -np 10 -machine myhosts uname -n
it fails: the processes go to sleep. The myhosts file just consists of
n01
n02
I even added slots:
n01:5
n02:5
Still the same result: the MPI processes sleep.

Dave

Please check whether this works fine on its own
mpirun -np 10 -machine hosts.txt uname -n

hosts.txt should contain 10 lines, each with the hostname of a compute node,
for example:
cat hosts.txt
node01
node01
node01
node01
node01
node01
node01
node01
node01
node01
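
If typing the file by hand is tedious, a one-slot-per-line host file like the one above can be generated. This is only a sketch: make_hostfile is a made-up helper name, and node01 and the count of 10 are taken from the example, so adjust them to your own nodes.

```shell
# Hypothetical helper: print HOST on COUNT separate lines,
# i.e. one line per MPI process to be started on that host.
make_hostfile() {
    host=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "$host"
        i=$((i + 1))
    done
}

# Example: write 10 lines of "node01" to hosts.txt
# make_hostfile node01 10 > hosts.txt
```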

Please update your script as below

#PBS -N MyTest
#PBS -q medium
#PBS -l select=10:ncpus=1:mpiprocs=1
#PBS -j oe
#PBS -l walltime=00:05:00
cd $PBS_O_WORKDIR
source /opt/intel/mpivars.sh
mpirun -np 10 -machine $PBS_NODEFILE uname -n

Please note that you requested cput; if there is no CPU utilisation, the reported time will not update.
Request walltime instead, as in the script above.
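
To see the difference while a job runs, it can help to look at the used CPU time and the used walltime side by side. The sketch below only parses `qstat -f` text from standard input; used_resource is a made-up helper name, and the job ID 1234 in the usage comment is a placeholder.

```shell
# Hypothetical helper: read `qstat -f` output on stdin and print the
# value of resources_used.<name>, for example cput or walltime.
used_resource() {
    awk -v key="resources_used.$1" '$1 == key && $2 == "=" { print $3; exit }'
}

# Usage (1234 is a placeholder job ID):
#   qstat -f 1234 | used_resource cput
#   qstat -f 1234 | used_resource walltime
```

If walltime keeps advancing while cput stays at 00:00:00, the job is running but its processes are not using CPU, which would match processes sitting in the sleep state.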

Hello adarsh,

The above PBS script worked. After running it, I changed my hosts.txt file to include node01 … node10, and that worked as well. So then I moved on to the real problem and submitted a large matrix problem using all 10 nodes. My select line looked like this:
#PBS -l select=10:ncpus=20:mpiprocs=4
I am guessing the above means 10 nodes, 20 cores and 4 mpi processes?

That ran to completion. However, when checking qstat, my time always showed 00:00:00, even though the processes were running and should have been accumulating CPU time.

Thanks,

Dave
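
On the select question: as far as I understand the syntax, the number right after select= is the number of chunks, and the resources that follow apply to each chunk. So that line asks for 10 chunks, each with 20 cores and 4 MPI processes, not 4 MPI processes in total. Whether each chunk lands on its own node depends on placement and node size. A small sketch of the arithmetic (select_totals is a made-up helper, not a PBS command):

```shell
# Hypothetical helper: given chunk count, ncpus per chunk and
# mpiprocs per chunk, print the totals the request adds up to.
select_totals() {
    chunks=$1
    ncpus=$2
    mpiprocs=$3
    echo "chunks=$chunks total_ncpus=$((chunks * ncpus)) total_mpi_ranks=$((chunks * mpiprocs))"
}

# Numbers from select=10:ncpus=20:mpiprocs=4
select_totals 10 20 4
# prints: chunks=10 total_ncpus=200 total_mpi_ranks=40
```

With those numbers, $PBS_NODEFILE would be expected to contain 40 lines, which is easy to confirm inside the job with wc -l.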

Could you please try the below:

  1. yum install stress -y
  2. As a standard user run the below:
qsub -l select=1:ncpus=1 -l cput=00:01:00  -- /bin/stress --cpu 2 --timeout 10
qstat -f <jobid> | grep cput
  3. Please try this and find out whether you get some cput utilisation (with -l cput=01:00:00)
    Job not getting distributed among nodes - #17 by adarsh

Hi adarsh,

Please see the output for the command below:

qstat -f 22 | grep cput

resources_used.cput = 00:00:00
Resource_List.cput = 00:10:00

Please find my test on one node:

[pbsdata@pbspro~]$ date;qsub -l select=1:ncpus=1 -l cput=00:01:00  -- /bin/stress --cpu 2 --timeout 10
Wed  3 Aug 20:12:30 BST 2022
5111.uklm-pbstest
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed  3 Aug 20:12:34 BST 2022
    resources_used.cput = 00:00:00
    Resource_List.cput = 00:01:00
    Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed  3 Aug 20:12:35 BST 2022
    resources_used.cput = 00:00:00
    Resource_List.cput = 00:01:00
    Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed  3 Aug 20:12:36 BST 2022
    resources_used.cput = 00:00:00
    Resource_List.cput = 00:01:00
    Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed  3 Aug 20:12:40 BST 2022
    resources_used.cput = 00:00:00
    Resource_List.cput = 00:01:00
    Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed  3 Aug 20:12:42 BST 2022
    resources_used.cput = 00:00:20
    Resource_List.cput = 00:01:00
    Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ date;qstat -fx 5111 | grep cput
Wed  3 Aug 20:12:43 BST 2022
    resources_used.cput = 00:00:20
    Resource_List.cput = 00:01:00
    Submit_arguments = -l select=1:ncpus=1 -l cput=00:01:00 -- /bin/stress --cp
[pbsdata@pbspro~]$ qstat

Please find my test on multiple nodes:

[pbsdata@demo ~]$ cat stresscput.sh 
#PBS -N stress
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -l cput=00:10:00

cd $PBS_O_WORKDIR

total_cores=`cat $PBS_NODEFILE | wc -l `
echo "total_cores=$total_cores"

total_hosts=`cat $PBS_NODEFILE | uniq | wc -l`
echo "total_hosts=$total_hosts"

cores_per_host=$((total_cores / total_hosts))
echo "cores_per_host=$cores_per_host"



echo "running stress"
echo "/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host --timeout 60s"
/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host  --timeout 60s
echo "ending stress"

[pbsdata@demo ~]$ date; qstat -answ1 ; qsub stresscput.sh
Wed  3 Aug 20:46:43 BST 2022
23420.demo
[pbsdata@demo ~]$ date ; qstat -answ1
Wed  3 Aug 20:46:47 BST 2022

demo: 
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
23420.demo                     pbsdata         workq           stress              6988    2     8    --  00:10 R 00:00 demo/0*4+cnode1/0*4
   Job run at Wed Aug 03 at 20:46 on (demo:ncpus=4)+(cnode1:ncpus=4)
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:05 BST 2022
    resources_used.cput = 00:00:42
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:07 BST 2022
    resources_used.cput = 00:00:42
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:08 BST 2022
    resources_used.cput = 00:00:42
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:10 BST 2022
    resources_used.cput = 00:00:42
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:11 BST 2022
    resources_used.cput = 00:00:42
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:12 BST 2022
    resources_used.cput = 00:00:42
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:13 BST 2022
    resources_used.cput = 00:01:44
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:16 BST 2022
    resources_used.cput = 00:01:44
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:17 BST 2022
    resources_used.cput = 00:01:44
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:18 BST 2022
    resources_used.cput = 00:01:44
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
23420.demo        stress           pbsdata           00:01:44 R workq           
[pbsdata@demo ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
23420.demo        stress           pbsdata           00:01:44 R workq           
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:25 BST 2022
    resources_used.cput = 00:01:44
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:27 BST 2022
    resources_used.cput = 00:01:44
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:29 BST 2022
    resources_used.cput = 00:01:44
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:30 BST 2022
    resources_used.cput = 00:01:44
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:32 BST 2022
    resources_used.cput = 00:01:44
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:36 BST 2022
    resources_used.cput = 00:03:12
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:37 BST 2022
    resources_used.cput = 00:03:12
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:39 BST 2022
    resources_used.cput = 00:03:12
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
23420.demo        stress           pbsdata           00:03:12 R workq           
[pbsdata@demo ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
23420.demo        stress           pbsdata           00:03:12 R workq           
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:47 BST 2022
    resources_used.cput = 00:07:51
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:48 BST 2022
    resources_used.cput = 00:07:51
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:50 BST 2022
    resources_used.cput = 00:07:51
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:52 BST 2022
    resources_used.cput = 00:07:51
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:53 BST 2022
    resources_used.cput = 00:07:51
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:47:54 BST 2022
    resources_used.cput = 00:07:51
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
[pbsdata@demo ~]$ qstat
[pbsdata@demo ~]$ date; qstat -fx 23420 | grep -i cput
Wed  3 Aug 20:48:05 BST 2022
    resources_used.cput = 00:07:51
    Resource_List.cput = 00:10:00
    Submit_arguments = stresscput.sh
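
As a rough sanity check on the final figure: the job holds 8 cores (2 chunks of 4 ncpus) and each stress run lasts 60 seconds, so if the workers are confined to those cores, CPU time should top out near 8 x 60 = 480 seconds. The helper name hms_to_seconds below is made up for illustration; it only converts the HH:MM:SS values that qstat prints into seconds so they can be compared.

```shell
# Hypothetical helper: convert HH:MM:SS to a number of seconds.
# awk treats "07" and "08" as decimal, so leading zeros are safe.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

used=$(hms_to_seconds 00:07:51)   # final resources_used.cput from the run above
ceiling=$((8 * 60))               # 8 cores x 60 s, assuming no oversubscription
echo "used=${used}s ceiling=${ceiling}s"
# prints: used=471s ceiling=480s
```

The reported 00:07:51 is 471 seconds, just under that ceiling. The value also moves in steps (00:00:42, 00:01:44, 00:03:12, 00:07:51) rather than continuously, which suggests usage is reported periodically.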

Please see the PBS Professional User’s Guide, section 4.3, “Requesting Resources”, p. UG-53, especially section 4.3.3, “Requesting Resources in Chunks”.