I have a cluster with 28 nodes and 40 cores per node, for a total of 1120 cores.
User 1 submits a job that asks for 1000 cores.
User 2 submits a job that asks for 400 cores while User 1's job is running.
User 2's job doesn't get queued; instead it tries to run and aborts.
Am I missing a setting or should PBS queue up the job until resources are available?
Thanks.
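For reference, the arithmetic behind the expected behavior, as a minimal shell sketch (an illustration of the scheduling decision, not PBS internals):

```shell
#!/bin/sh
# Expected scheduling arithmetic for the scenario above.
total=$((28 * 40))           # 1120 cores in the cluster
running=1000                 # cores held by User 1's job
request=400                  # cores requested by User 2's job

free=$((total - running))    # 120 cores free
if [ "$request" -le "$free" ]; then
    echo "job can run"
else
    echo "job should be queued"   # 400 > 120, so PBS should queue it
fi
```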
Are you sure that HT (hyper-threading) is not enabled on these nodes?
Please share the following:
the job script or qsub statement used by User1 and User2
qstat -fx <jobid of user1> and qstat -fx <jobid of user2>
the output of the pbsnodes -aSj command
Submit User1's job and then, while it is running, run the scripts below and share the output:
pbsnodes -av | grep resources_available.ncpus | cut -d'=' -f2 | awk '{ sum+=$1 } END { print sum }'
pbsnodes -av | grep resources_assigned.ncpus | cut -d'=' -f2 | awk '{ sum+=$1 } END { print sum }'
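The two pipelines above can also be combined to report free cores directly. A sketch, with a mock function standing in for `pbsnodes -av` so it runs without a live PBS server (on the cluster, substitute the real command):

```shell
#!/bin/sh
# Sketch: free ncpus = total available - total assigned across all nodes.
# pbsnodes_mock is a stand-in for `pbsnodes -av`; its values are invented
# for illustration (two nodes, 40 cores each, 50 cores assigned in total).
pbsnodes_mock() {
    printf '     resources_available.ncpus = 40\n'
    printf '     resources_assigned.ncpus = 40\n'
    printf '     resources_available.ncpus = 40\n'
    printf '     resources_assigned.ncpus = 10\n'
}

avail=$(pbsnodes_mock | grep resources_available.ncpus | cut -d'=' -f2 | awk '{ sum+=$1 } END { print sum }')
used=$(pbsnodes_mock | grep resources_assigned.ncpus | cut -d'=' -f2 | awk '{ sum+=$1 } END { print sum }')
echo "free ncpus: $((avail - used))"   # 80 - 50 = 30
```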
Users run jobs at any time just fine. I don't know where his job submission script is. User2 can run fine if no one else is running, but if User2 submits his job while others are running, it immediately aborts. User1 is using 1000 of the 1120 cores, so User2's job should be held, but I don't think he has constructed his submission script correctly.
ppn=40 is old syntax that was used in PBS Pro version 9 and earlier.
select and ncpus are the correct resources here.
PBS converts old-style resource requests to select and place statements.
See the PBS Pro 19.2.1 User's Guide, section 4.8.3, "Conversion of Old Style to New", on page UG-72.
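Roughly, the conversion for the simple nodes/ppn case looks like this. A sketch covering only `nodes=X:ppn=Y`; PBS's actual conversion rules (the UG section cited above) handle many more cases:

```shell
#!/bin/sh
# Illustrative conversion of an old-style resource request to select syntax.
# The values (25 nodes, 40 ppn) are hypothetical.
old="nodes=25:ppn=40"
nodes=$(echo "$old" | sed 's/.*nodes=\([0-9]*\).*/\1/')
ppn=$(echo "$old" | sed 's/.*ppn=\([0-9]*\).*/\1/')
new="select=${nodes}:ncpus=${ppn}"
echo "qsub -l $new ..."   # qsub -l select=25:ncpus=40 ...
```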
Please note that the exit status is 1, which means the application command that the user intended to run via the PBS script failed, hence the job has exited. In this scenario, could you please test whether you can run the user's script on the compute nodes without using PBS Pro, and check whether it executes successfully?
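One way to isolate this is to wrap the application command in a minimal harness and run it directly on a compute node, checking the exit status. A sketch; `my_app` is a placeholder for the user's actual command:

```shell
#!/bin/sh
# Minimal harness to check whether the application fails outside of PBS.
# my_app is a placeholder; replace its body with the real application command.
my_app() { true; }

my_app
status=$?
if [ "$status" -ne 0 ]; then
    echo "application failed with exit status $status"
else
    echo "application succeeded"
fi
```

If this harness succeeds on the node but the same command exits 1 under PBS, the problem is environmental (paths, modules, licenses) rather than the application itself.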
Thank you for sharing the information.
You can print the script the user submitted by running the command below as the root user: printjob -s <jobid>
The script will run if it is the only job running; that is the root of the problem. The script runs by itself, but not if the other user's job is running. Here is the MoM log of one of the compute nodes:
If this fails, then please share with us the pbs_diag output by running $PBS_EXEC/unsupported/pbs_diag -j <failed jobid>; the output of this command is a tar.gz file stored in the /root/ folder.
Thank you. There are no issues with the cluster configuration; the problem is the job script the user is submitting, which is failing instantly. If you can share the job script that is failing immediately (with any classified information removed), we can check it.