Scheduler doesn't seem to be holding jobs

I have a cluster, 28 nodes, 40 cores per node for a total of 1120 cores.

User 1 submits a job that asks for 1000 cores.
User 2 submits a job that asks for 400 cores while User 1's job is running.
User 2's job doesn't get queued up; instead it tries to run and aborts.
Am I missing a setting, or should PBS queue up the job until resources are available?
Thanks.

Are you sure that hyper-threading (HT) is not enabled on these nodes?
Please share the below:

  1. job script or qsub statement used by User1 and User2
  2. qstat -fx <jobid of user1> and qstat -fx <jobid of user2>
  3. pbsnodes -aSj command output
  4. Submit User1 job and then when it is running, run the below scripts and share the output
    pbsnodes -av | grep resources_available.ncpus | cut -d'=' -f2 | awk '{ sum+=$1 } END { print sum }'
    pbsnodes -av | grep resources_assigned.ncpus | cut -d'=' -f2 | awk '{ sum+=$1 } END { print sum }'
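
For reference (illustrative numbers only, not actual output from this cluster): on a 28-node, 40-core cluster the first sum should come out to 1120 if hyper-threading is off, and roughly double that if it is on, while the second sum should roughly match the cores assigned to User1's running job, e.g.:

    pbsnodes -av | grep resources_available.ncpus | cut -d'=' -f2 | awk '{ sum+=$1 } END { print sum }'
    1120
    pbsnodes -av | grep resources_assigned.ncpus | cut -d'=' -f2 | awk '{ sum+=$1 } END { print sum }'
    1000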

Neither of the users is around today, so I can't get them to submit their jobs, but I noticed something in user2's job that looks odd:

#PBS -l select=20:ncpus=40:mpiprocs=40,walltime=1:00:00

User1 runs jobs at any time just fine, but I don't know where his job submission script is. User2's job runs fine if no one else is running, but if he submits it while others are running, it immediately aborts. User1 is using 1000 of the 1120 cores, so User2's job should be held, but I don't think he has constructed his submission script correctly.

Thank you. Please share the information whenever possible.

The job request is correct, user2 is requesting 800 cores with a job walltime of 1 hour.

Could you please share the PBS Pro OSS version you are running? (e.g., qstat --version)

Here is the version:

[ramos@sandy1 ~] qstat --version
pbs_version = 14.1.2
[ramos@sandy1 ~]

I was thinking the line should be:

#PBS -l select=20:ppn=40,walltime=1:00:00

Thank you

ppn=40 is old syntax that was used in PBS Pro version 9 and earlier.
select and ncpus are correct here.

PBS converts old-style resource requests to select and place statements.
See the 19.2.1 UG, section 4.8.3, “Conversion of Old Style to New”, on page UG-72
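
As a rough illustration of that conversion (the exact rules are in the UG section above), an old-style request such as

    qsub -l nodes=20:ppn=40 -l walltime=1:00:00 job.sh

would be converted by PBS into something along the lines of

    qsub -l select=20:ncpus=40:mpiprocs=40 -l place=scatter -l walltime=1:00:00 job.sh

so the line user2 is using is effectively already the new form of that request.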

Thank you. I want to rule out a format issue. It's still the weekend, so no users are in, but here is a job that immediately fails:

more job-15170
06/14/2019 15:26:34 L Considering job to run
06/14/2019 15:26:34 S Job Queued at request of viner@sandy1.local, owner = viner@sandy1.local, job name = ColdForecast.20, queue = workq
06/14/2019 15:26:34 S Job Run at request of Scheduler@smaster1 on exec_vnode (compute-0-13:ncpus=40)+(compute-0-14:ncpus=40)+(compute-0-15:ncpus=40)+(compute-0-16:ncpus=40)+(compute-0-17:ncpus=40)+(compute-0-8:ncpus=40)+(compute-0-18:ncpus=40)+(compute-0-19:ncpus=40)+(compute-0-20:ncpus=40)+(compute-0-21:ncpus=40)
06/14/2019 15:26:34 S Job Modified at request of Scheduler@smaster1
06/14/2019 15:26:34 L Job run
06/14/2019 15:26:34 S enqueuing into workq, state 1 hop 1
06/14/2019 15:26:34 S Obit received momhop:1 serverhop:1 state:4 substate:42
06/14/2019 15:26:34 A queue=workq
06/14/2019 15:26:34 A user=viner group=viner account="None" project=_pbs_project_default jobname=ColdForecast.20 queue=workq ctime=1560551194 qtime=1560551194 etime=1560551194 start=1560551194 exec_host=compute-0-13/0*40+compute-0-14/0*40+compute-0-15/0*40+compute-0-16/0*40+compute-0-17/0*40+compute-0-8/0*40+compute-0-18/0*40+compute-0-19/0*40+compute-0-20/0*40+compute-0-21/0*40 exec_vnode=(compute-0-13:ncpus=40)+(compute-0-14:ncpus=40)+(compute-0-15:ncpus=40)+(compute-0-16:ncpus=40)+(compute-0-17:ncpus=40)+(compute-0-8:ncpus=40)+(compute-0-18:ncpus=40)+(compute-0-19:ncpus=40)+(compute-0-20:ncpus=40)+(compute-0-21:ncpus=40) Resource_List.mpiprocs=400 Resource_List.ncpus=400 Resource_List.nodect=10 Resource_List.place=free Resource_List.select=10:ncpus=40:mpiprocs=40 Resource_List.walltime=01:00:00 resource_assigned.ncpus=400
06/14/2019 15:26:35 S Exit_status=1 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=412kb resources_used.ncpus=400 resources_used.vmem=12864kb resources_used.walltime=00:00:00
06/14/2019 15:26:35 A user=viner group=viner account="None" project=_pbs_project_default jobname=ColdForecast.20 queue=workq ctime=1560551194 qtime=1560551194 etime=1560551194 start=1560551194 exec_host=compute-0-13/0*40+compute-0-14/0*40+compute-0-15/0*40+compute-0-16/0*40+compute-0-17/0*40+compute-0-8/0*40+compute-0-18/0*40+compute-0-19/0*40+compute-0-20/0*40+compute-0-21/0*40 exec_vnode=(compute-0-13:ncpus=40)+(compute-0-14:ncpus=40)+(compute-0-15:ncpus=40)+(compute-0-16:ncpus=40)+(compute-0-17:ncpus=40)+(compute-0-8:ncpus=40)+(compute-0-18:ncpus=40)+(compute-0-19:ncpus=40)+(compute-0-20:ncpus=40)+(compute-0-21:ncpus=40) Resource_List.mpiprocs=400 Resource_List.ncpus=400 Resource_List.nodect=10 Resource_List.place=free Resource_List.select=10:ncpus=40:mpiprocs=40 Resource_List.walltime=01:00:00 session=121410 end=1560551195 Exit_status=1 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=412kb resources_used.ncpus=400 resources_used.vmem=12864kb resources_used.walltime=00:00:00 run_count=1

[root@smaster1 tmp]#

Please note that the exit status is 1 (which means the application/batch command that the user intended to run via the PBS script failed), hence the job has exited. In this scenario, could you please test whether you can run the user's script without PBS Pro on the compute nodes and check whether it executes successfully?
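
For example, something along these lines (the script name here is a placeholder; note that PBS variables such as PBS_NODEFILE will not be set outside PBS, so any MPI launch line in the script may need a host list supplied by hand):

    ssh compute-0-13
    bash ./user2_job_script.sh    # placeholder for the user's actual script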

Thank you for sharing the information.

You can print the script the user has submitted by running the below command as the root user:
printjob -s <jobid>
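
For example, for the failing job shown earlier in this thread:

    printjob -s 15170.smaster1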

The script will run if it is the only job running; that is the root of the problem. It runs fine by itself, but not if the other user's job is running. Here is the MoM log from one of the compute nodes:

06/14/2019 15:26:34;0008;pbs_mom;Job;15170.smaster1;nprocs: 488, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
06/14/2019 15:26:34;0008;pbs_mom;Job;15170.smaster1;Started, pid = 121410
06/14/2019 15:26:34;0080;pbs_mom;Job;15170.smaster1;task 00000001 terminated
06/14/2019 15:26:34;0008;pbs_mom;Job;15170.smaster1;Terminated
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;task 00000001 cput= 0:00:00
06/14/2019 15:26:34;0008;pbs_mom;Job;15170.smaster1;kill_job
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;compute-0-13 cput= 0:00:00 mem=412kb
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;compute-0-14.local cput= 0:00:00 mem=0kb
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;compute-0-15.local cput= 0:00:00 mem=0kb
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;compute-0-16.local cput= 0:00:00 mem=0kb
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;compute-0-17.local cput= 0:00:00 mem=0kb
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;compute-0-8.local cput= 0:00:00 mem=0kb
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;compute-0-18.local cput= 0:00:00 mem=0kb
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;compute-0-19.local cput= 0:00:00 mem=0kb
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;compute-0-20.local cput= 0:00:00 mem=0kb
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;compute-0-21.local cput= 0:00:00 mem=0kb
06/14/2019 15:26:34;0008;pbs_mom;Job;15170.smaster1;no active tasks
06/14/2019 15:26:34;0100;pbs_mom;Job;15170.smaster1;Obit sent

Thank you for the MoM logs. They do not say much, as the job is not using any cput or walltime and is exiting immediately with Exit_status = 1.
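
One quick check worth making (assuming default output handling, with file names taken from the job name and id in the logs above) is the job's stdout/stderr files, which are copied back to the submission directory when the job ends and often show why the application exited with status 1:

    cat ColdForecast.20.o15170    # job stdout
    cat ColdForecast.20.e15170    # job stderr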

Could you please submit the below and share the output:

qsub -l select=20:ncpus=40    --  /bin/sleep 10000
qsub -l select=20:ncpus=40    --  /bin/sleep 10000
qsub -l select=20:ncpus=40    --  /bin/sleep 10000
qstat -answ1
pbsnodes -aSj

If this fails, then please share the pbs_diag output with us by running $PBS_EXEC/unsupported/pbs_diag -j <failed jobid>; the output of this command is a tar.gz file stored in the /root/ folder.

This ran as expected:

15223.smaster1 STDIN ramos 00:00:00 R workq
15224.smaster1 STDIN ramos 0 Q workq
15225.smaster1 STDIN ramos 0 Q workq

Thank you. There are no issues with the cluster configuration. The problem is with the job script the user is submitting, which is failing instantly. If you can share the failing job script (with anything classified removed), we can check it.
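
For reference, a minimal script matching the resource request seen in the logs would look something like the one below; the application launch line is only a placeholder for whatever the user actually runs:

    #!/bin/bash
    #PBS -N ColdForecast.20
    #PBS -q workq
    #PBS -l select=10:ncpus=40:mpiprocs=40
    #PBS -l walltime=01:00:00

    cd $PBS_O_WORKDIR
    mpirun -np 400 ./coldforecast_app    # placeholder application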