Current Torque user, looking at using PBSPro as part of the OpenHPC stack on a new HPC cluster. I have PBSPro 14.1.2 installed on the head node, with 4 separate compute nodes.
I don’t typically run parallel jobs; instead I run many separate Monte Carlo iterations concurrently, one job per core. When I try to run jobs this way with PBSPro, I am having a problem. Everything appears OK in the monitoring tools (please see output below): the queue is receiving jobs, jobs are running, each job appears to be allocated a separate CPU, output is produced, etc. However, my job sets are taking a very long time to run. Looking into the long run times, I realized that if I ssh to a compute node and monitor CPU usage with top, all of the jobs are running on only 1 CPU, effectively serially (all other CPUs sit at 0%).
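For what it’s worth, a quick way to see which processor each task is on (beyond watching top) is the psr column from ps; mc_sim below is just a stand-in for my actual Monte Carlo binary:

    # PID, last-run processor, %CPU, and command name for each MC process
    ps -eo pid,psr,pcpu,comm | grep mc_sim

Every process reports the same processor number, consistent with what top shows.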
I’m not sure whether this is a problem with my PBS setup or with the way I am requesting resources. I have tried many combinations, but currently I have settled on #PBS -l select=1:ncpus=1 for each job. I have also tried adding -l place=scatter, thinking it could be a placement issue, but that didn’t help. I’m confused, because PBSPro apparently sees all 48 CPUs per node and reports that it has allocated 12 CPUs (1 for each of the 12 jobs in this example case), but is really only running the jobs on 1 CPU.
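For reference, a stripped-down version of my job script looks like this (mc_job, mc_sim, and the input handling are placeholders for my actual names):

    #!/bin/bash
    #PBS -N mc_job
    #PBS -l select=1:ncpus=1
    #PBS -j oe

    # each job is a single-threaded Monte Carlo run
    cd $PBS_O_WORKDIR
    ./mc_sim "$INPUT"    # INPUT is passed in via qsub -v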
The following outputs are from a case where I am trying to run 12 of these 1-core jobs concurrently on one node, using the default workq.
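I submit them with a simple loop along these lines (again, file names are placeholders):

    for i in $(seq 1 12); do
        qsub -v INPUT=case_${i}.dat mc_job.pbs
    done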
Output of qstat -Q:
Queue            Max   Tot Ena Str   Que   Run   Hld   Wat   Trn   Ext Type
---------------- ----- ----- --- --- ----- ----- ----- ----- ----- ----- ----
workq                0    12 yes yes     0    12     0     0     0     0 Exec
Output of qstat -fB:
Server: athena
    server_state = Active
    server_host = athena
    scheduling = True
    total_jobs = 5862
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:12 Exiting:0 Begun:0
    default_queue = workq
    log_events = 511
    mail_from = adm
    query_other_jobs = True
    resources_default.ncpus = 1
    resources_default.place = scatter
    default_chunk.ncpus = 1
    resources_assigned.ncpus = 12
    resources_assigned.nodect = 12
    scheduler_iteration = 600
    FLicenses = 2000000
    resv_enable = True
    node_fail_requeue = 310
    max_array_size = 10000
    default_qsub_arguments = -V
    pbs_license_min = 0
    pbs_license_max = 2147483647
    pbs_license_linger_time = 31536000
    license_count = Avail_Global:1000000 Avail_Local:1000000 Used:0 High_Use:0 Avail_Sockets:1000000 Unused_Sockets:1000000
    pbs_version = 14.1.2
    eligible_time_enable = False
    job_history_enable = True
    max_concurrent_provision = 5
Could this be related to my resource/vnode definitions? I simply created each node with “create node n001” (and so on for each node) in qmgr and let PBS detect the resources.
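In other words, the entire node setup was just this, run on the server host, with no explicit resources_available settings of my own:

    qmgr -c "create node n001"    # repeated for n002 through n004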
Output of pbsnodes n001:
n001
     Mom = n001.localdomain
     Port = 15002
     pbs_version = 14.1.2
     ntype = PBS
     state = free
     pcpus = 48
     jobs = 5891.athena/0, 5892.athena/1, 5893.athena/2, 5894.athena/3, 5895.athena/4, 5896.athena/5, 5897.athena/6, 5898.athena/7, 5899.athena/8, 5900.athena/9, 5901.athena/10, 5902.athena/11
     resources_available.arch = linux
     resources_available.host = n001
     resources_available.mem = 528278028kb
     resources_available.ncpus = 48
     resources_available.vnode = n001
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 12
     resources_assigned.netwins = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
I would be grateful for any insight into this. Thanks.