Hello,
I'm a current Torque user looking at using PBSPro as part of the OpenHPC stack on a new HPC system. I have PBSPro 14.1.2 installed on the head node, with 4 separate compute nodes.
I don’t typically run parallel (multi-core) jobs; instead I run many separate Monte Carlo iterations concurrently, one job per core. When attempting to run jobs this way with PBSPro, I am having a problem. Everything appears OK in the monitoring tools (please see output below): the queue is receiving jobs, jobs are running, each job appears to be allocated to a separate CPU, output is produced, etc. However, my job sets are taking a very long time to finish. Looking into the long run time, I realized that if I ssh to a compute node and use the top command to monitor CPU usage, all of the jobs are actually running on only 1 CPU, effectively serially (all the other CPUs sit at 0%).
I’m not sure whether this is a problem with my PBS setup or with the way I am requesting resources. I have tried many combinations, but currently I have settled on #PBS -l select=1:ncpus=1 to run jobs. I have also tried adding -l place=scatter, thinking it could be a placement issue, but that didn’t help. I’m confused because PBSPro seemingly sees all 48 CPUs per node and reports that it has allocated 12 CPUs (1 for each of the 12 jobs in this example case), but is really only running the jobs on 1 CPU.
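For reference, here is a stripped-down sketch of the kind of job script I am submitting (call it run_mc.sh; the job name, executable and file names are just placeholders, but the #PBS resource requests are exactly the ones I mentioned above):

#!/bin/bash
#PBS -N mc_run
#PBS -l select=1:ncpus=1
#PBS -l place=scatter
# run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"
# serial Monte Carlo executable (placeholder name)
./monte_carlo input.dat > output.dat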
The following outputs are from a case where I am trying to run 12 of these one-core jobs concurrently on one node using the default workq; the jobs are submitted roughly as shown below.
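Roughly, the submission loop looks like this (run_mc.sh being the placeholder script sketched above):

for i in $(seq 1 12); do
    qsub run_mc.sh
done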
Output of qstat -Q:
Queue Max Tot Ena Str Que Run Hld Wat Trn Ext Type
---------------- ----- ----- --- --- ----- ----- ----- ----- ----- ----- ----
workq 0 12 yes yes 0 12 0 0 0 0 Exec
Output of qstat -fB:
Server: athena
server_state = Active
server_host = athena
scheduling = True
total_jobs = 5862
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:12 Exiting:0 Begun:0
default_queue = workq
log_events = 511
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
resources_default.place = scatter
default_chunk.ncpus = 1
resources_assigned.ncpus = 12
resources_assigned.nodect = 12
scheduler_iteration = 600
FLicenses = 2000000
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
default_qsub_arguments = -V
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:1000000 Avail_Local:1000000 Used:0 High_Use:0
Avail_Sockets:1000000 Unused_Sockets:1000000
pbs_version = 14.1.2
eligible_time_enable = False
job_history_enable = True
max_concurrent_provision = 5
Could this be related to my resource/vnode definitions? I simply created each node with “create node n001” (and so on for the other nodes) in qmgr, as shown below.
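For completeness, the node setup was literally just the following, repeated for each of the four nodes (shown here for n001; I did not set any resources by hand, and the list command is only there to check what PBS picked up automatically):

qmgr -c "create node n001"
qmgr -c "list node n001"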
Output of pbsnodes n001:
n001
Mom = n001.localdomain
Port = 15002
pbs_version = 14.1.2
ntype = PBS
state = free
pcpus = 48
jobs = 5891.athena/0, 5892.athena/1, 5893.athena/2, 5894.athena/3,
5895.athena/4, 5896.athena/5, 5897.athena/6, 5898.athena/7, 5899.athena/8,
5900.athena/9, 5901.athena/10, 5902.athena/11
resources_available.arch = linux
resources_available.host = n001
resources_available.mem = 528278028kb
resources_available.ncpus = 48
resources_available.vnode = n001
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 12
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
I would be grateful for any insight into this. Thanks.