Hi,
I want to schedule a few jobs on a system with GPUs. Some of these jobs require exclusive access to the GPU on the system and some don't. While a GPU-intensive job is running, I want to schedule the CPU jobs alongside it to achieve higher job throughput. How do I schedule these kinds of jobs concurrently while making sure that two GPU jobs don't run at the same time?
I have followed “Chapter 5: Allocating Resources & Placing Jobs” of the User Guide to create ngpus as a resource, and I can see these fields in pbsnodes -v for any node:
resources_available.arch = windows
resources_available.host = wmep09
resources_available.mem = 50322552kb
resources_available.naccelerators = 2
resources_available.ncpus = 24
resources_available.ngpus = 2
resources_available.vnode = wmep09
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 5
resources_assigned.ngpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
But when I schedule any test job with -l select=1:host=wmep09:ncpus=1:ngpus=1, I always see resources_assigned.ngpus=0; it doesn't change to 1. What's happening here? I have followed the static way of creating the ngpus resource.
For exclusive access to the node: use -l place=excl
If you would like shared access: -l place=free or -l place=pack
As long as resources (CPU, GPU, memory, custom resources) are free on a compute node, PBS will schedule jobs onto that node, unless a job has requested exclusive access to it.
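For example, a minimal sketch (gpu_job.sh and cpu_job.sh are hypothetical job scripts):

# GPU job that must not share the node with any other job:
qsub -l select=1:ncpus=1:ngpus=1 -l place=excl gpu_job.sh

# CPU-only job that is free to share a node:
qsub -l select=1:ncpus=4 -l place=free cpu_job.sh

Note that place=excl keeps all other jobs off the node, including CPU-only ones; if you only want to stop two GPU jobs from overlapping, the gpu_count approach below is the better fit.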
To make sure two GPU jobs do not run on the same host at the same time (in case 2 GPUs exist on that node), you can use a custom resource, gpu_count.
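A minimal sketch of that approach (using wmep09 from your pbsnodes output as the example node; the type and flag match the qmgr -c "p s" output later in this thread):

qmgr -c "create resource gpu_count type=long,flag=hn"
qmgr -c "set node wmep09 resources_available.gpu_count=1"
# every GPU job requests the single unit of gpu_count, so at most
# one GPU job can run on the node at a time:
qsub -l select=1:ncpus=1:ngpus=1:gpu_count=1 -- pbs_sleep 60

gpu_count also has to be added to the resources: line of sched_config, as described in Case 1 below for ngpus.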
Case 1:
qmgr -c "create resource ngpus type=long,flag=hn"
Add the "ngpus" resource to the resources: "ncpus, aoe, … , ngpus" line of sched_config.
kill -HUP <PID of the PBS scheduler>
Now submit the job with -l select=1:host=wmep09:ncpus=1:ngpus=1; while this job is running, resources_assigned.ngpus=1 will be populated.

Case 2:
You do not want to run two GPU jobs concurrently, hence reduce the ngpus resource on that node to 1:
qmgr -c "set node NODENAME resources_available.ngpus=1"
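While the job is running you can verify the assignment (wmep09 is the example host from above):

pbsnodes -v wmep09
# resources_assigned.ngpus should now read 1 instead of 0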
Hi, I tried the steps mentioned in Case 1, but I still don't see it taking the GPU resource into account at all.
Also, what is the difference between ngpus and gpu_count?
drao$ qmgr -c "set node wmep13 resources_available.gpu_count=1"
drao$ echo "sleep 3600" | qsub -l select=1:host=wmep13:ncpus=2:ngpus=1:gpu_count=1
14831.sched-win
drao$ pbsnodes -vSjL wmep13
mem ncpus nmics ngpus
vnode state njobs run susp f/t f/t f/t f/t jobs
--------------- --------------- ------ ----- ------ ------------ ------- ------- ------- -------
wmep13 free 1 1 0 24gb/24gb 22/24 0/0 1/1 14831 <-ngpus still shows 1/1, but ncpus shows 22/24
drao$ qstat -f 14831
Job Id: 14831.sched-win
Job_Name = STDIN
Job_Owner = drao@sched-win.abc.net
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.walltime = 00:00:07
job_state = E
queue = workq
server = sched-win
Checkpoint = u
ctime = Fri Jul 20 13:46:06 2018
Error_Path = sched-win.abc.net:C:/Users/drao/Documents/PBS Pro/STDIN.e14831
exec_host = wmep13/0*2
exec_vnode = (wmep13:ncpus=2) <- it doesn't show gpus here
...
This is the output of qmgr -c "p s":
# Create resources and set their properties.
#
#
# Create and define resource ngpus
#
create resource ngpus
set resource ngpus type = long
set resource ngpus flag = hn
#
# Create and define resource gpu_count
#
create resource gpu_count
set resource gpu_count type = long
set resource gpu_count flag = hn
#
# Create queues and set their attributes.
#
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq enabled = True
set queue workq started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = sched-win
set server acl_roots = drao
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server single_signon_password_enable = True
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server job_history_enable = True
set server max_concurrent_provision = 5
Please check and share the output of pbsnodes wmep13 while you run this:
echo "sleep 3600" | qsub -l select=1:host=wmep13:ncpus=2:ngpus=1:gpu_count=1
Also, please share the output of qstat -fx and tracejob for that job.
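That is, something like the following (14831 was the job id from your earlier run; substitute the new one):

pbsnodes wmep13
qstat -fx 14831
tracejob 14831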
resources_available.ngpus=2 (the total number of GPUs available on that node)
resources_available.gpu_count=1 (your requirement was to run only one GPU job per node, with no concurrent GPU jobs allowed; this resource is the concurrency check)
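As a worked example on a two-GPU node (reusing wmep13 and pbs_sleep from this thread):

qmgr -c "set node wmep13 resources_available.ngpus=2"
qmgr -c "set node wmep13 resources_available.gpu_count=1"
qsub -l select=1:ncpus=2:ngpus=1:gpu_count=1 -- pbs_sleep 1000
qsub -l select=1:ncpus=2:ngpus=1:gpu_count=1 -- pbs_sleep 1000

The second job stays queued until the first one finishes: a free GPU remains (ngpus 1/2), but the single unit of gpu_count is already assigned, which is exactly the concurrency check.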
Please share your sched_config file. Are you sure you restarted the PBS services, or executed kill -HUP <PID of the PBS scheduler>, after adding the custom resources to the "resources:" line of sched_config?
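On Linux, that step would look something like this (assuming the default sched_config location under PBS_HOME; pbs_sched is the scheduler daemon):

grep "^resources:" $PBS_HOME/sched_priv/sched_config
kill -HUP $(pgrep pbs_sched)

On Windows there is no HUP signal, so restarting the PBS services is the way to make the scheduler re-read sched_config.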
My test run shows the following:
[root@pbspro ~]# qstat -fx 22877 | grep exec
exec_host = gn001/2
exec_vnode = (gn001:ncpus=1:ngpus=1)
I restarted PBS services and also rebooted the machine running PBS services after adding the custom resources to the resources line.
This is the output of echo "sleep 3600" | qsub -l select=1:host=wmep13:ncpus=2:ngpus=1:gpu_count=1, qstat -fx, and tracejob:
07/23/2018 12:43:04 S Job Queued at request of drao@sched-win.abc.net, owner = drao@sched-win.abc.net, job name = STDIN, queue = workq
07/23/2018 12:43:04 S Job Run at request of Scheduler@sched-win.abc.net on exec_vnode (wmep13:ncpus=2)
07/23/2018 12:43:04 L Considering job to run
07/23/2018 12:43:04 L Job run
07/23/2018 12:43:05 S Obit received momhop:1 serverhop:1 state:4 substate:42
sched_config looks like this (I have removed all the comments to keep it short):
round_robin: False all
by_queue: True prime
by_queue: True non_prime
strict_ordering: false ALL
help_starving_jobs: true ALL
max_starve: 24:00:00
backfill_prime: false ALL
prime_exempt_anytime_queues: false
primetime_prefix: p_
nonprimetime_prefix: np_
node_sort_key: "sort_priority HIGH" ALL
provision_policy: "aggressive_provision"
sort_queues: true ALL
resources: "ncpus, mem, arch, host, vnode, aoe, eoe, ngpus, gpu_count"
load_balancing: false ALL
smp_cluster_dist: pack
fair_share: false ALL
fairshare_usage_res: cput
fairshare_entity: euser
fairshare_decay_time: 24:00:00
fairshare_decay_factor: 0.5
preemptive_sched: true ALL
preempt_queue_prio: 150
preempt_prio: "express_queue, normal_jobs"
preempt_order: “SCR”
preempt_sort: min_time_since_start
dedicated_prefix: ded
log_filter: 3328
My apologies, it is a Windows system (your job has exited with a non-zero status); please use the qsub request below.
qsub -l select=1:ncpus=2:ngpus=1:gpu_count=1 -- pbs_sleep 1000
Note: sleep is not available on Windows, hence we have to use the pbs_sleep command available in the bin directory of the PBS Pro deployment.
Thank you drao for your patience and persistence.
I am not seeing this behavior on the Linux system with the exact same configuration.
I would have to check with @hirenvadalia and @agrawalravi90 on this to get their comments.
Also, please try the qsub command without using host=wmep13 in the chunk statement:
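For example, the same chunk minus the host (pbs_sleep again, since this is a Windows system):

qsub -l select=1:ncpus=2:ngpus=1:gpu_count=1 -- pbs_sleep 1000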
@drao and @adarsh
I just tested on Windows and Linux using the pbspro master branch and it's working fine for me: I can see ngpus listed in exec_vnode in qstat -f, and resources_assigned.ngpus is also getting updated in pbsnodes -av.