Hi Jon,
Thank you for your kind advice.
Following your advice, I tried enabling the pbs_cgroups hook in addition to my current configuration.
In short, it did not work.
After setting up the pbs_cgroups hook (configuration shown below), I ran “qsub ./foo.sh -lselect=ngpus=1”, but the job never started; it stayed in the queue indefinitely.
Honestly, I don’t see how cgroups would solve my problem.
What I want is not to isolate an assigned resource from other jobs, but to confine the assignment of a (single kind of) consumable resource to a single vnode.
I would be happy if you could suggest a concrete way to configure pbs_cgroups for my purpose.
Any workarounds other than pbs_cgroups would also be appreciated.
My current configuration is as follows (in addition to those in the first post):
# qmgr -c "list hook"
Hook pbs_cgroups
    type = site
    enabled = true
    event = execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup
    user = pbsadmin
    alarm = 90
    freq = 120
    order = 100
    debug = false
    fail_action = offline_vnodes
# cat /var/spool/pbs/server_priv/hooks/pbs_cgroups.CF
{
    "cgroup_prefix" : "pbspro",
    "exclude_hosts" : [],
    "exclude_vntypes" : ["no_cgroups"],
    "run_only_on_hosts" : [],
    "periodic_resc_update" : true,
    "vnode_per_numa_node" : false,
    "online_offlined_nodes" : true,
    "use_hyperthreads" : true,
    "cgroup" : {
        "cpuacct" : {
            "enabled" : false,
            "exclude_hosts" : [],
            "exclude_vntypes" : []
        },
        "cpuset" : {
            "enabled" : false,
            "exclude_hosts" : [],
            "exclude_vntypes" : []
        },
        "devices" : {
            "enabled" : true,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "allow" : [
                "b *:* rwm",
                "c *:* rwm",
                ["nvidiactl", "rwm", "*"],
                ["nvidia-uvm", "rwm"]
            ]
        },
        "hugetlb" : {
            "enabled" : false,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "default" : "0MB",
            "reserve_percent" : "0",
            "reserve_amount" : "0MB"
        },
        "memory" : {
            "enabled" : false,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "soft_limit" : false,
            "default" : "256MB",
            "reserve_percent" : "0",
            "reserve_amount" : "1GB"
        },
        "memsw" : {
            "enabled" : false,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "default" : "256MB",
            "reserve_percent" : "0",
            "reserve_amount" : "1GB"
        }
    }
}
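For anyone who wants to double-check which subsystems this file actually turns on, here is a minimal Python sketch. The function name enabled_subsystems and the trimmed SAMPLE string are my own illustrations, not part of PBS; the sketch only assumes the file is plain JSON with a top-level "cgroup" object, as in the listing above.

```python
import json

# Trimmed sample mirroring the structure of the config above (illustrative only).
SAMPLE = """
{
    "cgroup_prefix" : "pbspro",
    "cgroup" : {
        "cpuacct" : { "enabled" : false },
        "cpuset"  : { "enabled" : false },
        "devices" : { "enabled" : true },
        "memory"  : { "enabled" : false }
    }
}
"""


def enabled_subsystems(config_text):
    """Return the sorted names of cgroup subsystems whose "enabled" is true."""
    config = json.loads(config_text)
    cgroup = config.get("cgroup", {})
    return sorted(
        name for name, section in cgroup.items()
        if section.get("enabled") is True
    )


print(enabled_subsystems(SAMPLE))
```

To check the real file, its contents can be read and passed to the same function, for example from /var/spool/pbs/server_priv/hooks/pbs_cgroups.CF. With the configuration posted here, only "devices" is enabled.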
The MoM log after running “qsub ./foo.sh -lselect=ngpus=1” looks like this:
# cat /var/spool/pbs/mom_logs/20171011
10/11/2017 18:46:27;0002;pbs_mom;Svr;Log;Log opened
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;pbs_version=17.1.0
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;hostname=host01;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: localhost localhost.localdomain localhost4 localhost4.localdomain4
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;ipv4 interface bond0: host01
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: localhost localhost.localdomain localhost6 localhost6.localdomain6
10/11/2017 18:46:27;0100;pbs_mom;Svr;parse_config;file config
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address ***.***.***.*** as authorized
10/11/2017 18:46:27;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 499
10/11/2017 18:46:27;0100;pbs_mom;Svr;parse_config;file /var/spool/pbs/mom_priv/config.d/host01.vnodes
10/11/2017 18:46:27;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP set to use reserved port authentication
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 204800
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
10/11/2017 18:46:27;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Single pbs_comm configured, TPP Fault tolerant mode disabled
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm host01
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
10/11/2017 18:46:27;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address ***.***.***.***:15003 to pbs_comm
10/11/2017 18:46:27;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm host01
10/11/2017 18:46:27;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
10/11/2017 18:46:27;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
10/11/2017 18:46:27;0086;pbs_mom;Svr;pbs_mom;Found hook pbs_cgroups type=site
10/11/2017 18:46:27;0086;pbs_mom;Svr;pbs_mom;Found hook PBS_alps_inventory_check type=pbs
10/11/2017 18:46:27;0086;pbs_mom;Svr;pbs_mom;Found hook PBS_power type=pbs
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;ALLHOOKS hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;ALLHOOKS hook[1] = {PBS_alps_inventory_check, order=1, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(exechost_periodic), alarm=90, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;ALLHOOKS hook[2] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_begin hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_begin hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_prologue hook[0] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_launch hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_epilogue hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_epilogue hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_end hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_end hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;exechost_periodic hook[0] = {PBS_alps_inventory_check, order=1, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(exechost_periodic), alarm=90, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;exechost_periodic hook[1] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;exechost_periodic hook[2] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;exechost_startup hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;exechost_startup hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_attach hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0002;pbs_mom;n/a;ncpus;hyperthreading enabled
10/11/2017 18:46:27;0002;pbs_mom;n/a;initialize;pcpus=24, OS reports 24 cpu(s)
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Hook name is pbs_cgroups
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Event type is exechost_startup
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Hook utility class instantiated
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;__get_vnode_type: Could not determine vntype
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Cgroup utility class instantiated
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;GPUs: {'nvidia0': '0000:04:00.0'}
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;create_vnodes: vnode_per_numa_node is disabled
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Hook handler returned success
10/11/2017 18:46:27;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.4408
10/11/2017 18:46:27;0006;pbs_mom;Fil;pbs_mom;Version 17.1.0, started, initialization type = 0
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;Mom pid = 11699 ready, using ports Server:15001 MOM:15002 RM:15003
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at host01:15001
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 0, Received noroute to dest ***.***.***.***:15001, msg="tfd=18, pbs_comm:***.***.***.***:17001: Dest not found"
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 0, Received noroute to dest ***.***.***.***:15001, msg="tfd=18, pbs_comm:***.***.***.***:17001: Dest not found"
(Note: IP addresses are masked)
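In case it helps anyone reading along, the hook-related records can be pulled out of a log like this with a few lines of Python. This is only a sketch: it assumes each record is six semicolon-separated fields, as in the lines above, and the names MomRecord, parse_mom_line, hook_messages and the field labels are my own, not official PBS terminology.

```python
from collections import namedtuple

# Field labels are illustrative guesses based on the layout of the log above.
MomRecord = namedtuple(
    "MomRecord", "timestamp code daemon obj_type obj_name message"
)


def parse_mom_line(line):
    """Split one log line into six fields; return None if it does not fit."""
    # Limit to 5 splits so semicolons inside the message text are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None  # e.g. a wrapped continuation line
    return MomRecord(*parts)


def hook_messages(lines, daemon="pbs_python"):
    """Yield (timestamp, message) for records logged by the given daemon."""
    for line in lines:
        record = parse_mom_line(line)
        if record is not None and record.daemon == daemon:
            yield record.timestamp, record.message


# Three lines copied from the log above.
SAMPLE_LINES = [
    "10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;hostname=host01;pbs_leaf_name=N/A;pbs_mom_node_name=N/A",
    "10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;GPUs: {'nvidia0': '0000:04:00.0'}",
    "10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Hook handler returned success",
]

for timestamp, message in hook_messages(SAMPLE_LINES):
    print(timestamp, message)
```

Passing daemon="pbs_mom" instead lists the records written by the MoM itself.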
Thank you for your kind cooperation!