I have a user who wants to reserve a specific host for a period of time using pbs_rsub. They request the host with the following reservation:
pbs_rsub -R … -E … -l ncpus=16,mem=128GB,host=phys01 -l place=excl
The reservation is confirmed and placed on vnode phys01[0], which has 16 ncpus and 194GB of memory.
When submitting to this reservation, they get the following error:
Not Running: PBS Error: Execution server rejected request and terminated
Looking at the mom_log, I see the following:
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 1219.sched01
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/1219.sched01/
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/1219.sched01/
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/1219.sched01/
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;configure_job: vmem not requested, assigning 7372800k to cgroup
04/30/2024 15:22:01;0080;pbs_python;Hook;pbs_python;['Traceback (most recent call last):', ' File "<embedded code object>", line 5542, in main', ' File "<embedded code object>", line 972, in invoke_handler', ' File "<embedded code object>", line 1021, in _execjob_begin_handler', ' File "<embedded code object>", line 4573, in configure_job', ' File "<embedded code object>", line 3944, in assign_job', ' File "<embedded code object>", line 3782, in _assign_resources', 'TypeError: slice indices must be integers or None or have an __index__ method']
04/30/2024 15:22:01;0001;pbs_python;Hook;pbs_python;Unexpected error in pbs_cgroups handling execjob_begin event for job 1219.sched01 (system hold set): TypeError ('slice indices must be integers or None or have an __index__ method',)
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 1219.sched01, event_type 64 (elapsed time: 0.3524)
04/30/2024 15:22:01;0100;pbs_mom;Hook;pbs_cgroups;execjob_begin request rejected by 'pbs_cgroups'
04/30/2024 15:22:01;0008;pbs_mom;Job;1219.sched01;Unexpected error in pbs_cgroups handling execjob_begin event for job 1219.sched01 (system hold set): TypeError ('slice indices must be integers or None or have an __index__ method',)
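As far as I can tell, the TypeError in the traceback is the message Python raises when a slice bound is not an integer. The sketch below reproduces that message under one assumption: Python 3, where `/` returns a float. The names `cpus` and `requested` are hypothetical. This is not the hook's actual code, and I have not confirmed that this is what happens inside _assign_resources.

```python
# Hypothetical illustration only; this is not code from the pbs_cgroups hook.
cpus = list(range(16))      # stand-in for the CPU ids available on a vnode
requested = 16 / 2          # true division yields a float (8.0) in Python 3

try:
    assigned = cpus[:requested]      # a float slice bound raises TypeError
except TypeError as exc:
    message = str(exc)
    print(message)

# An integer bound (floor division or int()) avoids the error.
assigned = cpus[:16 // 2]
print(assigned)
```

If something similar is going on in the hook, the value being used as a slice bound would be a float or a string rather than an int.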
Our cgroups config:
{
    "cgroup_prefix" : "pbs_jobs",
    "exclude_hosts" : [],
    "exclude_vntypes" : ["no_cgroups"],
    "run_only_on_hosts" : [],
    "periodic_resc_update" : true,
    "vnode_per_numa_node" : "vntype in : phys",
    "propogate_vntype_to_server" : true,
    "online_offlined_nodes" : true,
    "use_hyperthreads" : true,
    "ncpus_are_cores" : "vntype in : phys",
    "cgroup" : {
        "cpuacct" : {
            "enabled" : true,
            "exclude_hosts" : [],
            "exclude_vntypes" : []
        },
        "cpuset" : {
            "enabled" : true,
            "exclude_cpus" : [],
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "mem_fences" : false,
            "mem_hardwall" : false,
            "memory_spread_page" : false
        },
        "devices" : {
            "enabled" : false,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "allow" : [
                "b *:* rwm",
                "c *:* rwm"
            ]
        },
        "hugetlb" : {
            "enabled" : false,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "default" : "0MB",
            "reserve_percent" : 0,
            "reserve_amount" : "0MB"
        },
        "memory" : {
            "enabled" : true,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "soft_limit" : true,
            "default" : "256MB",
            "reserve_percent" : 0,
            "reserve_amount" : "64MB"
        },
        "memsw" : {
            "enabled" : true,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "default" : "256MB",
            "reserve_percent" : 0,
            "reserve_amount" : "64MB"
        }
    }
}
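To double-check that the file parses and to see which subsystems are switched on, a quick sketch like the following can help. It uses a trimmed copy of the config above and only Python's standard json module. The variable names are my own, and it says nothing about how the hook itself interprets the string-valued settings.

```python
import json

# Trimmed copy of the config above; only the fields inspected here are kept.
config_text = """
{
  "vnode_per_numa_node": "vntype in : phys",
  "ncpus_are_cores": "vntype in : phys",
  "cgroup": {
    "cpuacct": {"enabled": true},
    "cpuset": {"enabled": true},
    "devices": {"enabled": false},
    "hugetlb": {"enabled": false},
    "memory": {"enabled": true},
    "memsw": {"enabled": true}
  }
}
"""

config = json.loads(config_text)  # raises ValueError if the JSON is malformed

# Subsystems whose "enabled" flag is true.
enabled = sorted(name for name, section in config["cgroup"].items()
                 if section["enabled"])
print(enabled)

# Top-level settings given as strings rather than plain booleans.
conditional = sorted(key for key, value in config.items()
                     if isinstance(value, str))
print(conditional)
```

With the trimmed copy, the two settings that come back as strings are vnode_per_numa_node and ncpus_are_cores, both set to "vntype in : phys".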
I am not sure why this is happening. The vntype of the vnode is phys.