Error when specifying host on Reservations with cgroups enabled

I have a user who wants a reservation on a specific host for a period of time, created with pbs_rsub. They select the host with the following reservation request:

pbs_rsub -R … -E … -l ncpus=16,mem=128GB,host=phys01 -l place=excl

The reservation is confirmed and is placed on vnode phys01[0], which has 16 ncpus and 194GB of memory.

When submitting to this reservation, they get the following error:

Not Running: PBS Error: Execution server rejected request and terminated

When looking at the mom_log, I see the following:

04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 1219.sched01
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/1219.sched01/
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/1219.sched01/
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/1219.sched01/
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;configure_job: vmem not requested, assigning 7372800k to cgroup
04/30/2024 15:22:01;0080;pbs_python;Hook;pbs_python;['Traceback (most recent call last):', '  File "<embedded code object>", line 5542, in main', '  File "<embedded code object>", line 972, in invoke_handler', '  File "<embedded code object>", line 1021, in _execjob_begin_handler', '  File "<embedded code object>", line 4573, in configure_job', '  File "<embedded code object>", line 3944, in assign_job', '  File "<embedded code object>", line 3782, in _assign_resources', 'TypeError: slice indices must be integers or None or have an __index__ method']
04/30/2024 15:22:01;0001;pbs_python;Hook;pbs_python;Unexpected error in pbs_cgroups handling execjob_begin event for job 1219.sched01 (system hold set): TypeError ('slice indices must be integers or None or have an __index__ method',)
04/30/2024 15:22:01;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 1219.sched01, event_type 64 (elapsed time: 0.3524)
04/30/2024 15:22:01;0100;pbs_mom;Hook;pbs_cgroups;execjob_begin request rejected by 'pbs_cgroups'
04/30/2024 15:22:01;0008;pbs_mom;Job;1219.sched01;Unexpected error in pbs_cgroups handling execjob_begin event for job 1219.sched01 (system hold set): TypeError ('slice indices must be integers or None or have an __index__ method',)
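For anyone reading the traceback: this TypeError is what Python 3 raises when a slice bound is a float rather than an int, which typically happens when a count is computed with true division (/). The snippet below is a minimal sketch of that class of bug. It is not the pbs_cgroups hook source; the function names, the CPU list, and the threads-per-core value are all hypothetical, and I am only assuming the hook does something similar when it halves a hyperthread count.

```python
# Hypothetical illustration only; this is NOT code from the pbs_cgroups hook.

def take_cores(cpu_ids, threads_per_core):
    """Buggy variant: true division yields a float, which cannot be a slice bound."""
    count = len(cpu_ids) / threads_per_core  # 32 / 2 -> 16.0 (float) in Python 3
    return cpu_ids[:count]


def take_cores_fixed(cpu_ids, threads_per_core):
    """Fixed variant: floor division keeps the slice bound an int."""
    count = len(cpu_ids) // threads_per_core  # 32 // 2 -> 16 (int)
    return cpu_ids[:count]


cpu_ids = list(range(32))  # assumed: 32 logical CPUs, 2 threads per core

error_message = ""
try:
    take_cores(cpu_ids, 2)
except TypeError as exc:
    error_message = str(exc)
    print("buggy variant failed:", error_message)

print("fixed variant:", take_cores_fixed(cpu_ids, 2))
```

If the hook really does compute a core count this way, it would explain why the failure only shows up when hyperthread-related options are active.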

Our cgroups config:

{
    "cgroup_prefix"         : "pbs_jobs",
    "exclude_hosts"         : [],
    "exclude_vntypes"       : ["no_cgroups"],
    "run_only_on_hosts"     : [],
    "periodic_resc_update"  : true,
    "vnode_per_numa_node"   : "vntype in : phys",
    "propogate_vntype_to_server" : true,
    "online_offlined_nodes" : true,
    "use_hyperthreads"      : true,
    "ncpus_are_cores"       : "vntype in : phys",
    "cgroup" : {
        "cpuacct" : {
            "enabled"            : true,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : []
        },
        "cpuset" : {
            "enabled"            : true,
            "exclude_cpus"       : [],
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "mem_fences"         : false,
            "mem_hardwall"       : false,
            "memory_spread_page" : false
        },
        "devices" : {
            "enabled"            : false,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "allow"              : [
                "b *:* rwm",
                "c *:* rwm"
            ]
        },
        "hugetlb" : {
            "enabled"            : false,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "default"            : "0MB",
            "reserve_percent"    : 0,
            "reserve_amount"     : "0MB"
        },
        "memory" : {
            "enabled"            : true,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "soft_limit"         : true,
            "default"            : "256MB",
            "reserve_percent"    : 0,
            "reserve_amount"     : "64MB"
        },
        "memsw" : {
            "enabled"            : true,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "default"            : "256MB",
            "reserve_percent"    : 0,
            "reserve_amount"     : "64MB"
        }
    }
}

I am unsure why this is happening. The vntype of the vnode is phys.

It seems the error occurs under the following circumstances:

  • The reservation requests excl placement (place=excl)
  • The reservation targets a host that has vnode_per_numa_node enabled.

I can target a host using excl placement (not exclhost) that doesn’t have vnode_per_numa_node enabled, and the job will complete without issue.

I’ve narrowed this down to a cgroups hook bug that was supposedly already fixed in the version of PBS we have: "cgroups hook problem with hyperthreading" (Issue #1817 in openpbs/openpbs on GitHub). I was able to fix my issue by disabling ncpus_are_cores.
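In case it helps someone else apply the same workaround, here is a minimal sketch of flipping that one key in an exported copy of the hook config. It assumes the config is plain JSON, as in the listing above; the helper name disable_ncpus_are_cores and the two-key sample string are hypothetical. As far as I know, the config can be exported with qmgr -c "export hook pbs_cgroups application/x-config default" and re-imported with qmgr -c "import hook pbs_cgroups application/x-config default <file>", but check that against the documentation for your PBS version.

```python
import json


def disable_ncpus_are_cores(config_text):
    """Return the config JSON with ncpus_are_cores set to false (hypothetical helper)."""
    config = json.loads(config_text)
    config["ncpus_are_cores"] = False  # the workaround described above
    return json.dumps(config, indent=4)


# Trimmed-down stand-in for the exported config; a real file has many more keys.
sample = '{"use_hyperthreads": true, "ncpus_are_cores": "vntype in : phys"}'

updated = disable_ncpus_are_cores(sample)
print(updated)
```

All other keys are passed through untouched, so the only difference after re-import should be the ncpus_are_cores value.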