Unexpected error in pbs_cgroups handling exechost_periodic event: IOError (2, 'No such file or directory')

Hello,

We keep seeing this error on our MoM nodes and are not sure why it occurs. At first glance, there seems to be an issue with the way cgroups/PBS interprets the path in question:

06/09/2020 00:01:44;0002;pbs_mom;Svr;Log;Log opened
06/09/2020 00:01:44;0002;pbs_mom;Svr;pbs_mom;pbs_version=18.1.4
06/09/2020 00:01:44;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
06/09/2020 00:01:44;0002;pbs_mom;Svr;pbs_mom;hostname=node0124.X.Y;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
06/09/2020 00:01:44;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: localhost4
06/09/2020 00:01:44;0002;pbs_mom;Svr;pbs_mom;ipv4 interface eno1: node0124.X.Y
06/09/2020 00:01:44;0002;pbs_mom;Svr;pbs_mom;ipv4 interface ib0: node0124-ib
06/09/2020 00:01:44;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: localhost6
06/09/2020 00:01:44;0100;pbs_mom;Hook;pbs_cgroups;exechost_periodic request rejected by 'pbs_cgroups'
06/09/2020 00:01:44;0100;pbs_mom;Hook;pbs_cgroups;Unexpected error in pbs_cgroups handling exechost_periodic event: IOError (2, 'No such file or directory')
06/09/2020 00:03:34;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Could not determine vntype
06/09/2020 00:03:34;0080;pbs_python;Hook;pbs_python;node0124 is not in the approved host list: ['node0064', 'node0115']
06/09/2020 00:03:34;0100;pbs_python;Hook;pbs_python;init: No cgroups enabled
06/09/2020 00:03:34;0080;pbs_python;Hook;pbs_python;main: Cgroups disabled or none to manage
06/09/2020 00:03:34;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.0129
06/09/2020 00:03:45;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Could not determine vntype
06/09/2020 00:03:45;0080;pbs_python;Hook;pbs_python;_get_assigned_cgroup_resources: No such file: /sys/fs/cgroup/memory/pbspro.slice/pbspro-pbspro-530258.bright01\x2dthx.slice.orphan.slice/memory.memsw.limit_in_bytes
06/09/2020 00:03:45;0080;pbs_python;Hook;pbs_python;['Traceback (most recent call last):', ' File "", line 4650, in main', ' File "", line 2008, in init', ' File "", line 2534, in _get_assigned_cgroup_resources', "IOError: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuset/pbspro.slice/pbspro-pbspro-530258.bright01\\x2dthx.slice.orphan.slice/cpuset.cpus'"]
06/09/2020 00:03:45;0001;pbs_python;Hook;pbs_python;Unexpected error in pbs_cgroups handling exechost_periodic event: IOError (2, 'No such file or directory')
06/09/2020 00:03:45;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.0168
06/09/2020 00:03:46;0100;pbs_mom;Hook;pbs_cgroups;exechost_periodic request rejected by 'pbs_cgroups'
06/09/2020 00:03:46;0100;pbs_mom;Hook;pbs_cgroups;Unexpected error in pbs_cgroups handling exechost_periodic event: IOError (2, 'No such file or directory')
06/09/2020 00:05:35;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Could not determine vntype

Any help anyone can offer toward a resolution would be appreciated.

Thanks,
Siji

Here’s further information:

Listing the contents of /sys/fs/cgroup/cpuset/pbspro.slice/ shows a “pbspro-530258.bright01\x2dthx.slice.orphan” directory but no “pbspro-pbspro-530258.bright01\x2dthx.slice.orphan.slice”.

Is there a reason for the missing directory, or for the alternate directory name?

[node0124 ~]# ls -ltr /sys/fs/cgroup/memory/pbspro.slice/
total 0
-rw-r--r-- 1 root root 0 Jan 21 16:17 tasks
-rw-r--r-- 1 root root 0 Jan 21 16:17 notify_on_release
-r--r--r-- 1 root root 0 Jan 21 16:17 memory.usage_in_bytes
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.swappiness
-r--r--r-- 1 root root 0 Jan 21 16:17 memory.stat
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.soft_limit_in_bytes
---------- 1 root root 0 Jan 21 16:17 memory.pressure_level
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.oom_control
-r--r--r-- 1 root root 0 Jan 21 16:17 memory.numa_stat
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.move_charge_at_immigrate
-r--r--r-- 1 root root 0 Jan 21 16:17 memory.memsw.usage_in_bytes
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.memsw.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.memsw.limit_in_bytes
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.memsw.failcnt
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.limit_in_bytes
-r--r--r-- 1 root root 0 Jan 21 16:17 memory.kmem.usage_in_bytes
-r--r--r-- 1 root root 0 Jan 21 16:17 memory.kmem.tcp.usage_in_bytes
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.kmem.tcp.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.kmem.tcp.limit_in_bytes
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.kmem.tcp.failcnt
-r--r--r-- 1 root root 0 Jan 21 16:17 memory.kmem.slabinfo
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.kmem.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.kmem.limit_in_bytes
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.kmem.failcnt
--w------- 1 root root 0 Jan 21 16:17 memory.force_empty
-rw-r--r-- 1 root root 0 Jan 21 16:17 memory.failcnt
-rw-r--r-- 1 root root 0 Jan 21 16:17 cgroup.procs
--w--w--w- 1 root root 0 Jan 21 16:17 cgroup.event_control
-rw-r--r-- 1 root root 0 Jan 21 16:17 cgroup.clone_children
drwxr-xr-x 2 root root 0 May 27 10:35 pbspro-530258.bright01\x2dthx.slice.orphan
-rw-r--r-- 1 root root 0 May 29 16:41 memory.use_hierarchy
drwxr-xr-x 2 root root 0 May 29 16:41 pbspro-532227.bright01\x2dthx.slice

-Siji
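
One way to read the two names side by side, offered as a guess from the log output and not as a statement about the hook's source: the path the hook fails to open is exactly what results from wrapping the existing orphan directory name in the cgroup prefix and a `.slice` suffix, as if that directory name had been taken for a job ID. The helper `slice_dir_for` below is hypothetical and only reproduces the string arithmetic.

```python
import os

def slice_dir_for(prefix, job_id):
    """Build '<prefix>-<job_id>.slice'. Hypothetical naming rule inferred from
    the directory names in the listing above, not taken from the hook source."""
    return "%s-%s.slice" % (prefix, job_id)

parent = "/sys/fs/cgroup/cpuset/pbspro.slice"

# Directory name for the job that is still running, as it appears in the listing.
running = slice_dir_for("pbspro", r"532227.bright01\x2dthx")

# The same rule applied to the *name of the orphan directory* instead of a job ID.
orphan_name = r"pbspro-530258.bright01\x2dthx.slice.orphan"
wrapped = slice_dir_for("pbspro", orphan_name)

print(running)
print(os.path.join(parent, wrapped, "cpuset.cpus"))
```

If that reading is right, the second printed path is the one from the traceback, and the periodic event would keep failing for as long as the `.orphan` directory is present.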

Please check the following:

  1. Make sure the cgroup subsystem is available on the compute nodes.
  2. Make sure cgroups.json (the configuration file of the cgroup hook) is correctly defined.
  3. Make sure you are using this cgroup hook:
    https://github.com/openpbs/openpbs/blob/master/src/hooks/cgroups/pbs_cgroups.PY

and that you have read this documentation:
https://pbspro.atlassian.net/wiki/spaces/PD/pages/1385103372/Improve+cgroup+hook+and+configuration+file+support+on+heteregeneous+clusters

Hint:
Do you have a file called vntype in this location: $PBS_HOME/mom_priv/vntype?
This file should contain the vntype string that denotes the type of the node, for example “gpu”, “highmem” or “lowmem”.
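
A minimal sketch for checking that file on a node. The function name `read_vntype` is made up for illustration, and `/var/spool/pbs` is only an assumed fallback when `PBS_HOME` is not set in the environment.

```python
import os

def read_vntype(pbs_home):
    """Return the stripped contents of <pbs_home>/mom_priv/vntype, or None if
    the file cannot be read or is empty."""
    path = os.path.join(pbs_home, "mom_priv", "vntype")
    try:
        with open(path) as handle:
            value = handle.read().strip()
    except IOError:  # on Python 3, IOError is an alias of OSError
        return None
    return value or None

if __name__ == "__main__":
    # Assumed fallback location; use the PBS_HOME value from your own pbs.conf.
    home = os.environ.get("PBS_HOME", "/var/spool/pbs")
    print("vntype:", read_vntype(home))
```

A result of `None` would match the "Could not determine vntype" messages in the log.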

Before upgrading from the stock cgroup hook to the cgroup hook from the link above, delete the contents of $PBS_HOME/mom_priv/hook_data/cgroup_jobs.
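
The reply does not say whether cgroup_jobs is a regular file or a directory, so the sketch below handles both cases. The function name `empty_path` is hypothetical, the operation is destructive, and it is meant as an illustration to adapt, not as a documented PBS procedure.

```python
import os
import shutil

def empty_path(path):
    """Empty `path` without removing it: truncate a regular file, or delete
    every entry inside a directory. Returns the number of directory entries
    removed (0 for a file or a missing path)."""
    if not os.path.exists(path):
        return 0
    if os.path.isdir(path):
        names = os.listdir(path)
        for name in names:
            full = os.path.join(path, name)
            if os.path.isdir(full) and not os.path.islink(full):
                shutil.rmtree(full)
            else:
                os.remove(full)
        return len(names)
    open(path, "w").close()  # truncate the file to zero bytes
    return 0
```

It could be called as `empty_path(os.path.join(pbs_home, "mom_priv", "hook_data", "cgroup_jobs"))`, where `pbs_home` is the value of `$PBS_HOME` on the node.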

Hope this helps

Adarsh,

Thanks so much for the input!

Based on the following, I think our cgroups subsystem is functional:

[…]# systemd-cgls
├─1 /usr/lib/systemd/systemd
├─user.slice
│ └─user-0.slice
│   └─session-153515.scope
│     ├─226586 sshd: root@pts/0
│     ├─226588 -bash
│     ├─226924 systemd-cgls
│     └─226925 less
└─system.slice
  ├─pbs.service
  │ └─123390 /cm/shared/apps/pbspro-ce/current/sbin/pbs_mom

So here’s what our cgroups.json file looks like:

[…]# cat pbs_cgroups.json
{
    "cgroup_prefix" : "pbspro",
    "exclude_hosts" : ["node0064", "node0115"],
    "exclude_vntypes" : ["no_cgroups"],
    "run_only_on_hosts" : [],
    "periodic_resc_update" : true,
    "vnode_per_numa_node" : false,
    "online_offlined_nodes" : true,
    "use_hyperthreads" : false,
    "ncpus_are_cores" : true,
    "cgroup" : {
        "cpuacct" : {
            "enabled" : true,
            "exclude_hosts" : [],
            "exclude_vntypes" : []
        },
        "cpuset" : {
            "enabled" : true,
            "exclude_cpus" : [],
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "mem_fences" : true,
            "mem_hardwall" : false,
            "memory_spread_page" : false
        },
        "devices" : {
            "enabled" : false,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "allow" : [
                "b *:* rwm",
                "c *:* rwm"
            ]
        },
        "hugetlb" : {
            "enabled" : false,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "default" : "0MB",
            "reserve_percent" : "0",
            "reserve_amount" : "0MB"
        },
        "memory" : {
            "enabled" : true,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "soft_limit" : false,
            "default" : "1GB",
            "reserve_percent" : "2",
            "reserve_amount" : "512MB"
        },
        "memsw" : {
            "enabled" : true,
            "exclude_hosts" : [],
            "exclude_vntypes" : [],
            "default" : "1GB",
            "reserve_percent" : "2",
            "reserve_amount" : "512MB"
        }
    }
}
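
A quick way to confirm that a copy of the file parses as JSON, and to see which subsystems it switches on for a given host, is sketched below. This is a simplified reading of the file and not the hook's own logic: the helper `enabled_subsystems` is hypothetical, it looks only at literal `true` values and the `exclude_hosts` lists, and it ignores vntype rules and `run_only_on_hosts`.

```python
import json

def enabled_subsystems(config, hostname):
    """Return the sorted names of cgroup subsystems switched on for `hostname`.
    Only literal true and the exclude_hosts lists are considered."""
    if hostname in config.get("exclude_hosts", []):
        return []
    result = []
    for name, settings in sorted(config.get("cgroup", {}).items()):
        excluded = hostname in settings.get("exclude_hosts", [])
        if settings.get("enabled") is True and not excluded:
            result.append(name)
    return result

# Trimmed-down stand-in for the file above; json.loads raises ValueError
# if the text is not valid JSON (for example, a key with no value).
sample = json.loads("""
{
  "exclude_hosts": ["node0064", "node0115"],
  "cgroup": {
    "cpuset":  {"enabled": true,  "exclude_hosts": []},
    "devices": {"enabled": false, "exclude_hosts": []},
    "memory":  {"enabled": true,  "exclude_hosts": []}
  }
}
""")

print(enabled_subsystems(sample, "node0124"))  # ['cpuset', 'memory']
print(enabled_subsystems(sample, "node0064"))  # []
```

To run it against the real file, replace the `json.loads(...)` call with `json.load(open(path))` for the path of your pbs_cgroups.json.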

and here are our pbs_cgroups hook attributes:

create hook pbs_cgroups
set hook pbs_cgroups type = site
set hook pbs_cgroups enabled = true
set hook pbs_cgroups event = execjob_begin
set hook pbs_cgroups event += execjob_epilogue
set hook pbs_cgroups event += execjob_end
set hook pbs_cgroups event += execjob_launch
set hook pbs_cgroups event += execjob_attach
set hook pbs_cgroups event += exechost_periodic
set hook pbs_cgroups event += exechost_startup
set hook pbs_cgroups user = pbsadmin
set hook pbs_cgroups alarm = 90
set hook pbs_cgroups freq = 120
set hook pbs_cgroups order = 100
set hook pbs_cgroups debug = true
set hook pbs_cgroups fail_action = none

As suspected, the $PBS_HOME/mom_priv/vntype file is missing; I don’t believe it was ever installed. How should we go about creating it? Is there a template we can review?

I’ll wait for your comments before attempting an upgrade of our pbs_cgroups.PY hook.

Thanks,
Siji

The vntype file has to be created manually, for example:

cat $PBS_HOME/mom_priv/vntype
gpu

The “Could not determine vntype” messages in the logs shared above point to this missing file.

Sample cgroup.json:

{
   "cgroup_prefix":"pbspro",
   "enabled":"vntype in: amd_node",
   "periodic_resc_update":true,
   "vnode_per_numa_node":"vntype in: amd_node",
   "online_offlined_nodes":"vntype in: amd_node,exp_node,imp_node,gpu_node, intel_node",
   "cgroup":{
      "cpuacct":{
         "enabled":true,
         "exclude_hosts":[

         ]
      },
      "cpuset":{
         "enabled":true,
         "exclude_hosts":[

         ],
         "exclude_vntypes":[

         ],
         "memory_spread_page":true,
         "mem_hardwall":false,
         "mem_fences":"vntype in: amd_node"
      },
      "devices":{
         "enabled":false,
         "exclude_hosts":[

         ],
         "exclude_vntypes":[

         ],
         "allow":[
            "b *:* rwm",
            "c *:* rwm",
            [
               "mic/scif",
               "rwm"
            ],
            [
               "nvidiactl",
               "rwm",
               "*"
            ],
            [
               "nvidia-uvm",
               "rwm"
            ]
         ]
      },
      "hugetlb":{
         "enabled":false,
         "default":"0MB",
         "exclude_hosts":[

         ],
         "exclude_vntypes":[

         ]
      },
      "memory":{
         "enabled":true,
         "default":"256MB",
         "reserve_amount":"32GB",
         "exclude_hosts":[

         ],
         "exclude_vntypes":[

         ]
      },
      "memsw":{
         "enabled":false,
         "default":"256MB",
         "reserve_memory":"2gb",
         "exclude_hosts":[

         ],
         "exclude_vntypes":[

         ]
      }
   }
}
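
The sample uses strings such as "vntype in: amd_node" where the stock file has plain booleans. The sketch below assumes such a string means "true on nodes whose vntype is one of the listed names"; that is an interpretation of the sample, the linked documentation is the authority, and the real hook may accept more forms than this. The function name `evaluate_setting` is hypothetical.

```python
def evaluate_setting(value, vntype):
    """Resolve a setting that is either a plain JSON value or a string of the
    form 'vntype in: a,b,c'. Assumed semantics: True when `vntype` is one of
    the comma-separated names; any other value is returned unchanged."""
    marker = "vntype in:"
    if isinstance(value, str) and value.startswith(marker):
        names = [name.strip() for name in value[len(marker):].split(",")]
        return vntype in names
    return value

rule = "vntype in: amd_node,exp_node,imp_node,gpu_node, intel_node"
print(evaluate_setting(rule, "intel_node"))
print(evaluate_setting("vntype in: amd_node", "gpu_node"))
print(evaluate_setting(True, "gpu_node"))
```

Under that assumption, a node whose vntype file contains `gpu_node` would get `online_offlined_nodes` but not `vnode_per_numa_node` or `mem_fences` from the sample above.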