PP-325: Design document review for cgroups hook

This message is to inform the community that the design document for the cgroups hook has been updated and expanded. You may review the document here:

https://pbspro.atlassian.net/wiki/display/PD/PP-325%3A+Support+Cgroups

Please provide comments in response to this post. Thank you!

Mike,

In reviewing the document, I believe that the following is not correct:

use_hyperthreads false
When set to true, hyperthreads are treated as though they were physical cores. When false, hyperthreads are not counted as physical cores.

I believe that when false, hyperthreads are not added to the cpuset.cpus list even if the cpu has hyperthreading enabled.
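
To make that concrete, here is a rough sketch of the behavior I would expect (hypothetical code, not the hook's actual implementation; it assumes sibling threads are read from sysfs):

    # Hypothetical sketch: when use_hyperthreads is false, only one logical
    # CPU per physical core ends up in cpuset.cpus.
    import glob

    def parse_cpu_list(text):
        # Expand a kernel CPU list like "0-3,8" into a set of ints.
        cpus = set()
        for chunk in text.split(','):
            if '-' in chunk:
                lo, hi = chunk.split('-')
                cpus.update(range(int(lo), int(hi) + 1))
            else:
                cpus.add(int(chunk))
        return cpus

    def assignable_cpus(use_hyperthreads):
        cpus = set()
        pattern = '/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list'
        for path in glob.glob(pattern):
            with open(path) as f:
                siblings = parse_cpu_list(f.read().strip())
            if use_hyperthreads:
                cpus.update(siblings)      # count every hyperthread
            else:
                cpus.add(min(siblings))    # keep one thread per physical core
        return sorted(cpus)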

Other than that the document looks good.

Thanks for the comment. I have addressed it.

Looks good. I have no further questions/comments.

Would it make sense for exclude_hosts, run_only_on_hosts, and possibly exclude_vntypes to accept lists of regular expressions, rather than just individual names? This could be useful on large clusters, where similar hosts have similar names.

It would make sense for those items to accept regular expressions. I think that is an enhancement we should consider implementing in the future. Thank you for your suggestion. I have filed a ticket on your behalf here: https://pbspro.atlassian.net/browse/PP-678

The document looks good, Mike! The only thing: under ‘kill_timeout’, be sure to specify the time unit. I’m sure it’s in ‘seconds’.

Good point @bayucan. I have updated the document.

Mike, is this design intended to cover all eleven of the user stories under PP-325?

  • the Cgroup Configuration File found here contains an example of a
    configuration file, but it’s not labeled as an example
  • if the global exclude_hosts value lists node001 and node002, why is
    it necessary to have entries in cgroup:cpuacct and cgroup:cpuset for
    them?
  • why is the cgroup hook run on every node assigned to the job since
    it’s certainly possible that one or more of the assigned nodes does
    not support cgroups?
  • if vnode_per_numa_node is true, how are the NUMA-node vnodes named?
  • how does the cpuset subsystem’s enabled parameter interact with the
    PBS pbs_mom.standard vs. pbs_mom.cpuset convention? Does/should the
    upgrade process deal with the convention transition? It seems wrong
    that the default would be false if installing on an existing PBS
    configuration that was previously using CPU sets

@altair4: I attempted to address your first four comments by updating the document. I added a note in the “enabled” section of the cpuset subsystem attempting to explain the options administrators may choose from on SGI systems. The installation process does not modify the cgroup hook configuration file; if an existing configuration file is present, it will not be overwritten. Once the administrator alters the file, their changes will be preserved during future upgrades.

Thank you for your comments, and please let me know if the changes meet with your approval.

@mkaro: Yes, I think you’ve covered my questions. Thanks.

I have a couple of questions:

  • How does the use of cgroups affect the normal rlimits the mom places on the job?
  • What do you mean by vntypes? Do you mean the builtin vntype resource? You might want to make this more clear.
  • What is the difference between creating cpusets and using cgroup cpus/mems/memsw subsystems? If you create a cpuset for a job, you’re boxing the job into its own little world of memory and cpus. If you use the cpus/mems/memsw subsystems you create limits on what the job can use. This is essentially the same thing, right? Is there some reason you’d want to use both together?
  • Under the memsw subsystem, it says 0MB is the default. Is this correct? In the memory description, it said if memsw wasn’t provided, the memory and memsw limits would be the same. Also, it says physical memory. Shouldn’t that say virtual memory?
  • I’m wondering why Public/Experimental? This doesn’t mean it’s any easier to change than a Public/Stable interface. It just means you can add it in a patch release. If you’re looking for something easier to change, I think you have to go looking in the PBS Private area, and that doesn’t look like it applies here.

One last thing: From the last time I looked at the code, there were a ton of debug log messages. Since log messages are interfaces, they will need to be documented.

Really nicely written – all designs should start with Overview and example – thanks @mkaro!

A few comments:

  • For both the memory and memsw subsystems, it’s not clear to me what “available physical memory” means, e.g., (1) is it the amount configured into the system (e.g., if I put in 64 GB of memory, it is 64 GB, or is it some lesser amount returned by the kernel)?, and (2) is the reserve_percent calculated before or after deducting reserve_amount? (Also, the last sentence defining reserve_amount is self-referential for both memory & memsw).

  • For nmics, ngpus, vmem, and hpmem, the design states the admin must manually add these to the resources line in the scheduler configuration file for these resources “to be considered”. I suggest appending “for scheduling”. Also, is there any way to make this automatic – it seems that if users are submitting jobs with “nmics”, for example, they will definitely want them to be scheduled. (Perhaps beyond the scope of this design…)

  • I’m not sure of the prevalence of using exclusions, but it feels like overkill to have both exclude_hosts & exclude_vntypes: why not just have one exclude? Having whichever is more prevalent would certainly suffice (and put a slight additional burden on those admins who need the other type of exclusion).

  • Finally, a broader comment. I understand that a goal for core PBS Pro is to move all configuration into a single system, e.g., qmgr. This design goes the other way – adding configuration into a separate hook file. Having a separate hook config file for a non-core feature is a great idea (so people can work on extensions without worrying about core PBS Pro), but adding a totally new type of configuration methodology to core PBS Pro seems like moving the wrong direction. Not sure what others think about this…

Again, thanks!

The cgroup limits do not affect the rlimits because they are independent mechanisms within the kernel. A process could be denied access to a resource if it violates either limit. The cgroup limits apply to groups of processes (i.e. all processes in a job) while the rlimits are per process.
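
To illustrate the distinction (hypothetical paths, PIDs, and values, not the hook's actual code): an rlimit is applied to a single process, while a cgroup limit is applied collectively to every process placed in the group.

    # A per-process rlimit versus a group-wide cgroup limit.
    import resource

    one_gb = 1 << 30

    # rlimit: applies to this one process (and whatever it forks later).
    resource.setrlimit(resource.RLIMIT_AS, (one_gb, one_gb))

    # cgroup: applies collectively to every PID added to the group.
    cgroup = '/sys/fs/cgroup/memory/pbspro/1234.server'   # hypothetical job cgroup
    with open(cgroup + '/memory.limit_in_bytes', 'w') as f:
        f.write(str(one_gb))            # one limit shared by all job processes
    with open(cgroup + '/cgroup.procs', 'w') as f:
        f.write('4321')                 # hypothetical PID moved into the group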

Document updated.

Creating cpusets from within MoM via libcpuset is exactly the same thing as creating them via the cgroups hook. I added an Administrator Notes section at the bottom of the document and indicated that the cgroup hook should not be used in conjunction with the cpuset MoM.

Yes, 0MB is the default when no configuration file is present. The default in the configuration file packaged with PBS Pro is 256MB. You are correct about “physical” memory… it has been updated to “virtual” memory.

Agreed. Public/experimental seemed the best fit, which probably just confused anyone reading this who does not work for Altair.

Good grief, that would make this document about 100x longer! Not to mention the amount of time required.

One of the comments from @bhroam identified that “physical” should have been “virtual” for memsw. Physical and virtual memory are obtained from /proc/meminfo as MemTotal and (MemTotal + SwapTotal), respectively.

I updated the document to reflect that reserve_percent is calculated prior to reserve_amount to obtain the total amount reserved. I also corrected the self-reference issues.
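
For anyone following along, the calculation works roughly like this (a sketch based on the description above; the example reserve values are made up and the hook's exact arithmetic may differ):

    # Totals come from /proc/meminfo; reserve_percent is applied first,
    # then reserve_amount is added to get the total amount reserved.
    def meminfo_kb():
        info = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, value = line.split(':', 1)
                info[key] = int(value.split()[0])    # values are reported in kB
        return info

    info = meminfo_kb()
    physical_kb = info['MemTotal']                     # memory subsystem
    virtual_kb = info['MemTotal'] + info['SwapTotal']  # memsw subsystem

    reserve_percent = 5             # example values, not the shipped defaults
    reserve_amount_kb = 256 * 1024
    reserved_kb = physical_kb * reserve_percent // 100 + reserve_amount_kb
    available_kb = physical_kb - reserved_kb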

I added the suggested text. It would be considerably easier to update the scheduler’s resources list if it were accessible via qmgr. I’d have to agree that this is out of scope for this design, but a valid point nonetheless.

The ability to exclude vnode types was primarily targeted for Cray systems where vnode types are commonly defined. The ability to exclude individual hosts was to target individual nodes where the complex does not define vnode types.

I agree that a central repository for all things related to PBS Pro configuration would be of benefit. At the time the cgroups hook was first conceived, the hook configuration file seemed the best fit because the server would push the changes out to all of the MoMs in the complex when it was modified.

@mkaro, thanks for updating the document.
One quick thing: vntype is a resource, not an attribute.

As for cpusets, I really was just curious about the differences between a cpuset and a cgroup limit and why one would want to use both. From my point of view, they look like they are providing the same service. Is it that a cpuset will allow you to hammer down the exact cpus and memory, whereas a cgroup just limits the amount used?

I’m still confused about the sentence in the memory subsystem that says if there isn’t a virtual memory limit requested, the limit will be set to the physical memory limit. How does this mix with the default of 0MB/256MB? Does the sentence in the memory subsystem just take effect if the memory subsystem is enabled and memsw is not?

As for the log messages, I’d take a pass over them and remove the less important ones and document the rest. I remember some that were less useful than others. Either that or take this up with Ian.

Bhroam’s advice is sound.

Thanks @bhroam. I updated the document replacing “vntype attribute” with “vntype resource”.

The cpuset subsystem in cgroups is the very same thing that used to be referred to simply as cpusets. SGI provided a library that could be used to manipulate them, but that was more for convenience than anything else. Manually creating the directories and populating the files achieves the same result. The default location of the cpuset filesystem has changed over time, and a prefix is now applied to some of the files by default. It is possible to mount the cpuset filesystem with the “noprefix” flag to restore backward compatibility.
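
For example, a cpuset can be built entirely by hand (a minimal sketch; the mount point and job name are hypothetical, and with the “noprefix” mount option the files would be named “cpus” and “mems” instead):

    # Creating and populating a cpuset directly under a cgroup-v1 mount.
    import os

    base = '/sys/fs/cgroup/cpuset/pbspro/1234.server'
    os.makedirs(base, exist_ok=True)

    with open(os.path.join(base, 'cpuset.cpus'), 'w') as f:
        f.write('0-3')             # logical CPUs assigned to the job
    with open(os.path.join(base, 'cpuset.mems'), 'w') as f:
        f.write('0')               # NUMA memory node(s) assigned to the job
    with open(os.path.join(base, 'tasks'), 'w') as f:
        f.write(str(os.getpid()))  # place a process into the cpuset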

If we set a cgroup limit of 1GB for mem and don’t set the memsw limit, then the kernel will allow the processes to allocate 1GB of RAM and unlimited swap. If we set a limit of 1GB for both mem and memsw, the kernel will allow the processes to access exactly 1GB of memory, regardless of whether it’s in RAM or swap. If we set a cgroup limit of 1GB for mem and 2GB for memsw, the kernel will allow the processes to allocate up to 2GB with no more than 1GB being RAM. Finally, if we set a cgroup limit of 2GB for mem and 1GB for memsw, the kernel will only allow the processes to access 1GB. In terms of a PBS Pro job, if you set your vmem limit lower than your mem limit you’ve made a mistake and the cgroups hook will reject the job. When a user submits a job specifying -lselect=1:mem=1gb we don’t want to grant them access to unlimited swap, but we do want them to have access to the full 1GB they requested, so memsw is set to the same as mem. Hopefully, that description helps.
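
In other words, the rule boils down to something like this (a sketch of the logic just described, not the hook's actual code):

    # mem_bytes is the job's "mem" request; vmem_bytes is its "vmem" request,
    # or None if vmem was not requested.
    def memory_limits(mem_bytes, vmem_bytes=None):
        if vmem_bytes is None:
            # No vmem requested: cap RAM + swap at the mem request, so the job
            # gets its full mem allocation but no additional swap.
            vmem_bytes = mem_bytes
        elif vmem_bytes < mem_bytes:
            # A vmem limit below the mem limit is a mistake; reject the job.
            raise ValueError('vmem limit must not be smaller than mem limit')
        return {
            'memory.limit_in_bytes': mem_bytes,           # RAM
            'memory.memsw.limit_in_bytes': vmem_bytes,    # RAM + swap
        }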

I’ll add interfaces for relevant log messages.

My question has more to do with the nature of a cpuset vs a cgroup limit. I understand that turning on the cpuset subsystem will create cpusets just like the pbs_mom.cpuset would. My question is what is the difference between a cpuset and the cpu/memory/memsw subsystems? Why would someone use one instead of the other, or would they use both at the same time?

Thank you for your explanation. It helped me understand the subsystems better. I still have a question about the default. If a job doesn’t request vmem, we set the memory/memsw limit to the same thing. My question is, if this is true, where does the default come into play? When do we use that default if we’re always setting the memory/memsw limit to the same thing?

Bhroam