PP-759: possibility to disable job-wide limit enforcement for exclusive jobs

Hello,

I have posted a design proposal for PP-759.

If we use job-wide limit enforcement, exclusive jobs are also killed once they exceed their requested resources, even though the node is fully dedicated to the job. We should have a mom config variable to control this behavior.

For example: let’s say we have a node with 32 cpus and a job requesting 16 cpus exclusively. Once the job is started on the node, no other jobs can run on the same node, so some limits can be ignored. It can be useful to allow the job to use the node without job-wide limits like ncpus or mem. Of course, the limits for walltime or cputime are still applied.
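For concreteness, here is a minimal sketch of the decision this proposal adds. The real pbs_mom code is C and the config variable name is still open, so everything below (names, helpers, the config lookup) is illustrative only:

    # Illustrative sketch only -- the real pbs_mom code is C and the config
    # variable name is still being discussed; all names here are hypothetical.

    def should_enforce_job_wide(job, mom_config):
        """Decide whether job-wide ncpus/mem limits are enforced for this job."""
        enforcement_disabled_for_excl = not mom_config.get("enforce_on_excl", True)
        if enforcement_disabled_for_excl and job_owns_whole_host(job):
            # Node is fully dedicated to the job: skip job-wide ncpus/mem checks.
            # Walltime and cputime enforcement is unaffected by this toggle.
            return False
        return True

    def job_owns_whole_host(job):
        # Hypothetical placeholder for "the host is exclusively allocated to
        # this job" (e.g. -l place=excl on a single-vnode host).
        return "excl" in job.get("place", "")

    # Example: a 32-cpu node, job requested ncpus=16 with -l place=excl:
    # should_enforce_job_wide({"place": "excl"}, {"enforce_on_excl": False}) -> False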

Please provide your feedback.

Thank you,
Vasek

I like the idea and have a few comments:

  • I think this proposal needs to take into account multi-vnoded hosts and state that the new enforcement behavior only occurs if all of the vnodes that the pbs_mom in question manages are “state = job-exclusive”. As written, I read this as operating on a per-vnode level, but the setting is per-mom. More than one “exclusive job” (literally -lplace=excl) can run on a single multi-vnoded host, and each individual vnode will be “state = job-exclusive”. Maybe this will matter less in the near future as the number of configurations using multiple vnodes per host WITHOUT some sort of containment utility drops (for example, because of PBS Pro’s use of cgroups, which brings its own enforcement capabilities and I believe already takes exclusive allocation into account), but I think it still needs to be consistent within the product. A sketch of the host-level condition I have in mind follows this list.

  • I also find the name “enforcement_on_excl” to be confusing. PBS Pro also has -lplace=exclhost, which will allocate all vnodes to a job and set them all to “state = job-exclusive”, which is a good way to meet the requirement I propose above for the new behavior, but renaming this to “enforcement_on_exclhost” would be problematic since you can of course get all vnodes on a host to be “state = job-exclusive” without actually using exclhost if you have the “right” resource request. I’ll keep thinking of a name I like better; if anyone has suggestions, please share!

  • Much more minor, but on the naming subject again: I think just changing the conjugation of the verb to “enforce_on_excl” makes it more consistent with the existing “enforce xxx” options.
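A sketch of the host-level condition described in the first bullet, using stand-in objects rather than the real mom or hook data structures:

    # Stand-in sketch: the new behavior would apply only when every vnode
    # managed by this pbs_mom reports state = job-exclusive.

    def whole_host_is_exclusive(vnodes_on_this_host):
        """True only if all vnodes on the host are job-exclusive."""
        return all(v["state"] == "job-exclusive" for v in vnodes_on_this_host)

    # Example: a two-vnode host where only one vnode is exclusively allocated
    # whole_host_is_exclusive([{"state": "job-exclusive"}, {"state": "free"}]) -> False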

Vasek,

Thanks for posting this. I really like the idea. A few comments:

  • I think this might be better done upstream of the mom, at the server level. This works for disabling the standard enforce * options, but it does not account for other modes of enforcement (e.g., cgroups), since they look at what was requested.
  • If done upstream at the server level, the run request could be modified (memory only) so that we could still enforce queue- or server-level memory limits, since the exclusive job makes all of the memory on a given node unavailable. (A rough sketch of this idea follows below.)
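A very rough sketch of that second point, with hypothetical names only (the real server code and hook interfaces are not shown here): when a job takes a host exclusively, its accounted memory could be raised to the node’s total so queue- and server-level memory limits still see the whole node as consumed.

    # Hypothetical sketch of adjusting the run request (memory only), as
    # suggested above; "job" and "node" are plain stand-in dictionaries.

    def adjust_run_request_for_exclusive(job, node):
        if "excl" in job.get("place", ""):
            # The exclusive job makes all of the node's memory unavailable,
            # so account for all of it against queue/server memory limits.
            job["mem_kb"] = max(job.get("mem_kb", 0), node["total_mem_kb"])
        return job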

Jon

Thank you for your comments.

@scc

  • You are right, multi-vnoded hosts should definitely be considered. It makes sense to ignore the job-wide limits only if the whole host is allocated. So what about applying the new behavior only to jobs with ‘-l place=exclhost’? If a job has ‘-l place=excl’, the present behavior will be preserved and all the limits will be enforced. It would also allow us to use the name ‘enforce_on_exclhost’.

@jon

  • I believe it is useful to keep the setting on the mom side, in the mom config, since you can use different settings on different moms. This can be especially useful in a heterogeneous infrastructure.

  • How about adding a new option to the cgroups hook? This new option would also allow ignoring cgroup limits for jobs with ‘-l place=exclhost’. (A sketch of what I mean follows below.)
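A purely illustrative sketch of that option: the option name is made up, and the way the real cgroups hook reads its configuration and the job’s placement may differ.

    # Illustrative only: skip cgroup confinement for exclhost jobs when a
    # (hypothetical) hook option is enabled. In a real hook the job object
    # would come from the hook event; here it is a plain dictionary.

    def confine_job_in_cgroups(job, hook_config):
        place = job.get("place", "")
        if hook_config.get("ignore_limits_on_exclhost", False) and "exclhost" in place:
            # The whole host belongs to this job, so leave it unconfined.
            return False
        # ... existing cgroup setup (cpuset, memory limits, ...) would run here ...
        return True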

Any new thoughts on this?

We actually want this because of nodes with hyperthreading. In our infrastructure we always set ncpus = physical cores, with cgroups disabled. We don’t want all jobs to use HT. If a user wants to use HT, we assume the user will ask for the host exclusively.

Do you have a better idea of how to deal with HT nodes?

Vasek

Hi @vchlum,

Thanks for your contribution!
I have a question about the parameter, though: what will happen if enforce_on_exclhost is set to FALSE and the job exceeds the resources associated with the node? Should we still let the job continue, or kill it?

I think that in the code, instead of setting “enforce_job_wide” to false, we should rather do something like enforce_node_wide and make sure that the job does not exceed the node’s resources. What do you think?
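To make the suggestion concrete, here is a small sketch (names are illustrative, not from the mom code) of the difference between dropping enforcement entirely and clamping the limit to what the node offers:

    # Illustrative comparison of the behaviors discussed here, for a single
    # resource such as ncpus or mem.

    def enforced_limit(requested, node_total, mode):
        if mode == "enforce_job_wide":      # current behavior: limit = request
            return requested
        if mode == "enforce_node_wide":     # suggestion above: limit = node total
            return node_total
        if mode == "no_enforcement":        # the design proposal: no job-wide limit
            return None
        raise ValueError(f"unknown mode: {mode}")

    # Example: job requested ncpus=16 on a 32-cpu exclhost node
    # enforced_limit(16, 32, "enforce_node_wide") -> 32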

Regards,
Arun

Hi @arungrover,
Thanks for your reply.

If enforce_on_exclhost=False, then the job will continue even if the requested resources are exceeded, and the job is not killed. My intention is to let the job consume as many resources as it needs.

OK, so you suggest setting new limits according to the node resources… Well, are all of the node’s total resources available in mom? I can’t find them… There are some values in the vnlp structure, but the ncpus value doesn’t seem to respect a custom value set through qmgr.

If this is not good, then maybe it is better to do this upstream at the server level, as mentioned above.

@vchlum my only worry was that a job may end up consuming more than what a node reports (if its resources report a custom value set through qmgr). If mom does not receive these custom resource values (as you mentioned in your comment), then it will be a much bigger change to make the server communicate custom resource values and honor them.

I’m not sure if applying these limits at the server is the right thing either. I feel the server is already doing so much; checking a job’s used resources and checking/enforcing limits for each running job every time an update is received would be time-consuming for the server. Another way to deal with this could have been for the server to update the job’s resource request itself, but that would create problems if the job gets requeued.

In short, I don’t really have a right answer to what we should be doing here. Maybe what you have expressed in your design proposal is the right thing to do.

@arungrover Concerning your worry: does it really matter if the job consumes more than what the node reports? What can actually happen? I think it is safe to allow the job to use all cpus of the host, simply because the job is on the host exclusively. On the other hand, if the job consumes too much memory and the host starts to swap, the swapping may result in host failure…

I will check the mom code again. Maybe I overlooked something. I think it is a good idea to change the limit for the job on the mom side.

I have to agree with @jon regarding cgroups. The cgroup hook is only aware of the resources assigned to a job on the local node, and is unaware of any placement directives including exclhost. From that standpoint, what you are proposing and what exists today with the cgroups hook are incompatible. That said, if mom (and subsequently the cgroups hook) were aware of placement directives then it could be made to work as you propose.

I was just thinking about a use case where nodes could be shared between HPC workloads and other purposes (like enterprise workloads). In such cases, if admins set a limited number of resources for the HPC workload, then PBS will see that limited number, but with this change jobs could potentially consume more resources (if they take hosts exclusively).
Maybe I’m totally wrong, or overthinking a use case that does not exist in practice.

There is now a proposal to integrate most of the items discussed here into the cgroup hook.

https://openpbs.atlassian.net/wiki/spaces/PD/pages/2576613377/Improve+memory+swap+management+in+the+cgroup+hook

And there is a linked discussion.

Handling of hyperthreading was already addressed in earlier pull requests for the cgroup hook: you can choose whether to count each thread as a single “ncpus”, to count each core as a single “ncpus” (each job asking for 1 ncpus gets all threads assigned to its cpuset), or to constrain cpu threads used by the cgroup hook to the first thread only (leaving the others for OS processes but not jobs).
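As a rough illustration of those three choices (the real cgroup hook configuration keys are not reproduced here, and the exact semantics may differ between hook versions):

    # Illustrative only: how many hardware threads end up in a job's cpuset for
    # an ncpus request under each of the three counting modes described above.

    def cpuset_threads(ncpus, mode, threads_per_core=2):
        if mode == "thread_is_ncpu":
            # Each hardware thread counts as one ncpus.
            return ncpus, "any hardware thread"
        if mode == "core_is_ncpu":
            # Each core counts as one ncpus; the job gets all of its threads.
            return ncpus * threads_per_core, "whole cores, all their threads"
        if mode == "first_thread_only":
            # Only the first thread of each core is handed to jobs.
            return ncpus, "first thread of each core"
        raise ValueError(f"unknown mode: {mode}")

    # Example on a 16-core / 32-thread host:
    # cpuset_threads(4, "core_is_ncpu") -> (8, "whole cores, all their threads")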