Design changes for PTL on cpuset mom. (Removal of cpuset skip decorator)

Hi,

I have created a design document for the PTL changes required for cpuset mom in order to remove the use of the cpuset skip decorator. Please review and suggest any changes that are required.

https://openpbs.atlassian.net/wiki/spaces/PD/pages/2002812929/Design+changes+for+PTL+on+cpuset+mom.+Removal+of+cpuset+skip+decorator

Thanks & Regards,
Vishesh

Hey Vishesh,
I like the change. I’ve hoped we would do this for quite a while.

I have two comments.
I have a doubt. Point 2 says you’ll skip the test if ncpus/mem are higher than their real amounts. Points 3-5 say you’ll ignore changes to ncpus/mem. Maybe I am misunderstanding.

My other comment is about the naming of vnodes. I think for the most part tests don’t care what the vnodes are called, so your change is fine. They supply a name because it is a mandatory field. There are some tests that do need to know what the nodes are named. Maybe we can get around that by having create_vnodes() return a list of vnode names. It currently returns True/False, and a populated list or an empty list behaves the same way in a simple truth test (if X:).
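
Something like this is what I have in mind (just a sketch with made-up internals, not the real create_vnodes() signature):

```python
# Hypothetical sketch: create_vnodes() returns the list of vnode names it
# created instead of True/False. An empty list means failure, so existing
# callers that only do a truth test keep working unchanged.
def create_vnodes(self, num=1, prefix=None, **kwargs):
    prefix = prefix or self.mom.shortname   # default to the mom short hostname
    names = []
    for i in range(num):
        name = prefix if num == 1 else '%s[%d]' % (prefix, i)
        # ... actually create and configure the vnode here ...
        names.append(name)
    return names

# Existing callers:    if self.server.create_vnodes(...):   # still truthy/falsy
# Name-aware callers:  vnames = self.server.create_vnodes(...)
```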

Bhroam

Looks good so far, Vishesh, and in line with what we discussed. For the following, I think you’ll be using the mom short hostname for the vnode name:

I propose to make the mom hostname the default vnode name rather than using vn or vnode. Make changes to all the tests which use any other vnode name.

@visheshh this is helping to move things in the right direction, thanks for doing this.
I have a few more comments in addition to what we previously discussed.

For #2 I think you should also mention mem (…“more ncpus/mem on a node”…)

  1. In a cluster, if there is a cpuset mom and a non-cpuset mom and multinode jobs are submitted, then skip the test case.

This will cause more tests to be skipped than are skipped currently. Many tests run well across cpuset and non-cpuset moms. I don’t see a benefit to #7.

Hi @bhroam,

I have a doubt. Point 2 says you’ll skip the test if ncpus/mem are higher than their real amounts. Points 3-5 say you’ll ignore changes to ncpus/mem. Maybe I am misunderstanding.

Vishesh -> If the number of requested ncpus is more than the ncpus available on the nodes, then there is no way the test will run, so we will skip the test case.

If the number of requested ncpus is less than the ncpus available on the nodes, then we will ignore the request and try to run the test, as changing ncpus is not supported on a cpuset mom.

Thanks for the suggestion on returning a list of vnode names. Will keep that in mind.

Hi @bayucan, yes, I will use mom.shortname for the vnode name.

Hi @lisa-altair, when I tried to run a multinode job across a cpuset mom and a non-cpuset mom, the job actually hung, so I believed it is not a supported configuration for running the test. Maybe I need to get that doubt clarified. I want to know if it is a supported way of running jobs?

Hey Vishesh,
I see what you are doing, but I have to disagree. If you don’t set the amount of ncpus or mem, you will cause test failures. A test sets the amount of cpus it expects. If you leave it higher than what the test expects, the test will likely fail. Think of the case where I set an 8-cpu node to 4 cpus. I submit six 1-cpu jobs and expect 2 of them to be queued. By leaving all 8 cpus available, all 6 jobs will run and the test will fail.
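
In PTL terms, the kind of test I’m describing looks roughly like this (a from-memory sketch; the exact attributes aren’t the point):

```python
# Cap an 8-cpu node at 4 cpus, then submit six 1-cpu jobs.
# The vnode is assumed to be named after the mom's short hostname.
self.server.manager(MGR_CMD_SET, NODE,
                    {'resources_available.ncpus': 4},
                    id=self.mom.shortname)

for _ in range(6):
    j = Job(TEST_USER, attrs={'Resource_List.select': '1:ncpus=1'})
    self.server.submit(j)

# The test expects 4 running and 2 queued. If the ncpus=4 setting is
# ignored and all 8 cpus stay available, all 6 jobs run and this fails.
self.server.expect(JOB, {'job_state=R': 4}, count=True)
self.server.expect(JOB, {'job_state=Q': 2}, count=True)
```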

I don’t think there is any problem lowering the amount of cpus/mem on a node. The problem with the cgroups hook is when it receives a request to run a job for which it doesn’t have the resources (i.e., the number of cpus was raised). If there are fewer cpus, then the cgroups hook just receives requests for less than the total amount of cpus. Think of it this way: if I send 6 of 8 cpus’ worth of work to the cgroups hook, the cgroups hook can’t tell whether the scheduler is not sending more work because it thinks there are only 6 cpus, or because there are only 6 cpus’ worth of work to run.

Bhroam

Hi @visheshh, many of our existing multi-mom tests (where the number of moms needed is > 1) are run on a mix of cpuset and non-cpuset systems. As I understand it, it is a supported way of running jobs.

Hi @bhroam,

On a cpuset mom, we cannot modify the number of ncpus on a node.
If we modify the number of ncpus/mem on the node, the cgroups hook will instantly turn it back to the original number, since those values were initially set by the cgroups hook. Other than ncpus and mem, we can mostly change the rest of the attributes on the nodes.

I understand your concern that tests will fail if we do not decrease the values. That is the reason I had listed caveat 2, that we need to review and fix the tests on a case-by-case basis.

Hey @visheshh,
I don’t think you will achieve what you want with this change. If you ignore setting ncpus/mem on a cpuset mom, you will get tests which work when people originally write and test them but fail the first time they are run on a cpuset mom. This will cause tests to get merged and then cause a headache for QA. We’ll be back in the same situation where we have tons of tests which are being skipped, but it will take time out of QA’s life to mark them as such.

I see two ways you can go with this. The first is a change to the cgroups hook to allow the number of cpus to be changed via qmgr. You can then change create_vnodes() to create the correct number of vnodes, and then use qmgr to set ncpus/mem properly. This of course only applies if the value is being set to <= the number of cpus/mem on the vnode. We usually trust the admin to be right, so this might be the best way to go.

The other way is to skip any test which does not map ncpus/mem to the exact amount on the cpuset vnodes. This will basically lead us to skipping all the tests we skip today, but won’t require a decorator to do it.

Bhroam

Hi @bayucan,
Would it be possible for us to modify the cgroups hook to not revert ncpus and mem after they are set from qmgr, for the purpose of running our tests from PTL?

Thanks & Regards,
Vishesh

Hi @bayucan,
Would it be possible for us to modify the cgroups hook to not revert ncpus and mem after they are set from qmgr, for the purpose of running our tests from PTL?

I don’t think so. The cgroups hook internally figures out what’s available in resources and reports it as such. This is done by the exechost_startup and exechost_periodic hook events of the pbs_cgroups hook. Turning off these events could lead to unexpected outcomes.

Hey @bayucan
This is unfortunate. In the past we’ve always trusted the admin. If they had a reason to change a resource via qmgr, we trusted that value. It might have caused problems on the mom side, but it was the admin who shot themselves in the foot. With the cgroups hook, we always correct the values back to the real values. That makes sure the admin doesn’t shoot themselves in the foot, but it also breaks consistency with what we’ve done in the past.

@visheshh since this is the case, I see no other solution than to skip the test.

Bhroam

PBS and cgroups still allow the admin this power. However, for PTL testing purposes, the test will error out (equivalent to shooting oneself in the foot) because qmgr changed the value of ncpus/mem out from under what the hook knows about, so the hook will raise an exception and the test will fail. Thus these tests currently get @skipOnCpuSet decorators. I think the goal is to try to run more of the tests that currently have the @skipOnCpuSet decorator by making create_vnodes smarter.

@lisa-altair I’m not talking about allowing PTL to shoot itself in the foot. Vishesh already has it in his design that the test will be skipped if ncpus or mem is increased. That would be bad, and indeed the test should be skipped. I’m talking about the case when ncpus or mem is decreased. In that case, the cgroups hook can be blissfully ignorant of why it never gets the full amount. The cgroups hook can’t tell the difference between the scheduler only assigning some of the node’s cpus and there not being enough jobs to use all the cpus. We should be able to use qmgr to decrease the number of cpus/mem. From what Al said, though, the cgroups hook will always self-correct.

Bhroam

Hi @bayucan @lisa-altair @bhroam,

I think we are all on the same page that we should skip the test cases when create_vnodes()/manager() is called to create nodes or set ncpus and the requested amount of ncpus does not match the amount of ncpus available on the node.
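
Roughly, the check I am thinking of would look like the following (purely illustrative; the real placement and names, e.g. is_cpuset_mom(), may end up different):

```python
# Illustrative sketch: before creating vnodes / setting ncpus on a cpuset mom,
# compare the requested ncpus with what the node actually reports and skip
# the test if they do not match.
requested = int(attrib.get('resources_available.ncpus', 0))
node = self.server.status(NODE, id=mom.shortname)[0]
available = int(node['resources_available.ncpus'])
if mom.is_cpuset_mom() and requested and requested != available:
    self.skipTest('requested ncpus (%d) does not match ncpus (%d) '
                  'available on cpuset node' % (requested, available))
```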

I had a question though: if the number of ncpus requested to be set is 1, i.e. the test only wants to run a single job/reservation, can we set the node’s sharing attribute to ignore_excl to mimic the behavior?

Thanks,
Vishesh

I had a question though: if the number of ncpus requested to be set is 1, i.e. the test only wants to run a single job/reservation, can we set the node’s sharing attribute to ignore_excl to mimic the behavior?

I think it’s fine.

Hi, @bhroam, @bayucan, @lisa-altair,

I have updated the EDD as per our discussion and will start implementation of the same.

Thanks,
Vishesh