Enable cgroup with Intel MPI mpirun on sister exec nodes

Hi, I’m currently working on a hook to bind job processes to cgroups on PBS 14.1.2. I know there is a cgroup hook in v18, so I studied it carefully and found that it uses execjob_attach. I therefore set up a test hook that writes out only the job id, to verify that the hook is executed on each sister exec node.
Unfortunately it fails: only the primary exec host writes the log. I tried adding the execjob_launch and execjob_prologue events to the hook, but that did not help.
I’m using Intel MPI, installed as a component of Intel Parallel Studio via its official installer, and mpirun -machinefile to run on multiple hosts. There are no special settings to integrate it with PBS, which I think is where the problem lies.
I wonder what the best practice is to achieve my goal. The last resort I’ve thought of is to grab vnode_list from the execjob_launch event, notify a custom daemon running on each compute node, and do the cgroup work in that daemon. Since this requires the extra effort of writing the daemon, I’d like to know whether there are better ways.
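For reference, registering such a test hook would look roughly like this (the hook name and script path below are placeholders, not a prescription):

qmgr -c "create hook cgroup"
qmgr -c "import hook cgroup application/x-python default /root/cgroup_hook.py"
qmgr -c "set hook cgroup event = 'execjob_attach,execjob_launch,execjob_prologue'"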

Would it be possible to share the purpose of using the cgroups hook here?

  1. Could you please disable the cgroups hook, test a sample PBS script such as the one below, and check whether it executes on the sister nodes.

#PBS -N pbs-intelmpi
#PBS -l select=2:ncpus=2:mpiprocs=2
#PBS -l place=scatter
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/bin/pbs_tmrsh
/opt/intel/impi/2018.3.222/bin64/mpirun -machinefile $PBS_NODEFILE -np 4 /bin/hostname

# $PBS_NODEFILE is created dynamically by PBS Professional; passing it via -machinefile is not strictly required, as recent versions of Intel MPI are integrated with PBS Pro.

2. Now enable the cgroup hook and test the same job script (a qmgr snippet for toggling the hook follows this list).

  3. Search for the keyword PBS_NODEFILE in the document below for more information:
    https://www.pbsworks.com/pdfs/PBS18.2_BigBook.pdf
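For steps 1 and 2, the hook can be toggled with qmgr, assuming it is registered under the name cgroup (check with qmgr -c "list hook"):

qmgr -c "set hook cgroup enabled = false"   # step 1: run the test job with the hook disabled
qmgr -c "set hook cgroup enabled = true"    # step 2: re-enable it and run the same job again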

Hi adarsh, I’m using cgroups to limit the memory usage of certain jobs.

The output of your job script was indeed something like

nodeX
nodeX
nodeY
nodeY

I’ve just confirmed it.

I know that the MPI integration is done by setting environment variables that make MPI (mpirun) use PBS’s rsh replacement. However, my goal is to prevent a user’s job from affecting other jobs running on the same node. As your script shows, a user can simply leave these variables out, or even explicitly unset them, to bypass the restriction.

So what I want is a solution that is totally “transparent” to the end users, so that the policy is actually enforced. I do know that mpirun starts the sister-node processes via rsh or ssh, so PBS can do little if mpirun never notifies the pbs_mom on those sister nodes. Is there any chance of setting a global bashrc (I can ensure users use bash as their default shell) and doing the magic at that stage?
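Concretely, the global-bashrc idea would be something like the following (just a sketch; the file path is an example, and as noted above a user could still unset these):

# e.g. /etc/bashrc or /etc/profile.d/pbs_impi.sh, sourced by every user's bash shell
if [ -n "$PBS_JOBID" ]; then
    export I_MPI_HYDRA_BOOTSTRAP=rsh
    export I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/bin/pbs_tmrsh
fi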

PS: thank you again for the clarification about PBS_NODEFILE. I’m aware of it, and it works well, so it is not related to the question I have in mind.

Note: it is not remote invocation of commands via SSH; it is tightly controlled and managed by PBS, and everything is accounted for with respect to cput, memory, etc.

Please check the figure below in the High-performance Computing (HPC) and Cloud Solutions | Altair documentation:
Figure 5-1: PBS knows about processes on vnodes 2 and 3, because pbs_tmrsh talks directly to pbs_mom, and pbs_mom starts the processes on vnodes 2 and 3

Please also check $enforce mem in the High-performance Computing (HPC) and Cloud Solutions | Altair documentation; this might satisfy your requirement.
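For context, $enforce mem is a pbs_mom configuration directive; enabling it would look roughly like this on each execution host (default file location assumed), followed by a HUP or restart of pbs_mom so it re-reads the file:

# append to PBS_HOME/mom_priv/config on every execution host
$enforce mem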

Also, if the cgroups hook is enabled, memory confinement based on the qsub request is handled by default; no additional configuration is needed for an Intel MPI job.

$enforce mem is set on all vnodes. But according to my tests, PBS’s poll-then-kill approach is not enough for my situation. It has been reported that a user’s program suddenly allocated a huge amount of RAM, sometimes causing all the other processes to die, even with pbs_mom monitoring memory usage. So I finally decided to use cgroups for this.

I noticed that the Admin Guide says pbs_mom only knows about processes on vnodes 2 and 3 if they are started via pbs_tmrsh, and I can’t find a way to force mpirun to use pbs_tmrsh rather than plain ssh.

PS: I’m on PBS v14.1.2, so the cgroup hook is not there. You could say I’m trying to “backport” it :slight_smile: .
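Roughly, the confinement I’m trying to backport would look like the sketch below (assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory; the directory layout and function name are made up for illustration):

import os

def confine_job(jobid, mem_limit_bytes, pids):
    # create a per-job memory cgroup, e.g. /sys/fs/cgroup/memory/pbsjobs/<jobid>
    cgdir = os.path.join("/sys/fs/cgroup/memory/pbsjobs", jobid)
    if not os.path.isdir(cgdir):
        os.makedirs(cgdir)
    # cap the memory the job may use on this host
    with open(os.path.join(cgdir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem_limit_bytes))
    # move each known job process into the cgroup (one pid per write)
    for pid in pids:
        with open(os.path.join(cgdir, "tasks"), "w") as f:
            f.write(str(pid))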

PPS: See https://pastebin.com/nQq0d24L for the result of the job script you posted before. Weird that the sister node didn’t run any hook code.
mpirun --version:

Intel(R) MPI Library for Linux* OS, Version 2019 Build 20180829 (id: 15f5d6c0c)
Copyright 2003-2018, Intel Corporation.

Thank you for the information.

The lines below in the PBS script do the job:
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/bin/pbs_tmrsh

Please read the PBS User Guide.

ssh node$i cat /tmp/pbs*  # please write the output to a file in a shared location accessible from all the nodes

Updated:

  • check the mom logs on the sister nodes, to see whether any mom hooks were executed

You could use an execjob_begin hook to set the environment variables; they would get propagated to the job’s environment. I’m not sure whether that would override the user’s settings in the script, though.
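A minimal sketch of what such a hook might look like (untested; it assumes Variable_List is writable from an execjob_begin hook on 14.x):

import pbs

e = pbs.event()
j = e.job
# inject the Intel MPI bootstrap settings into the job's environment
j.Variable_List["I_MPI_HYDRA_BOOTSTRAP"] = "rsh"
j.Variable_List["I_MPI_HYDRA_BOOTSTRAP_EXEC"] = "/opt/pbs/bin/pbs_tmrsh"
e.accept()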


I’ve checked it carefully, using the following Python code in the hook:

import subprocess
import pbs

e = pbs.event()
EVENT_NAMES = {pbs.EXECJOB_BEGIN: "EXECJOB_BEGIN", pbs.EXECJOB_PROLOGUE: "EXECJOB_PROLOGUE",
               pbs.EXECJOB_LAUNCH: "EXECJOB_LAUNCH", pbs.EXECJOB_ATTACH: "EXECJOB_ATTACH"}
hostname = subprocess.check_output('hostname')

f = open('/tmp/pbs-hook-cgroup.log', 'a')
try:
    head = "############start \n {}  jobid={}  hostname={}".format(EVENT_NAMES[e.type], e.job.id, hostname)
    pbs.logmsg(pbs.LOG_WARNING, head)
    f.write(head + "\n")
    f.flush()
...

There was no hook-related message in mom_logs, even after running qmgr -c "set hook cgroup debug = true", so I added the pbs.logmsg call as well as the file logging to /tmp. I used /tmp rather than a shared location because I want to know exactly on which host the hook executed.

The result remains the same: whether I set the I_MPI_* variables or not, the hook runs only on the primary vnode (vnode1). Only on the primary vnode do I get the head string in mom_logs and in /tmp/pbs-hook-cgroup.log.

mom_logs on vnode2 looks like this (I ran “echo > xxx” to clear the log before submitting the job):

11/30/2018 11:06:05;0100;pbs_mom;Req;;Type 85 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:05;0080;pbs_mom;Hook;cgroup.HK;copy hook-related file request received
11/30/2018 11:06:14;0008;pbs_mom;Job;33370.w003;JOIN_JOB as node 1
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;KILL_JOB received
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;kill_job
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;DELETE_JOB received
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;kill_job
11/30/2018 11:06:32;0008;pbs_mom;Job;33371.w003;JOIN_JOB as node 1
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;KILL_JOB received
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;kill_job
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;DELETE_JOB received
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;kill_job

while on the primary exec node the log is:

11/30/2018 11:06:05;0100;pbs_mom;Req;;Type 85 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:05;0080;pbs_mom;Hook;cgroup.HK;copy hook-related file request received
11/30/2018 11:06:14;0100;pbs_mom;Req;;Type 1 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:14;0100;pbs_mom;Req;;Type 3 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:14;0100;pbs_mom;Req;;Type 5 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:14;0008;pbs_mom;Job;33370.w003;nprocs:  355, cantstat:  1, nomem:  0, skipped:  0, cached:  0, max excluded PID:  0
11/30/2018 11:06:15;0006;pbs_python;Hook;pbs_python;############start 
 EXECJOB_LAUNCH  jobid=33370.w003  hostname=node81

11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;Started, pid = 126986
11/30/2018 11:06:15;0080;pbs_mom;Job;33370.w003;task 00000001 terminated
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;Terminated
11/30/2018 11:06:15;0100;pbs_mom;Job;33370.w003;task 00000001 cput= 0:00:00
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;kill_job
11/30/2018 11:06:15;0100;pbs_mom;Job;33370.w003;node81 cput= 0:00:00 mem=908kb
11/30/2018 11:06:15;0100;pbs_mom;Job;33370.w003;node82.localdomain cput= 0:00:00 mem=0kb
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;no active tasks
11/30/2018 11:06:15;0100;pbs_mom;Job;33370.w003;Obit sent
11/30/2018 11:06:15;0100;pbs_mom;Req;;Type 54 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:15;0080;pbs_mom;Job;33370.w003;copy file request received
11/30/2018 11:06:15;0100;pbs_mom;Job;33370.w003;staged 2 items out over 0:00:00
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;no active tasks
11/30/2018 11:06:15;0100;pbs_mom;Req;;Type 6 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:15;0080;pbs_mom;Job;33370.w003;delete job request received
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;kill_job
11/30/2018 11:06:15;0002;pbs_mom;Svr;restrict_user;killed uid 1000 pid 127017(systemd)
11/30/2018 11:06:32;0100;pbs_mom;Req;;Type 1 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:32;0100;pbs_mom;Req;;Type 3 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:32;0100;pbs_mom;Req;;Type 5 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:32;0008;pbs_mom;Job;33371.w003;nprocs:  356, cantstat:  1, nomem:  0, skipped:  0, cached:  0, max excluded PID:  0
11/30/2018 11:06:33;0006;pbs_python;Hook;pbs_python;############start 
 EXECJOB_LAUNCH  jobid=33371.w003  hostname=node81

11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;Started, pid = 127077
11/30/2018 11:06:33;0080;pbs_mom;Job;33371.w003;task 00000001 terminated
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;Terminated
11/30/2018 11:06:33;0100;pbs_mom;Job;33371.w003;task 00000001 cput= 0:00:00
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;kill_job
11/30/2018 11:06:33;0100;pbs_mom;Job;33371.w003;node81 cput= 0:00:00 mem=908kb
11/30/2018 11:06:33;0100;pbs_mom;Job;33371.w003;node82.localdomain cput= 0:00:00 mem=0kb
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;no active tasks
11/30/2018 11:06:33;0100;pbs_mom;Job;33371.w003;Obit sent
11/30/2018 11:06:33;0100;pbs_mom;Req;;Type 54 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:33;0080;pbs_mom;Job;33371.w003;copy file request received
11/30/2018 11:06:33;0100;pbs_mom;Job;33371.w003;staged 2 items out over 0:00:00
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;no active tasks
11/30/2018 11:06:33;0100;pbs_mom;Req;;Type 6 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:33;0080;pbs_mom;Job;33371.w003;delete job request received
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;kill_job

PS: There are no EXECJOB_PROLOGUE entries because of a typo in my hook script. But it was executed, since it shows up in /tmp/pbs*.log.


Thank you for the idea! Unfortunately, according to my tests, environment variables set in any hook (I tested execjob_begin, execjob_prologue, execjob_launch, and execjob_attach) can still be modified in the job script.

The idea would work for well-behaved users, but it is not enough from a security point of view.


Integrating Intel MPI is indeed usually done by changing the bootstrap loader, setting it to pbs_tmrsh. Remote processes are then spawned by MoM, so the cgroup hook’s execjob_launch event is automatically called to bind the processes to the correct cgroups.

If you’re worried about backdoors users might use: it’s possible to “wrap” ssh to call pbs_attach for anything in a PBS job. You can configure ssh to pass on PBS_JOBID if it exists, configure sshd to accept it, and then configure an sshrc that detects PBS_JOBID and calls pbs_attach to attach the session to the job (which also moves all processes in the session into the cgroup, at least with a recent cgroup hook).

If pbs_attach fails, you simply kill the session. It’s even possible to keep a list of users who may still log in and to kill the sessions of all other users unless they provide a valid PBS_JOBID for a job running on that node (otherwise pbs_attach will fail).
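A rough sketch of such an sshrc (untested; the pid to attach, the kill policy, and the pbs_attach path are assumptions, and the client needs SendEnv PBS_JOBID plus AcceptEnv PBS_JOBID on the sshd side for the variable to arrive):

# /etc/ssh/sshrc -- runs for each incoming ssh session, unless the user
# supplies a ~/.ssh/rc of their own (see the caveat below)
if [ -n "$PBS_JOBID" ]; then
    # attach this session to the running job; kill the session on failure
    /opt/pbs/bin/pbs_attach -j "$PBS_JOBID" -p "$PPID" || kill "$PPID"
fi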

Note that users can still bypass this by supplying their own ~/.ssh/rc, but then they are no longer merely being naive.

Some sites are even more paranoid and simply disallow ssh access completely for “normal” users. If Intel MPI tries to use ssh, it simply fails, so users MUST use pbs_tmrsh as the bootstrap loader. In other words, at these sites only MoM can spawn processes for normal users on execution nodes.
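For example, something along these lines in sshd_config restricts interactive ssh to an administrative group (the group name is a placeholder):

# /etc/ssh/sshd_config
AllowGroups pbsadmins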

Leaving ssh open is a risk in itself, after all. You’re worried about PBS Pro jobs consuming too much memory because they wouldn’t be in the correct cgroups, but if people can just log in using ssh and start a memory hog, they’re even more likely to cause trouble…

You are totally right. I’ve been looking for a way to block ssh while still allowing job submission and execution, but AFAIK it seems to be impossible.
That said, pbs_mom can kill unauthorized ssh access (at least asynchronously). I have to assume a user cannot allocate too much memory within pbs_mom’s poll interval.
It’s very interesting to learn that something like sshrc exists. I’ll check it later and see whether it meets my requirements.