Enable cgroup with Intel MPI mpirun on sister exec nodes

Hi, I’m currently working on a hook to bind job processes to cgroups on PBS 14.1.2. I know there is a cgroup hook in v18, so I studied it carefully and found that it uses execjob_attach. I therefore set up a test hook that writes out only the job id, to verify that the hook is executed on each sister exec node.
Unfortunately it fails: only the primary exec host writes the log. I tried adding the execjob_launch and execjob_prologue events to the hook, but that did not help.
I’m using Intel MPI, installed as a component of Intel Parallel Studio via its official installer, and mpirun -machinefile to run on multiple hosts. There are no special settings to integrate it with PBS, which I think is where the problem lies.
I wonder what the best practice is to achieve my goal. The last resort I’ve thought of is to grab vnode_list from the execjob_launch event, notify a custom daemon running on each compute node, and do the cgroup work in that daemon. Since this requires the extra effort of writing the daemon, I’d like to know whether there are better ways.
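For reference, registering such a test hook would look roughly like this (the hook name and script path below are placeholders, not a prescription):

qmgr -c "create hook cgroup"
qmgr -c "import hook cgroup application/x-python default /root/cgroup_hook.py"
qmgr -c "set hook cgroup event = 'execjob_attach,execjob_launch,execjob_prologue'"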

Would it be possible to share the purpose of using the cgroups hook here?

  1. Could you please disable the cgroups hook, test a sample PBS script such as the one below, and check whether it executes on the sister nodes.

#PBS -N pbs-intelmpi
#PBS -l select=2:ncpus=2:mpiprocs=2
#PBS -l place=scatter
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/bin/pbs_tmrsh
/opt/intel/impi/2018.3.222/bin64/mpirun -machinefile $PBS_NODEFILE -np 4 /bin/hostname

# $PBS_NODEFILE is created dynamically by PBS Professional; passing it via -machinefile is not strictly required, as recent versions of Intel MPI are integrated with PBS Pro.

2. Now enable the cgroup hook and test the same job script (a qmgr snippet for toggling the hook follows this list).

  3. Search for the keyword PBS_NODEFILE in the document below for more information:
    https://www.pbsworks.com/pdfs/PBS18.2_BigBook.pdf
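For steps 1 and 2, the hook can be toggled with qmgr, assuming it is registered under the name cgroup (check with qmgr -c "list hook"):

qmgr -c "set hook cgroup enabled = false"   # step 1: run the test job with the hook disabled
qmgr -c "set hook cgroup enabled = true"    # step 2: re-enable it and run the same job again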

Hi adarsh, I’m using cgroups to limit the memory usage of certain jobs.

The output of your job script was indeed something like

nodeX
nodeX
nodeY
nodeY

I’ve just confirmed it.

I know that the MPI integration is done by setting environment variables that make MPI (mpirun) use PBS’s rsh replacement. However, my goal is to prevent a user’s job from affecting other jobs running on the same node. As your script shows, a user can simply leave these variables out, or even explicitly unset them, to bypass the restriction.

So what I want is a solution that is totally “transparent” to the end users, so that the policy is actually enforced. I do know that mpirun starts the sister-node processes via rsh or ssh, so PBS can do little if mpirun never notifies the pbs_mom on those sister nodes. Is there any chance of setting a global bashrc (I can ensure users use bash as their default shell) and doing the magic at that stage?
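Concretely, the global-bashrc idea would be something like the following (just a sketch; the file path is an example, and as noted above a user could still unset these):

# e.g. /etc/bashrc or /etc/profile.d/pbs_impi.sh, sourced by every user's bash shell
if [ -n "$PBS_JOBID" ]; then
    export I_MPI_HYDRA_BOOTSTRAP=rsh
    export I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/bin/pbs_tmrsh
fi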

PS: thank you again for the clarification about PBS_NODEFILE. I’m aware of it, and it works well, so it is not related to the question I have in mind.

Note: it is not remote invocation of commands via SSH; it is tightly controlled and managed by PBS, and everything is accounted for with respect to cput, memory, etc.

Please check the figure below in the High-performance Computing (HPC) and Cloud Solutions | Altair documentation:
Figure 5-1: PBS knows about processes on vnodes 2 and 3, because pbs_tmrsh talks directly to pbs_mom, and pbs_mom starts the processes on vnodes 2 and 3

Please also check $enforce mem in the High-performance Computing (HPC) and Cloud Solutions | Altair documentation; this might satisfy your requirement.
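For context, $enforce mem is a pbs_mom configuration directive; enabling it would look roughly like this on each execution host (default file location assumed), followed by a HUP or restart of pbs_mom so it re-reads the file:

# append to PBS_HOME/mom_priv/config on every execution host
$enforce mem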

Also, if the cgroups hook is enabled, memory confinement based on the qsub request is handled by default; no additional configuration is needed for an Intel MPI job.

$enforce mem is set on all vnodes. But according to my tests, PBS’s poll-then-kill approach is not enough for my situation. It has been reported that a user’s program suddenly allocated a huge amount of RAM, sometimes causing all the other processes to die, even with pbs_mom monitoring memory usage. So I finally decided to use cgroups for this.

I noticed that the Admin Guide says pbs_mom only knows about processes on vnodes 2 and 3 if they are started via pbs_tmrsh, and I can’t find a way to force mpirun to use pbs_tmrsh rather than plain ssh.

PS: I’m on PBS v14.1.2, so the cgroup hook is not there. You could say I’m trying to “backport” it :slight_smile: .
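Roughly, the confinement I’m trying to backport would look like the sketch below (assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory; the directory layout and function name are made up for illustration):

import os

def confine_job(jobid, mem_limit_bytes, pids):
    # create a per-job memory cgroup, e.g. /sys/fs/cgroup/memory/pbsjobs/<jobid>
    cgdir = os.path.join("/sys/fs/cgroup/memory/pbsjobs", jobid)
    if not os.path.isdir(cgdir):
        os.makedirs(cgdir)
    # cap the memory the job may use on this host
    with open(os.path.join(cgdir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem_limit_bytes))
    # move each known job process into the cgroup (one pid per write)
    for pid in pids:
        with open(os.path.join(cgdir, "tasks"), "w") as f:
            f.write(str(pid))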

PPS: See https://pastebin.com/nQq0d24L for the result of the job script you posted before. Weird that the sister node didn’t run any hook code.
mpirun --version:

Intel(R) MPI Library for Linux* OS, Version 2019 Build 20180829 (id: 15f5d6c0c)
Copyright 2003-2018, Intel Corporation.

Thank you for the information.

The lines below in the PBS script do the job:
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/bin/pbs_tmrsh

Please read the PBS User Guide.

ssh node$i cat /tmp/pbs*  # please write the output to a file in a shared location accessible from all the nodes

Updated:

  • check the mom logs on the sister nodes, to see whether any mom hooks were executed

You could use an execjob_begin hook to set the environment variables; they would get propagated to the job’s environment. I’m not sure whether that would override the user’s settings in the script, though.
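A minimal sketch of what such a hook might look like (untested; it assumes Variable_List is writable from an execjob_begin hook on 14.x):

import pbs

e = pbs.event()
j = e.job
# inject the Intel MPI bootstrap settings into the job's environment
j.Variable_List["I_MPI_HYDRA_BOOTSTRAP"] = "rsh"
j.Variable_List["I_MPI_HYDRA_BOOTSTRAP_EXEC"] = "/opt/pbs/bin/pbs_tmrsh"
e.accept()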


I’ve checked it carefully, using the following Python code in the hook:

import subprocess
import pbs

e = pbs.event()
EVENT_NAMES = {pbs.EXECJOB_BEGIN: "EXECJOB_BEGIN", pbs.EXECJOB_PROLOGUE: "EXECJOB_PROLOGUE",
               pbs.EXECJOB_LAUNCH: "EXECJOB_LAUNCH", pbs.EXECJOB_ATTACH: "EXECJOB_ATTACH"}
hostname = subprocess.check_output('hostname')

f = open('/tmp/pbs-hook-cgroup.log', 'a')
try:
    head = "############start \n {}  jobid={}  hostname={}".format(EVENT_NAMES[e.type], e.job.id, hostname)
    pbs.logmsg(pbs.LOG_WARNING, head)
    f.write(head + "\n")
    f.flush()
...

There was no hook-related message in mom_logs, even after running qmgr -c "set hook cgroup debug = true", so I added the pbs.logmsg call as well as the file logging to /tmp. I used /tmp rather than a shared location because I want to know exactly on which host the hook executed.

The result remains the same: whether I set the I_MPI_* variables or not, the hook runs only on the primary vnode (vnode1). Only on the primary vnode do I get the head string in mom_logs and in /tmp/pbs-hook-cgroup.log.

mom_logs on vnode2 looks like this (I ran “echo > xxx” to clear the log before submitting the job):

11/30/2018 11:06:05;0100;pbs_mom;Req;;Type 85 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:05;0080;pbs_mom;Hook;cgroup.HK;copy hook-related file request received
11/30/2018 11:06:14;0008;pbs_mom;Job;33370.w003;JOIN_JOB as node 1
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;KILL_JOB received
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;kill_job
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;DELETE_JOB received
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;kill_job
11/30/2018 11:06:32;0008;pbs_mom;Job;33371.w003;JOIN_JOB as node 1
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;KILL_JOB received
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;kill_job
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;DELETE_JOB received
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;kill_job

while on the primary exec node the log is:

11/30/2018 11:06:05;0100;pbs_mom;Req;;Type 85 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:05;0080;pbs_mom;Hook;cgroup.HK;copy hook-related file request received
11/30/2018 11:06:14;0100;pbs_mom;Req;;Type 1 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:14;0100;pbs_mom;Req;;Type 3 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:14;0100;pbs_mom;Req;;Type 5 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:14;0008;pbs_mom;Job;33370.w003;nprocs:  355, cantstat:  1, nomem:  0, skipped:  0, cached:  0, max excluded PID:  0
11/30/2018 11:06:15;0006;pbs_python;Hook;pbs_python;############start 
 EXECJOB_LAUNCH  jobid=33370.w003  hostname=node81

11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;Started, pid = 126986
11/30/2018 11:06:15;0080;pbs_mom;Job;33370.w003;task 00000001 terminated
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;Terminated
11/30/2018 11:06:15;0100;pbs_mom;Job;33370.w003;task 00000001 cput= 0:00:00
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;kill_job
11/30/2018 11:06:15;0100;pbs_mom;Job;33370.w003;node81 cput= 0:00:00 mem=908kb
11/30/2018 11:06:15;0100;pbs_mom;Job;33370.w003;node82.localdomain cput= 0:00:00 mem=0kb
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;no active tasks
11/30/2018 11:06:15;0100;pbs_mom;Job;33370.w003;Obit sent
11/30/2018 11:06:15;0100;pbs_mom;Req;;Type 54 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:15;0080;pbs_mom;Job;33370.w003;copy file request received
11/30/2018 11:06:15;0100;pbs_mom;Job;33370.w003;staged 2 items out over 0:00:00
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;no active tasks
11/30/2018 11:06:15;0100;pbs_mom;Req;;Type 6 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:15;0080;pbs_mom;Job;33370.w003;delete job request received
11/30/2018 11:06:15;0008;pbs_mom;Job;33370.w003;kill_job
11/30/2018 11:06:15;0002;pbs_mom;Svr;restrict_user;killed uid 1000 pid 127017(systemd)
11/30/2018 11:06:32;0100;pbs_mom;Req;;Type 1 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:32;0100;pbs_mom;Req;;Type 3 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:32;0100;pbs_mom;Req;;Type 5 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:32;0008;pbs_mom;Job;33371.w003;nprocs:  356, cantstat:  1, nomem:  0, skipped:  0, cached:  0, max excluded PID:  0
11/30/2018 11:06:33;0006;pbs_python;Hook;pbs_python;############start 
 EXECJOB_LAUNCH  jobid=33371.w003  hostname=node81

11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;Started, pid = 127077
11/30/2018 11:06:33;0080;pbs_mom;Job;33371.w003;task 00000001 terminated
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;Terminated
11/30/2018 11:06:33;0100;pbs_mom;Job;33371.w003;task 00000001 cput= 0:00:00
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;kill_job
11/30/2018 11:06:33;0100;pbs_mom;Job;33371.w003;node81 cput= 0:00:00 mem=908kb
11/30/2018 11:06:33;0100;pbs_mom;Job;33371.w003;node82.localdomain cput= 0:00:00 mem=0kb
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;no active tasks
11/30/2018 11:06:33;0100;pbs_mom;Job;33371.w003;Obit sent
11/30/2018 11:06:33;0100;pbs_mom;Req;;Type 54 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:33;0080;pbs_mom;Job;33371.w003;copy file request received
11/30/2018 11:06:33;0100;pbs_mom;Job;33371.w003;staged 2 items out over 0:00:00
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;no active tasks
11/30/2018 11:06:33;0100;pbs_mom;Req;;Type 6 request received from root@10.10.10.85:15001, sock=1
11/30/2018 11:06:33;0080;pbs_mom;Job;33371.w003;delete job request received
11/30/2018 11:06:33;0008;pbs_mom;Job;33371.w003;kill_job

PS: There are no EXECJOB_PROLOGUE entries because of a typo in my hook script. But it was executed, since it shows up in /tmp/pbs*.log.


Thank you for the idea! Unfortunately, according to my tests, environment variables set in any hook (I tested execjob_begin, execjob_prologue, execjob_launch, and execjob_attach) can still be modified in the job script.

The idea would work for well-behaved users, but it is not enough from a security point of view.


Integrating Intel MPI is indeed usually done by changing the bootstrap loader, setting it to pbs_tmrsh. Remote processes are then spawned by MoM, so the cgroup hook’s execjob_launch event is automatically called to bind the processes to the correct cgroups.

If you’re worried about backdoors users might use: it’s possible to “wrap” ssh to call pbs_attach for anything in a PBS job. You can configure ssh to pass on PBS_JOBID if it exists, configure sshd to accept it, and then configure an sshrc that detects PBS_JOBID and calls pbs_attach to attach the session to the job (which also moves all processes in the session into the cgroup, at least with a recent cgroup hook).

If pbs_attach fails, you simply kill the session. It’s even possible to keep a list of users who may still log in and to kill the sessions of all other users unless they provide a valid PBS_JOBID for a job running on that node (otherwise pbs_attach will fail).
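A rough sketch of such an sshrc (untested; the pid to attach, the kill policy, and the pbs_attach path are assumptions, and the client needs SendEnv PBS_JOBID plus AcceptEnv PBS_JOBID on the sshd side for the variable to arrive):

# /etc/ssh/sshrc -- runs for each incoming ssh session, unless the user
# supplies a ~/.ssh/rc of their own (see the caveat below)
if [ -n "$PBS_JOBID" ]; then
    # attach this session to the running job; kill the session on failure
    /opt/pbs/bin/pbs_attach -j "$PBS_JOBID" -p "$PPID" || kill "$PPID"
fi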

Note that users can still bypass this by supplying their own ~/.ssh/rc, but then they are no longer merely being naive.

Some sites are even more paranoid and simply disallow ssh access completely for “normal” users. If Intel MPI tries to use ssh, it simply fails, so users MUST use pbs_tmrsh as the bootstrap loader. In other words, at these sites only MoM can spawn processes for normal users on execution nodes.
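For example, something along these lines in sshd_config restricts interactive ssh to an administrative group (the group name is a placeholder):

# /etc/ssh/sshd_config
AllowGroups pbsadmins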

Leaving ssh open is a risk in itself, after all. You’re worried about PBS Pro jobs consuming too much memory because they wouldn’t be in the correct cgroups, but if people can just log in using ssh and start a memory hog, they’re even more likely to cause trouble…

You are totally right. I’ve been looking for a way to block ssh while still allowing job submission and execution, but AFAIK it seems to be impossible.
That said, pbs_mom can kill unauthorized ssh access (at least asynchronously). I have to assume a user cannot allocate too much memory within pbs_mom’s poll interval.
It’s very interesting to learn that something like sshrc exists. I’ll check it later and see whether it meets my requirements.