qsub -l select=2:ncpus=2:mpiprocs=2:ngpus=1 -l place=scatter -I   # the last -I is a capital I (as in "ice cream"); press return and, if the job runs, you will get a console on the remote node
echo $PBS_NODEFILE
cat $PBS_NODEFILE   # you can copy the contents of this file to another file and use it for the horovod framework
qsub -l select=1:ncpus=1:ngpus=1+1:ncpus=1:ngpus=2 -I   # same as above
echo $PBS_NODEFILE
cat $PBS_NODEFILE   # you can play with the contents and create the format required by the horovod framework, as sketched below
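For example, here is a minimal sketch of turning the nodefile into the host:slots list that horovodrun expects (train.py is just a placeholder for your own training script):

# count how many times each host appears in the nodefile and build host:slots pairs
HOSTLIST=$(sort "$PBS_NODEFILE" | uniq -c | awk '{print $2":"$1}' | paste -sd, -)
# launch one process per nodefile entry across the allocated hosts
horovodrun -np $(wc -l < "$PBS_NODEFILE") -H "$HOSTLIST" python train.py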
Hope this helps; otherwise, please share your script.
Hi, and thank you for the answer!
It is very useful, but I already know about the $PBS_NODEFILE variable. That variable contains only the node names.
I am looking for a way to explicitly get the list of allocated GPUs.
Please download the PBSPro 2020.1 guides (since they contain the most recent documentation of the cgroup hook) and the cgroup hook from the OpenPBS GitHub repository. That allows you to discover which sockets the GPUs are on, make vnodes for each socket (by enabling vnode_per_numa_node), and publish how many GPUs there are; the hook will assign GPUs and will also populate CUDA_VISIBLE_DEVICES for each process (even differently for different hosts) for processes spawned via the Task Management API (i.e. spawned by MoM). Other processes can read a ".env" file in the same directory as the nodefile to see the setting.
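For processes that are not spawned through MoM, a minimal sketch of picking up that setting could look like this (it assumes the .env file contains plain KEY=value lines, which is an assumption about the format):

ENVFILE="$(dirname "$PBS_NODEFILE")/.env"   # the hook writes this next to the nodefile
if [ -f "$ENVFILE" ]; then
    set -a          # export everything the file defines
    . "$ENVFILE"
    set +a
fi
echo "CUDA_VISIBLE_DEVICES for this job: $CUDA_VISIBLE_DEVICES"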
It is even possible to ensure device isolation, i.e. that any process attached to the job's cgroups only sees the relevant GPUs (even nvidia-smi will not see the "wrong" GPUs). It's a bit tricky to set up, because you have to list all the other devices the job may also need that are NOT the GPUs.
Note: there's a typo in the documentation; vntype files (should you use them to do different things on different node types) are not in $PBS_HOME but in $PBS_HOME/mom_priv.
The current hook in "master" was, IIRC, meant to be backward compatible (even though it only officially became part of PBSPro OSS in 18.x).
In its current state it should be compatible with older Python 2.x versions, except for a few constructs such as the syntax for naming exceptions (which is compatible only with Python 2.7 and later).
Even if you build against a Python 2.5-based PBSPro version, fixing these will be easier than fixing all the corner cases in older hooks.
Why would you want to do that? If it is "job busy", that means there are no free CPUs local to the vnode that the GPU is on, which makes it impossible to run a GPU job at a decent speed.