Recently I got a NUMA system (32 sockets, but not an SGI UV). I noticed that an MPI job small enough to fit inside a single NUMA node gets spread across several NUMA nodes, even though some nodes are still empty. Performance suffers as a result. My questions are:
- lscpu and numastat both report 16 NUMA nodes. Does the scheduler take this into account?
- The PBSPro BigBook refers to Cpusets. Is that the same thing as cpuset (/dev/cpuset)?
- Should I set up 16 vnodes? If yes, how do I do that?
(Running CentOS 7.3, with PBS Pro rebuilt from the 14.1.0 srpm without modification)
The scheduler places jobs based on the PBS vnodes of the system. From your post it sounds like you only have one vnode. In that case the scheduler looks at the entire system as one large pool of resources instead of as smaller chunks of resources, so it has no idea where the NUMA boundaries are.
SGI systems provide a topology file that we create vnodes from, but you don’t have one, so you’ll have to create the vnodes yourself. Write a vnode definition file with one vnode per NUMA node (16 in your case) and load it into the MoM on that host.
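To make that concrete, here is a minimal sketch that generates such a vnode definition file. It assumes the version 2 MoM config file format (a `$configversion 2` header followed by `vnode: attribute = value` lines) and the usual convention of zeroing out the natural vnode so that jobs only land on the per-NUMA vnodes. The hostname `bignode`, the 16 CPUs per node and the 64 GB per node are made-up values; take the real ones from lscpu and numastat.

```python
def build_vnode_def(host, numa_nodes, cpus_per_node, mem_kb_per_node):
    """Return the text of a version 2 vnode definition file.

    All arguments are placeholders: use the values your system reports.
    """
    lines = ["$configversion 2"]
    # Zero the natural vnode so the scheduler only uses the per-NUMA vnodes.
    lines.append(f"{host}: resources_available.ncpus = 0")
    lines.append(f"{host}: resources_available.mem = 0")
    # One vnode per NUMA node, named host[0], host[1], ...
    for i in range(numa_nodes):
        vnode = f"{host}[{i}]"
        lines.append(f"{vnode}: resources_available.ncpus = {cpus_per_node}")
        lines.append(f"{vnode}: resources_available.mem = {mem_kb_per_node}kb")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical layout: 16 NUMA nodes, 16 CPUs and 64 GB each.
    print(build_vnode_def("bignode", 16, 16, 64 * 1024 * 1024), end="")
```

Save the output to a file and, as root on the execution host, load it with something like `pbs_mom -s insert numa_vnodes <file>`, then restart the MoM. The script name `numa_vnodes` is arbitrary. Check the result with `pbsnodes -av` before sending jobs to it.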
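You can go further if you want and create placement sets on top of the vnodes. A placement set groups vnodes that share a resource value, and the scheduler will try to fit a multi-vnode job inside one set, which keeps the job's chunks closer together.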
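As a rough sketch of the placement set setup, the snippet below prints the qmgr commands I would expect to need: create a host-level string_array resource, tag each vnode with a group value, then point `node_group_key` at that resource and turn on `node_group_enable`. The resource name `numagroup`, the hostname `bignode` and the grouping of 4 adjacent NUMA nodes per set are all assumptions for illustration; group by whatever actually reflects your interconnect. Please check the exact qmgr syntax against the admin guide for 14.1 before running anything.

```python
def placement_set_commands(host, numa_nodes, nodes_per_group, resource="numagroup"):
    """Return qmgr command lines that tag vnodes and enable placement sets.

    'resource' is a hypothetical custom resource name, not a built-in one.
    """
    cmds = [f'qmgr -c "create resource {resource} type=string_array,flag=h"']
    for i in range(numa_nodes):
        # Adjacent NUMA nodes share a group value, e.g. nodes 0-3 -> g0.
        group = i // nodes_per_group
        cmds.append(
            f'qmgr -c "set node {host}[{i}] '
            f'resources_available.{resource} = g{group}"'
        )
    cmds.append(f'qmgr -c "set server node_group_key = {resource}"')
    cmds.append('qmgr -c "set server node_group_enable = True"')
    return cmds


if __name__ == "__main__":
    # Hypothetical: 16 vnodes grouped 4 at a time into 4 placement sets.
    for cmd in placement_set_commands("bignode", 16, 4):
        print(cmd)
```

The script only prints the commands, so you can review them first and then run them by hand as a PBS manager.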
I hope this helps.