I’m experimenting with placement sets, using 16 VMs to simulate a small 3-D hypercube environment:
Vertex 1 (switch 1) - nodes n101 & n102
Vertex 2 (switch 2) - nodes n201 & n202
Vertex 3 (switch 3) - nodes n301 & n302
Vertex 4 (switch 4) - nodes n401 & n402
Vertex 5 (switch 5) - nodes n501 & n502
Vertex 6 (switch 6) - nodes n601 & n602
Vertex 7 (switch 7) - nodes n701 & n702
Vertex 8 (switch 8) - nodes n801 & n802
The results so far are pleasing, with jobs being placed on nodes as expected (minimising the number of ‘hops’ between nodes assigned to the same job).
The thing is, although it’s working fine, I’m not sure that I’ve gone about configuring PBS the best way. The resulting config is quite large/complex for a simple, 16-node cluster. If I used the same methodology for a large cluster, it’d be far too complex to manage. I’m probably over-thinking it by trying to cover every (optimal) node combination.
I’d really appreciate some guidance from anyone who has experience with a similar (real-world) configuration.
If I understand your use case correctly, I think you should only need to create one host-level resource, say “switch”, then set resources_available.switch on each node to its respective switch value, say “s1”, “s2”, … “s8”, and set node_group_key to “switch”. This should make the scheduler group the nodes by switch value and place jobs on nodes that share the same switch value.
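For reference, a minimal sketch of what that could look like via qmgr, using the node names above (the resource name “switch” and the values “s1”…“s8” are just placeholders):

```
# Host-level custom resource to use as the placement-set key
qmgr -c "create resource switch type=string_array, flag=h"

# Tag each node with the switch it hangs off (only the first two switches shown)
qmgr -c "set node n101 resources_available.switch = s1"
qmgr -c "set node n102 resources_available.switch = s1"
qmgr -c "set node n201 resources_available.switch = s2"
qmgr -c "set node n202 resources_available.switch = s2"

# Turn on placement sets and group nodes by their switch value
qmgr -c "set server node_group_enable = true"
qmgr -c "set server node_group_key = switch"
```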
Thanks for the clarification, I think I was mistaken: I thought that you didn’t want jobs to span across switches. Yeah, I can’t think of a simpler way of doing this; I’ll let those more knowledgeable comment.
For a larger system, you need to be less precise. If you provide all the placement sets for perfect placement, the scheduler will get bogged down and will be too slow. While the exact hypercube is nice, it probably doesn’t make much difference to have perfect placement. Slightly imperfect placement is probably good enough.
Basically strip out some of the permutations of the larger sets.
…and it worked fine for jobs with 1-2 nodes. Jobs with >2 nodes remained in the queue (due to “set sched do_not_span_psets = true” and/or “set sched only_explicit_psets = true”). If I changed either of these to “false”, the job ran, but not on the desired nodes.
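For reference, the two scheduler settings I’m toggling are:

```
qmgr -c "set sched do_not_span_psets = true"
qmgr -c "set sched only_explicit_psets = true"
```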
When do_not_span_psets is true, a job must fit within a single placement set. Since the largest placement set you have set up is 2 nodes, the largest job can only request 2 nodes. You need to create larger placement sets for larger jobs:
Set up 2-node placement sets for nodes 0 hops away
Set up larger placement sets for nodes 1 hop away
etc.
This way larger jobs will use the larger placement sets.
These larger placement sets are the ones I was talking about before, where you do not create every permutation. You need to create all of the smaller ones, and fewer and fewer of the larger ones.
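To make the layering concrete, here is a rough sketch building on the “switch” resource suggested above (the group name “g12” for two switches that are one hop apart is made up for illustration). Because the resource is a string_array, each node can belong to its own per-switch set and to a larger group at the same time:

```
# Each node carries its own switch value plus a larger-group value
qmgr -c "set node n101 resources_available.switch = 's1,g12'"
qmgr -c "set node n102 resources_available.switch = 's1,g12'"
qmgr -c "set node n201 resources_available.switch = 's2,g12'"
qmgr -c "set node n202 resources_available.switch = 's2,g12'"

# The scheduler builds one placement set per distinct value (s1, s2, g12, ...),
# so a 4-node job can fit inside g12 while 2-node jobs still fit inside s1 or s2.
qmgr -c "set server node_group_key = switch"
```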
FWIW, we have a 12-D InfiniBand hypercube system with 9 nodes at each vertex (SGI ICE-X). In our experience with real codes, the number of hops between nodes has less effect than congestion on those links does. And, except for a very few codes, the effect is negligible.
Simply ordering the nodes so that the dimension-1 connections are adjacent, then grouping those 1-D hypercubes into adjacent 2-D cubes, etc., gives an arrangement where grabbing a sequential list of nodes tends to yield a fairly compact set in terms of the highest dimension needed to encompass all the nodes. If the job owns all the nodes in a low-dimension hypercube, there is little contention on the links from other jobs.
Yes, minimizing the number of hops tends to reduce the congestion, but this depends on how your network is routed. On the ICE-X systems, routing is done first by using minhop to find the list of shortest routes between vertices, then by selecting from those the path that traverses the network in dimension-number order. (On the ICE-X the dimensions are ordered by switch port number: e.g., if the inter-switch links use ports 10–19, then port 10 is dimension 1, port 11 is dimension 2, etc.; the port numbers need not be consecutive.)
Given this routing, if I have a job that occupies 2^N vertices and those vertices use only dimensions 1 - N links to talk to each other, then no other job will need to use any of my inter-switch links. So, any congestion is my own doing.
Now, this job’s placement also satisfies the goal of minimal total hops. In fact, I think, but cannot prove, that any placement on 2^N vertices that uses only links from N different dimensions will also be free from congestion from other jobs.
As to the numbering: Your numbering looks good to me, and is better than what is used on the ICE-X. Effectively, on the ICE-X, the vertices are numbered, starting at 0, such that directly connected vertices differ in their number by a power of 2. Thus, vertex 0 connects to 1, 2, 4, 8, …; 1 connects to 0, 3, 5, 9, … . This results in cases where selecting nodes sequentially from a list of free nodes can have more total hops than with your numbering scheme.
I did more thinking and am pretty sure I was wrong about the “any placement” part. To show it’s wrong, take a job that occupies opposite corners of the cube. If the cube is big enough, each of the two routes used between those vertices will pass through at least two connected vertices that are not part of the job. Those two vertices form a 2^1 cube, but their link includes traffic from outside the job.