Placement sets with hypercube network topology

Hi all,

I’m experimenting with placement sets using 16 VMs to simulate a small, 3-D hypercube environment:

  • Vertex 1 (switch 1) - nodes n101 & n102
  • Vertex 2 (switch 2) - nodes n201 & n202
  • Vertex 3 (switch 3) - nodes n301 & n302
  • Vertex 4 (switch 4) - nodes n401 & n402
  • Vertex 5 (switch 5) - nodes n501 & n502
  • Vertex 6 (switch 6) - nodes n601 & n602
  • Vertex 7 (switch 7) - nodes n701 & n702
  • Vertex 8 (switch 8) - nodes n801 & n802

image

The results so far are pleasing, with jobs being placed on nodes as expected (minimising the number of ‘hops’ between nodes assigned to the same job).

The thing is, although it’s working fine, I’m not sure that I’ve gone about configuring PBS the best way. The resulting config is quite large/complex for a simple, 16-node cluster. If I used the same methodology for a large cluster, it’d be far too complex to manage. I’m probably over-thinking it by trying to cover every (optimal) node combination.

I’d really appreciate some guidance from anyone who has experience with a similar (real-world) configuration.

TIA,

Craig



If I understand your use case correctly, then I think you should only need to create one host-level resource, say “switch”, then set resources_available.switch on the nodes to their respective switch number, say “s1”, “s2”, … “s8”, and set node_group_key to “switch”. This should make the scheduler group the nodes by their switch value and place jobs on nodes with the same switch value.
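In qmgr terms, a minimal sketch of that might look like the following (hostnames taken from your list; the resource name “switch” and values “s1”… are just example names, untested):

```shell
# create a host-level string resource ("h" flag = host level)
qmgr -c "create resource switch type=string, flag=h"

# tag each node with the switch it hangs off (repeat for the rest, up to n802)
qmgr -c "set node n101 resources_available.switch = s1"
qmgr -c "set node n102 resources_available.switch = s1"
qmgr -c "set node n201 resources_available.switch = s2"

# have the scheduler form placement sets from the switch values
qmgr -c "set server node_group_key = switch"
qmgr -c "set server node_group_enable = true"
```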

Thanks very much for your response agrawalravi90,

I’ll give that a try and let you know how it works out. In the meantime, I thought I’d add an example of how the current config is working.

If I take two nodes offline (n201 & n302)…

…then submit a job that requires four nodes…

…PBS chooses two nodes from two adjacent vertices/switches (see original diagram above).

This is exactly the behaviour that I want: minimum hops between nodes.

Regards,

Craig

Thanks for the clarification, I think I was mistaken: I thought that you didn’t want jobs to span across switches. Yeah, I can’t think of a simpler way of doing this; I’ll let those more knowledgeable comment.

For a larger system, you need to be less precise. If you provide all the placement sets for perfect placement, the scheduler will get bogged down and will be too slow. While the exact hypercube is nice, it probably doesn’t make much difference to have perfect placement. Slightly imperfect placement is probably good enough.

Basically, strip out some of the permutations of the larger sets.

Bhroam

Thanks Ravi,

Yeah, I tried this config…

…and it worked fine for jobs with 1-2 nodes. Jobs with >2 nodes remained in the queue (due to “set sched do_not_span_psets = true” and/or “set sched only_explicit_psets = true”). If I changed either of these to “false”, the job ran, but not on the desired nodes.

Regards,

Chris

Yeah, that’s what I’d expect; I had misunderstood your use case.

When do_not_span_psets is true, a job must fit within a single placement set. Since the largest placement sets you have set up are 2 nodes, the largest job can only request 2 nodes. You need to create larger placement sets for larger jobs:
Set up 2-node placement sets for nodes 0 hops away
Set up larger placement sets for nodes 1 hop away
etc.

This way larger jobs will use the larger placement sets.

These larger placement sets are the ones I was talking about before, where you do not create all permutations of them. You need to create all of the smaller ones, and fewer and fewer of the larger ones.
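One way to express that hierarchy without enumerating every node combination might be an extra host-level resource per grouping level (the resource names “cube1”/“cube2” and the values below are made up for illustration; untested sketch):

```shell
# one resource per grouping level
qmgr -c "create resource switch type=string, flag=h"   # 0 hops (same switch)
qmgr -c "create resource cube1 type=string, flag=h"    # 1 hop  (2 switches)
qmgr -c "create resource cube2 type=string, flag=h"    # 2 hops (4 switches)

# e.g. n101 is on switch s1, which sits in 1-hop group c1a and 2-hop group c2a
qmgr -c "set node n101 resources_available.switch = s1"
qmgr -c "set node n101 resources_available.cube1 = c1a"
qmgr -c "set node n101 resources_available.cube2 = c2a"

# placement sets are formed for every value of every listed resource;
# the scheduler places a job in the smallest set it fits in
qmgr -c 'set server node_group_key = "switch,cube1,cube2"'
qmgr -c "set server node_group_enable = true"
```

Because each node only gets one value per level, the config grows linearly with the node count rather than with the number of node combinations.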

Bhroam

Thanks very much Bhroam,

I took your advice and went for a much simpler config…

Sure, it’s not perfect, but it works how I want it to in a majority of cases.

Thanks again.

Regards,

Craig

FWIW, we have a 12-D InfiniBand hypercube system with 9 nodes at each vertex (SGI ICE-X). In our experience with real codes, the number of hops between nodes has less effect than does congestion on those links. And, except for a very few codes, the effect is negligible.

Simply ordering the nodes so that the dimension-1 connections are adjacent, then grouping those 1-D hypercubes into adjacent 2-D cubes, etc., gives an arrangement where grabbing a sequential list of nodes tends to yield a fairly compact set in terms of the highest dimension needed to encompass all the nodes. If a job owns all the nodes in a low-dimension hypercube, there is little contention for the links from other jobs.
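Dale’s ordering idea can be sketched in a few lines of Python (the vertex labels here are illustrative, not from the ICE-X): mapping list position i to vertex i ^ (i >> 1), the standard Gray code, makes consecutive list entries directly connected and makes every aligned run of 2^k entries an exact k-D subcube.

```python
def gray(i: int) -> int:
    """Map list position i to a hypercube vertex so that consecutive
    positions differ in exactly one bit (i.e. are one hop apart)."""
    return i ^ (i >> 1)

def hops(u: int, v: int) -> int:
    """Distance between hypercube vertices = number of differing bits."""
    return bin(u ^ v).count("1")

# Order the 8 vertices of a 3-D hypercube this way.
order = [gray(i) for i in range(8)]
print(order)  # [0, 1, 3, 2, 6, 7, 5, 4]

# Consecutive entries in the list are network neighbours...
assert all(hops(order[i], order[i + 1]) == 1 for i in range(7))

# ...and the first 2^k entries always form an exact k-D subcube,
# so a sequential grab of free nodes tends to stay compact.
assert sorted(order[:4]) == [0, 1, 2, 3]  # first 4 = a 2-D face
```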

Thanks very much Dale,

I take your point re number of hops and congestion. Wouldn’t minimising the number of hops be a good way of reducing the likelihood of congestion?

Re your second paragraph, could you explain that in a little more detail please? I take it to mean something like…

nodes 101-109 on vertex 1
nodes 201-209 on vertex 2
…
nodes 701-709 on vertex 7
nodes 801-809 on vertex 8

…with the vertices arranged so that the next sequential vertex is adjacent…

image

Regards,

Craig

Yes, minimizing the number of hops tends to reduce congestion, but this depends on how your network is routed. On the ICE-X systems, routing is done first by using minhop to find the list of shortest routes between vertices, then by selecting from those the path that traverses the network in dimension-number order. (On the ICE-X the dimensions are ordered by switch port number. E.g., if the inter-switch links use ports 10-19, then port 10 is dimension 1, port 11 is dimension 2, etc. [the port numbers need not be consecutive].)

Given this routing, if I have a job that occupies 2^N vertices and those vertices use only dimensions 1 - N links to talk to each other, then no other job will need to use any of my inter-switch links. So, any congestion is my own doing.

Now, this job’s placement also satisfies the goal of minimal total hops. In fact, I think, but cannot prove, that any placement on 2^N vertices that uses only links from N different dimensions will also be free from congestion from other jobs.

As to the numbering: Your numbering looks good to me, and is better than what is used on the ICE-X. Effectively, on the ICE-X, the vertices are numbered, starting at 0, such that directly connected vertices differ in their number by a power of 2. Thus, vertex 0 connects to 1, 2, 4, 8, …; 1 connects to 0, 3, 5, 9, … . This results in cases where selecting nodes sequentially from a list of free nodes can have more total hops than with your numbering scheme.
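The difference between the two numberings can be seen in a few lines of Python (a sketch; “gray” here stands in for the adjacent-sequential numbering, which is essentially a Gray code):

```python
def hops(u: int, v: int) -> int:
    # hypercube distance = number of bits that differ between the labels
    return bin(u ^ v).count("1")

# ICE-X style: plain binary labels, so direct neighbours differ by a
# power of 2 (vertex 0 connects to 1, 2, 4, 8, ...)...
assert all(hops(0, 1 << d) == 1 for d in range(12))

# ...but consecutive labels are not always neighbours:
assert hops(1, 2) == 2   # 001 -> 010: two bits flip, two hops
assert hops(3, 4) == 3   # 011 -> 100: three bits flip, three hops

# A Gray-code labelling keeps every consecutive pair one hop apart,
# so selecting sequentially can never start with a multi-hop jump.
gray = lambda i: i ^ (i >> 1)
assert all(hops(gray(i), gray(i + 1)) == 1 for i in range(15))
```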

I did more thinking and am pretty sure I was wrong about the “any placement” part. To show it’s wrong, take a job that occupies opposite corners of the cube. If the cube is big enough, each of the two routes used between those vertices will pass through at least two connected vertices that are not part of the job. Those two vertices form a 2^1 cube, but their link includes traffic from outside the job.