MPI Proc Weirdness [updated from OMP_NUM_THREADS Mis-set?]

This worked a couple of days ago but now I only get 4 lines of output rather than 16. The other guy I work with says there were no changes, but something is different.

Now, although 16 ncpus are set aside (the tracejob output confirms it), OMP_NUM_THREADS seems mis-set.

What dumb thing could I be missing, or where should I look? This first example is also the only one that uses 4 nodes, even though select=4 is set in the other examples too.

I used two different scripts to verify: Python (openmpi4 with mpi4py) and a dirt-simple C++ program I wrote a while ago to test the same thing.

They are called within the job script like this:
mpiexec --mca btl '^openib' python python-mpi-hello.py
mpiexec --mca btl '^openib' mpi-hello
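
For context, python-mpi-hello.py is just the usual mpi4py boilerplate, along these lines (a sketch; the C++ program prints the same thing):

# python-mpi-hello.py (roughly)
from mpi4py import MPI
import socket

comm = MPI.COMM_WORLD
rank = comm.Get_rank()       # this process's rank
size = comm.Get_size()       # total number of MPI processes launched
print("process rank %d, of %d, running on %s" % (rank, size, socket.gethostname()))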

Server default is scatter.

Directives

#PBS -l nodes=4:ncpus=4

#PBS -l mem=10gb

Env value: OMP_NUM_THREADS = 4 (16 CPUs set aside)

tracejob
hpc-compute01:ncpus=4:mem=2621440kb
hpc-compute02:ncpus=4:mem=2621440kb
hpc-compute03:ncpus=4:mem=2621440kb
hpc-compute04:ncpus=4:mem=2621440kb

Job Output
process rank 0, of 4, running on hpc-compute01
process rank 2, of 4, running on hpc-compute03
process rank 1, of 4, running on hpc-compute02
process rank 3, of 4, running on hpc-compute04

For giggles, I tried other select statements and tests as well, and in each case 16 CPUs are set aside by PBS but OMP_NUM_THREADS is off.

Except for one case where it was right, but in that case only 1 was used (ugh). I avoided cluttering the post with the output, hoping this might be enough to point me in the right direction.

Thanks

What value did you think OMP_NUM_THREADS should have? Remember that threads can share memory only within a node. So when you ask for X nodes, each with 4 CPUs, you should get OMP_NUM_THREADS=4 on each of the X nodes.
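
A quick way to see this from the job itself is to have every rank report its host and the OMP_NUM_THREADS value it actually sees. A minimal sketch with mpi4py (since you are already using it):

from mpi4py import MPI
import os, socket

comm = MPI.COMM_WORLD
# Each rank reports where it landed and what OpenMP would see there.
print("rank %d of %d on %s: OMP_NUM_THREADS=%s" % (
    comm.Get_rank(), comm.Get_size(), socket.gethostname(),
    os.environ.get("OMP_NUM_THREADS", "<unset>")))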

Good point, but I am only getting output from 4 ranks. It used to be 16 (or 20, or whatever multiple I set).

It could be a misunderstanding on my part, but I only dug that up after the behavior showed up. So if OMP_NUM_THREADS is not mis-set, where else could I dig to find the change in behavior?

Any thoughts?

Edit: (because then I am only getting 1 per node)

Maybe I should have called this mpiprocs weirdness? (Wait, I can update the subject? OK, new title.)

One detail I left out: this is PBS 2022.1.

Did some more testing this morning. The basic first spec is where I used to get 16 rank references across 4 nodes without specifying mpiprocs or -np (-l nodes=4:ncpus=4).

In case it was a file-locking thing (just in case; there have been no errors), each rank also writes a separate file listing its node, rank, and size, and that matches what I have below and my standard out.
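
The per-rank file write is nothing fancy, roughly this:

from mpi4py import MPI
import socket

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
# One file per rank, so no shared-file locking can interfere.
with open("hello.%04d.out" % rank, "w") as f:
    f.write("node=%s rank=%d size=%d\n" % (socket.gethostname(), rank, size))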

Not sure anyone can solve this per se, but if you can clue me in on where to look for what has changed, point me in a direction, or tell me where I have been thick-headed, that would be great.

#PBS -l nodes=4:ncpus=4 used to get me 16 rank references in the hello-world output file. It no longer does. But -l nodes=4:ppn=4 does. (hello rank, size, node output)

Maybe I am second-guessing myself too much, but I thought that in a simple spec ncpus == ppn (OMP_NUM_THREADS is being set to the ncpus value, and PBS is allocating the right number of cores).

PBS is allocating 16 CPUs according to tracejob in all cases:

#PBS -l select=4:ncpus=4
Output: 4 references / same node
Tracejob: 4-ncpu chunk x 4 on same host
Total NCPUs: 16
Node file: same node 4 times

#PBS -l nodes=4:ncpus=4
Output: 4 references / 4 nodes
Tracejob: 4-ncpu chunk x 1 on each node
Total NCPUs: 16
Node file: each node once

#PBS -l select=4:ncpus=4:mpiprocs=4
Output: 16 references / same node
Tracejob: 4-ncpu chunk x 4 on same host
Total NCPUs: 16
Node file: same node 16 times

#PBS -l nodes=4:ppn=4
Output: 16 references / 4 nodes
Tracejob: 4-ncpu chunk x 1 on each node
Total NCPUs: 16
Node file: each node 4 times (16 entries)

First, as an aside: your example below indicates that placement was not scatter. With scatter, each chunk ("select=4:") should come from a different node.

#PBS -l select=4:ncpus=4:mpiprocs=4
Output: 16 references / same node
Tracejob: 4-ncpu chunk x 4 on same host
Total NCPUs: 16
Node file: same node 16 times

I'm not entirely sure what is going on, but you can simplify your life by always using the select= format rather than the deprecated nodes= format. Also, it would be clearer for debugging if you didn't ask for the same number of nodes and CPUs; say, ask for 3 nodes, each with 4 CPUs.

So, going with select, you can specify exactly what you want.

#PBS -l select=X:ncpus=Y:mpiprocs=Z:ompthreads=W
#PBS -l place=scatter

Here, you will have allocations on X distinct nodes, each with Y CPUs. You can then sub-allocate those CPUs into Z MPI processes, each having W threads.

Now, only some combinations of (X, Y, Z, W) make sense. For example, the number of CPUs used per chunk (Z * W) should usually be less than or equal to the CPUs allocated to the chunk (Y).

Example: say each node has 12 CPUs, and say your application runs best on just 2 threads (more threads give faster walltime, but lower efficiency). To fully use one node, you would specify

#PBS -l select=1:ncpus=12:mpiprocs=6:ompthreads=2

If your problem size gets bigger so that you need, say, 5 nodes of computation, you specify that with:

#PBS -l select=5:ncpus=12:mpiprocs=6:ompthreads=2
#PBS -l place=scatter

The things you changed were the number of chunks and that you wanted the chunks scattered to different nodes.

Now, say the problem gets bigger in memory needs per MPI process such that you can no longer fit 6 MPI processes into a node, but 4 will fit. You would use something like

#PBS -l select=5:ncpus=8:mpiprocs=4:ompthreads=2:mem=12gb
#PBS -l place=scatter

Here you let PBS know how much memory you need per chunk (4 MPI processes @ 3gb each). Because you have only 4 processes, you need only ncpus=8.

However, if it is likely the 4 remaining CPUs on each node would otherwise sit idle, you could put them to use by bumping the thread count for each process.

#PBS -l select=5:ncpus=12:mpiprocs=4:ompthreads=3:mem=12gb
#PBS -l place=scatter
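
If it helps to keep the chunk arithmetic straight, here is a throwaway checker (a hypothetical helper, not anything PBS provides):

def check_chunks(X, Y, Z, W):
    # X chunks, Y CPUs per chunk, Z MPI procs per chunk, W threads per proc.
    assert Z * W <= Y, "chunk oversubscribed: %d*%d > %d" % (Z, W, Y)
    print("total CPUs: %d, total MPI ranks: %d, busy CPUs per chunk: %d of %d"
          % (X * Y, X * Z, Z * W, Y))

check_chunks(X=5, Y=12, Z=6, W=2)   # select=5:ncpus=12:mpiprocs=6:ompthreads=2
check_chunks(X=5, Y=8,  Z=4, W=2)   # select=5:ncpus=8:mpiprocs=4:ompthreads=2
check_chunks(X=5, Y=12, Z=4, W=3)   # select=5:ncpus=12:mpiprocs=4:ompthreads=3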

Thank you… far, far clearer than what I find in the docs. I might just print that out and use it as a reference. I do not quite find this as intuitive (yet?) as you do.

Thinking more on what you said and digging some more, it looks like the nodes= syntax might just default to scatter (place "cannot be used with nodes" error message), which makes sense if you are asking for scattered resources.

Resource_List.place=scatter Resource_List.select=4:ncpus=4
resources_used.ncpus=16
exec_host=compute01/04+compute02/04+compute03/04+compute04/04

I opened a support ticket and am really not sure whether I have missed something painfully obvious, but that was a great reference post. (I will update if I get anywhere.)

Not sure what is going on here either. But I am beginning to think the idle cores are what are tripping me up and the rest is minutiae that will go away once that gets worked out.

There was a fast, easy, expected response on that (just a quick demo and test), and it ain't happening anymore.


FWIW, the BACKWARD COMPATIBILITY section of the pbs_resources man page goes into detail on how -l nodes= or ncpus= specifications are interpreted as select= values. From my reading, it matches what you show for your examples.

I've seen it, but I have to tell you the truth… I am wondering now if I got twisted around at some point between ppn, ncpus, and mpiprocs. What if I was using ppn and not seeing it… and switched to ncpus at some point? Completely second-guessing myself this morning.

Old Syntax
So ppn=x is converted to ncpus=x:mpiprocs=x, but ncpus=y is ncpus=y:mpiprocs=1 (and I think that example is missing from the docs). They are not exactly ncpus=ppn or ppn=ncpus.

so
-l nodes=2:ppn=4 converts to -l select=2:ncpus=4:mpiprocs=4
whereas
-l nodes=2:ncpus=4 basically converts to a select statement where mpiprocs=1
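
To pin my own understanding down, the mapping as I now read the BACKWARD COMPATIBILITY rules (a sketch of the rule, not PBS's actual conversion code):

def old_to_select(nodes, ppn=None, ncpus=None):
    # -l nodes=N:ppn=P   ->  select=N:ncpus=P:mpiprocs=P
    if ppn is not None:
        return "select=%d:ncpus=%d:mpiprocs=%d" % (nodes, ppn, ppn)
    # -l nodes=N:ncpus=C ->  select=N:ncpus=C:mpiprocs=1 (mpiprocs defaults to 1)
    return "select=%d:ncpus=%d:mpiprocs=1" % (nodes, ncpus)

print(old_to_select(2, ppn=4))     # select=2:ncpus=4:mpiprocs=4
print(old_to_select(2, ncpus=4))   # select=2:ncpus=4:mpiprocs=1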

If you have not yet seen it, you might try Chapter 4, “Allocating Resources & Placing Jobs”, in the PBS Professional User’s Guide.

ty… yeah, been through that. I equally hate and love their docs for depth and occasional lack of clarity (I hate that the hooks doc uses really old-school Python syntax too, and that they raise an exception to exit cleanly).

Just to summarize my own drivel,

What is the old syntax -l nodes=4:ncpus=4 supposed to translate to?
And
OMP_NUM_THREADS equals ncpus, but shouldn't ncpus also set mpiprocs when mpiprocs is NOT specified? (I thought I saw a line in the docs saying ncpus==mpiprocs, and I can't find it now.)

(This might be the nut of where I got turned around: a subtle misunderstanding there.)

What I am seeing is

-l nodes=4:ncpus=4 looks like -l select=4:ncpus=4:mpiprocs=1 and OMP_NUM_THREADS is set to 4.

4 CPUs per chunk/node and 16 CPUs total.

Is that correct behavior? (I think my twist was expecting it to be -l select=4:ncpus=4:mpiprocs=4.)

The pbs_resources man page says that the default for mpiprocs is 1 when ncpus > 0. So, this is as per spec.


Arg… that is also in the manual somewhere, and I think I read it as ncpus mapping to mpiprocs 1:1 (but now I am thinking it is ppn… so that -l nodes=4:ppn=4 converts to -l select=4:ncpus=4:mpiprocs=4 -l place=scatter).

That is the behavior I am seeing.