CPU usage issues on multiple nodes

I have a job with 19 threads. I requested nodes on two different machines to execute the job, but the CPUs of the second machine were not used at all, which is not what I expected. What is the reason? I hope someone can help. Sincerely, thank you.

I used the following format to apply for resources but it didn’t work

You cannot run a pure pthreaded program across two hosts; pthreads are a shared memory parallel paradigm and two hosts do not share memory.

If you want to use both nodes, run two separate cputest programs, using pbs_tmrsh to start the one on the second node.

If you want to run a multihost parallel application, consider using MPI. MPI libraries can usually be configured to use pbs_tmrsh as the remote spawner, or even compiled to use the PBS/Torque TM API directly.

But the cputest program is in an NFS-shared directory that both machines can access. When a job is submitted, shouldn’t the scheduler, according to the resources requested, run the cputest program on both hosts and set up their respective code segments and execution memory? When temporary shared memory is needed, couldn’t it be created in the current shared directory (PBS_O_WORKDIR), so that communication and the like are managed uniformly by the scheduler?
Why do we still need to use mpi?

No, that’s not how it works.

That’s not how pthreaded programs work at all. They cannot create “shared memory” using a shared filesystem.

If you think that’s what the NPTL pthreads library should be doing, I suggest you contact the authors of said library. But I’m afraid you will not find an understanding ear there either.

I’m afraid this is really outside of PBSPro’s remit: you simply have fundamental misunderstandings about how parallel programming works. “Shared memory” means “memory that is under the management of a single Linux kernel, with the different processes mapping the same cache-coherent physical memory in their virtual address space”.

Note that for pthread programs it is even more stringent: all tasks share a single virtual address space and page translation table; the heap is shared and the stacks are thread-private.

Just having a shared filesystem doesn’t create a cache coherent NUMA memory space across two hosts.

Even if you want to run two unrelated pthreaded programs (one on each node) then you still need to start one on each node. That’s what pbs_tmrsh is for.

Even when you use MPI, BTW, remote processes are still created, but they are created by the MPI library when you call MPI_Init (which uses either ssh or something else like pbs_tmrsh or a tm_spawn PBS IFL library call).

A pthread library only creates threads on the local machine (more precisely, it creates new tasks, each with its own TID, within the current task group, all sharing the task group ID (TGID)), with a shared heap and thread-private stacks.

I understand. Thank you very much for your explanation.
In addition, I would like to ask: I did not request all of a node’s CPU resources, but when PBS scheduled the job, all available CPUs were used, exceeding the number I requested. Why?

When you request resources (via qsub), that request is used by the scheduler to find matching compute resource(s) to run your job. It does not lock the cores on the system or ensure the application uses only the requested amount of resources; in this case the cputest application might use one core, all cores, or even oversubscribe cores and memory. That is not managed by the scheduler.

cgroups might be an answer.

I get it. Thanks sincerely.