CPU usage issues on multiple nodes

I have a job with 19 threads. I requested nodes on two different machines to execute the job, but the CPUs of the second machine were not used at all, which is not what I expected. What is the reason? I hope someone can help. Sincerely, thank you.

I used the following format to apply for resources but it didn’t work

You cannot run a pure pthreaded program across two hosts; pthreads are a shared memory parallel paradigm and two hosts do not share memory.

If you want to use both nodes, run two separate cputest programs, using pbs_tmrsh to start the one on the second node.

If you want to run a multihost parallel application, consider using MPI. MPI libraries can usually be configured to use pbs_tmrsh as the remote spawner, or even compiled to use the PBS/Torque TM API directly.

But the cputest program is in an NFS-shared directory that both machines can access. When a job is submitted, shouldn’t the scheduler, according to the resources requested, run the cputest program on both hosts and set up their respective code segments and execution memory? When temporary shared memory is needed, couldn’t it be created in the current shared directory (PBS_O_WORKDIR), so that communication and the like are managed uniformly by the scheduler?
Why do we still need to use mpi?

No, that’s not how it works.

That’s not how pthreaded programs work at all. They cannot create “shared memory” using a shared filesystem.

If you think that’s what the NPTL pthreads library should be doing, I suggest you contact the authors of said library. But I’m afraid you will not find an understanding ear there either.

I’m afraid this is really outside of PBSPro’s remit: you simply have fundamental misunderstandings about how parallel programming works. “Shared memory” means “memory that is under the management of a single Linux kernel, with the different processes mapping the same cache-coherent physical memory in their virtual address space”.

Note that for pthread programs it is even more stringent: all tasks share a single virtual address space and page translation table; the heap is shared and the stacks are thread-private.

Just having a shared filesystem doesn’t create a cache coherent NUMA memory space across two hosts.

Even if you want to run two unrelated pthreaded programs (one on each node) then you still need to start one on each node. That’s what pbs_tmrsh is for.

Even when you use MPI, BTW, remote processes are still created, but they are created by the MPI library when you call MPI_Init (which uses either ssh or something else like pbs_tmrsh or a tm_spawn PBS IFL library call).

A pthread library only creates threads on the local machine (more precisely, it creates new tasks, each with its own TID, within the current task group, all sharing the task group ID (TGID)), with a shared heap and thread-private stacks.

I understand. Thank you very much for your explanation.
In addition, I would like to ask: I did not request all of a node’s CPU resources, but when PBS scheduled the job, all available CPUs were used, exceeding the number I requested. Why?

When you request resources (via qsub), that request is used by the scheduler to find matching compute resource(s) to run your job. It does not lock the cores on the system or ensure the application uses only the requested amount of resources; in this case the cputest application might use one core, all cores, or even oversubscribe cores and memory. That is not managed by the scheduler.

cgroups might be an answer.

I get it. Thanks sincerely.