Memory usage across nodes

Manan · April 25, 2021, 10:12am

I am a new PBS user.

I have a 12-node cluster with 376 GB of memory on each node, but I want to run a parallel job that requires 1 TB of memory. Is there an argument I can pass to PBS so that the job uses the memory of several nodes in parallel and gets its 1 TB?

adarsh · April 25, 2021, 7:18pm

Please try this
qsub -l select=3:ncpus=16:mem=350gb:mpiprocs=16 -l place=scatter -- /bin/sleep 1000

Manan · April 26, 2021, 7:44am

Thanks, Adarsh, for such a quick reply.

I have tried
qsub -l select=3:ncpus=16:mem=350gb:mpiprocs=16 -l place=scatter -- /bin/sleep 1000
The job goes into the queued state and then gets terminated with a message stating insufficient resources.
I want to run a job that requires 1 TB, but each of our nodes has 376 GB of memory, so I need a way to run a parallel job that uses memory from different nodes.

As we have 12 nodes with 376 GB of memory each, could you please suggest a solution?

Thanks in advance

adarsh · April 27, 2021, 7:37am

Please note: if the resources requested to run the job are not available, the job stays in the queued state until they become available. It is not deleted. If the job gets deleted or terminated, there must be some other issue.

The parallel job would use 350 GB from each of the compute nodes (3 in this case), so the total memory requested for this job would be 1050 GB (3 × 350 GB), slightly more than 1 TB.
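The arithmetic behind that request can be sketched in a few lines of shell. This is only an illustration: the variable names are made up for the example, and the 376 GB figure is the nominal per-node memory from the question, which may differ from what PBS actually reports for a node.

```shell
#!/bin/sh
# Values taken from the qsub line above.
chunks=3
mem_per_chunk_gb=350
# Assumption: nominal memory per node, as stated in the question.
node_mem_gb=376

# Total memory requested = number of chunks x memory per chunk.
total_gb=$((chunks * mem_per_chunk_gb))
echo "total requested: ${total_gb} GB"

# With place=scatter each chunk goes to a different node,
# so every chunk has to fit within a single node's memory.
if [ "$mem_per_chunk_gb" -le "$node_mem_gb" ]; then
    echo "each chunk fits on one node"
else
    echo "a chunk is larger than one node"
fi
```

The same check applies to ncpus: the per-chunk value has to fit on a single node, otherwise the request cannot be satisfied no matter how many nodes are free.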

If your job has terminated, then please share the output of

source /etc/pbs.conf ; $PBS_EXEC/unsupported/pbs_dtj <job id>
e.g. source /etc/pbs.conf ; $PBS_EXEC/unsupported/pbs_dtj 111

Manan · April 27, 2021, 1:06pm

Thanks, Adarsh, for your tremendous support.

I was requesting more resources than were available. Issue solved.

Thank you.

edumendaya · November 28, 2024, 5:39am

I am dealing with a similar issue.
Manan: you mentioned you were ‘requesting resources more than available’. How did you fix the problem? I mean, which part of the line
‘qsub -l select=3:ncpus=16:mem=350gb:mpiprocs=16 -l place=scatter -- /bin/sleep 1000’ did you adjust so that it no longer exceeded your cluster’s resources?
Thanks in advance

adarsh · November 29, 2024, 5:13pm

That qsub line asks for 3 compute nodes, each with 16 cores and 350 GB of RAM.
So now you need to check whether each of your compute nodes has 16 cores and 350 GB of memory.

You can share the pbsnodes -av output, obfuscated if the node names are classified.
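As a rough sketch of that check, the shell function below reads pbsnodes -av output and reports, for each node, whether it offers the cores and memory one chunk asks for. The function name check_nodes is made up for this example, and the parsing assumes that resources_available.mem is reported in kb, which may not hold on every site. The sample input is invented and trimmed to the two attributes that matter.

```shell
#!/bin/sh
# check_nodes NCPUS MEM_GB  (hypothetical helper, reads pbsnodes -av output on stdin)
check_nodes() {
    awk -v need_cpus="$1" -v need_gb="$2" '
        function report() {
            if (node == "") return
            gb = kb / 1024 / 1024            # kb to GB
            verdict = (cpus + 0 >= need_cpus + 0 && gb >= need_gb + 0) ? "OK" : "TOO SMALL"
            printf "%s ncpus=%d mem=%dGB %s\n", node, cpus, gb, verdict
        }
        # A non-indented, non-empty line starts a new node record.
        /^[^ \t]/ { report(); node = $1; cpus = 0; kb = 0; next }
        $1 == "resources_available.ncpus" { cpus = $3 }
        # Assumption: the value looks like 394264576kb.
        $1 == "resources_available.mem"   { kb = $3; sub(/kb$/, "", kb) }
        END { report() }
    '
}

# Invented sample input; on a real cluster use:
#   pbsnodes -av | check_nodes 16 350
check_nodes 16 350 <<'EOF'
node01
     resources_available.mem = 394264576kb
     resources_available.ncpus = 16

node02
     resources_available.mem = 134217728kb
     resources_available.ncpus = 8
EOF
```

If fewer than 3 nodes come back as OK, the select statement is asking for more per chunk than the nodes can provide, and lowering ncpus or mem per chunk (while raising the chunk count if the total must stay the same) is the part of the line to adjust.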

edumendaya · November 29, 2024, 7:43pm

Adarsh: thank you very much!

I’m going to try again with this in mind.
