Memory usage across nodes

I am a new user of PBS.

I have a 12-node cluster with 376 GB of memory on each node, but I want to run a parallel job that requires 1 TB of memory. Is there an argument I can give PBS so that the job uses the memory of several nodes in parallel and gets its 1 TB?

Please try this:
qsub -l select=3:ncpus=16:mem=350gb:mpiprocs=16 -l place=scatter -- /bin/sleep 1000

Thanks, Adarsh for such a quick reply.

I have tried qsub -l select=3:ncpus=16:mem=350gb:mpiprocs=16 -l place=scatter -- /bin/sleep 1000
The job goes into the queued state and then gets terminated with a message stating insufficient resources.
I want to run a job that requires 1 TB, but each of our nodes has 376 GB of memory, so I am looking for a way to run a parallel job using memory from different nodes.

We have 12 nodes with 376 GB of memory each, so could you please suggest a solution?

Thanks in advance

Please note: if the resources requested for the job are not available, the job stays in the queued state until they become available; it is not deleted. If the job gets deleted or terminated, there must be some other issue.

The parallel job would use 350 GB from each of the compute nodes (3 in this case), so the total memory requested for this job would be 3 × 350 GB = 1050 GB, just over 1 TB.
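To make the arithmetic concrete, here is a minimal sketch that works out how many 350 GB chunks are needed to cover 1 TB and prints the matching qsub line. The variable names are only illustrative, and it assumes 1 TB = 1024 GB; the per-chunk figure and the qsub options are the ones from this thread.

```shell
# Figures from the thread; variable names are hypothetical.
total_gb=1024        # 1 TB expressed in GB (assumes 1 TB = 1024 GB)
per_chunk_gb=350     # kept below the 376 GB available on each node
ncpus=16             # cores requested per chunk

# Ceiling division: smallest number of chunks whose memory covers the total.
chunks=$(( (total_gb + per_chunk_gb - 1) / per_chunk_gb ))
granted_gb=$(( chunks * per_chunk_gb ))

echo "chunks=${chunks} granted=${granted_gb}gb"

# Print (not submit) the corresponding request, one chunk per node.
echo "qsub -l select=${chunks}:ncpus=${ncpus}:mem=${per_chunk_gb}gb:mpiprocs=${ncpus} -l place=scatter -- /bin/sleep 1000"
```

The script only prints the command, so it can be run safely on any machine before submitting anything to the scheduler.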

If your job has terminated, then please share the output of the following with us:

  1. source /etc/pbs.conf ; $PBS_EXEC/unsupported/pbs_dtj <job id>
    e.g. source /etc/pbs.conf ; $PBS_EXEC/unsupported/pbs_dtj 111

Thanks, Adarsh, for your tremendous support.

I was requesting more resources than were available. Issue solved.

Thank you


I am dealing with a similar issue too.
Manan: you mentioned you were ‘requesting more resources than available’. How did you fix the problem? I mean, which part of the line
‘qsub -l select=3:ncpus=16:mem=350gb:mpiprocs=16 -l place=scatter -- /bin/sleep 1000’ did you adjust so that it no longer exceeds your cluster’s resources?
Thanks in advance

This means you are asking for 3 compute nodes, each with 16 cores and 350 GB of RAM.
So now you need to check whether

  • each of your compute nodes has at least 16 cores and 350 GB of memory available to PBS.

If you want us to take a look, you can share the pbsnodes -av output, obfuscated if the node names are confidential.
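One way to do that check is to filter the pbsnodes -av output down to the node names and their CPU and memory lines. This is only a sketch: filter_node_resources is a hypothetical helper name, and it assumes the usual layout where each node name starts in the first column and its attributes (resources_available.ncpus, resources_available.mem) are indented beneath it.

```shell
# Hypothetical helper: keep node names plus their ncpus/mem lines.
# Assumes node names start in column 1 and attribute lines are indented.
filter_node_resources() {
    grep -E '^[^[:space:]]|resources_available\.(ncpus|mem) ='
}

# On the cluster, feed it the command mentioned above:
#   pbsnodes -av | filter_node_resources
```

Memory is typically reported in kb there, so compare it against the mem= value in your select statement after converting units.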

Adarsh: thank you very much!

I’m going to try again with this in mind.
