I have a 12-node cluster with 376 GB of memory on each node, but I want to run a parallel job that requires 1 TB of memory. Is there a PBS argument that would let me use the memory of several nodes in parallel so I can run my 1 TB job?
I have tried:
qsub -l select=3:ncpus=16:mem=350gb:mpiprocs=16 -l place=scatter -- /bin/sleep 1000
but the job goes into the queued state and then gets terminated, stating insufficient resources.
To restate: the job requires 1 TB, each node has only 376 GB of memory, and we have 12 such nodes, so I am looking for a way to run a parallel job using the memory from different nodes. Could you please suggest a solution?
Please note: if the requested resources are not available, the job stays in the queued state until they become available; it is not deleted. If the job is being deleted/terminated, there must be some other issue.
The parallel job would use 350 GB from each of the compute nodes (3 in this case), so the total memory requested for this job would be roughly 1.05 TB.
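For illustration, a job script along these lines could request the chunks described above. This is a sketch, not the original poster's script: the job name, MPI launcher, and application binary (`./my_parallel_app`) are placeholders.

```shell
#!/bin/bash
#PBS -N bigmem_job
#PBS -l select=3:ncpus=16:mem=350gb:mpiprocs=16
#PBS -l place=scatter

cd "$PBS_O_WORKDIR"

# PBS only *allocates* 350 GB on each of the 3 nodes. The application
# itself must be distributed (e.g. an MPI program) to actually use the
# memory on all of them; a single process is still limited to the RAM
# of the one node it runs on.
mpirun -np 48 ./my_parallel_app   # placeholder binary, 3 nodes x 16 ranks
```

Also worth checking: requesting mem=350gb per chunk on a 376 GB node leaves some headroom for the OS, but `pbsnodes` will show the memory PBS actually reports as available, which can be less than the physical total.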
If your job has been terminated, then please share the output of
I am dealing with a similar issue too.
Manan: you mentioned you were 'requesting resources more than available'. How did you fix the problem? I mean, which part of the line
'qsub -l select=3:ncpus=16:mem=350gb:mpiprocs=16 -l place=scatter -- /bin/sleep 1000'
did you adjust so as not to exceed your cluster's resources?
Thanks in advance