Pmem enforcement when pmem > mem?

I'm trying to get my head around the fact that pmem is not a consumable in some way… and this is from the big book:

“The limit for pmem is enforced if the job specifies, or inherits a default value for, pmem. When pmem is enforced, the limit is set to the smaller of mem and pmem. Enforcement is done by the kernel, and applies to any single process in the job.”

So you can submit a job with -l select=1:ncpus=5:mem=1gb -l pmem=100gb into a queue that has a max mem of 5gb, and the scheduler is OK with this? But 1GB would be the actual enforced limit?

(and yes, using cgroup hooks)
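For concreteness, this is the scenario I mean (queue name and script name are made up, and this is untested):

```
# Hypothetical queue capped at 5gb of mem
qmgr -c "set queue workq resources_max.mem = 5gb"

# The job in question: mem is well under the cap, pmem absurdly over it
qsub -q workq -l select=1:ncpus=5:mem=1gb -l pmem=100gb job.sh

# Reading the big book literally, the per-process limit would be
# min(mem, pmem) = 1gb, and the scheduler would only ever check mem
# against resources_max.mem -- pmem sails straight through.
```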

Just so I understand? (and there seems to be precious little on the web about this)

I haven’t had a chance to code something to test this, but curious if someone knew the answer off the top of their head.
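If someone wants to poke at it before I do, an untested sketch to see what actually lands on the job's processes:

```
# Submit a trivial job and dump what it runs under: ulimit -a shows any
# rlimits the MoM set, and /proc/self/cgroup shows which cgroup the job
# landed in (relevant since we're running the cgroup hook).
echo 'ulimit -a; cat /proc/self/cgroup' | \
    qsub -l select=1:ncpus=1:mem=1gb -l pmem=100gb
```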

Ty.

Based on a quick scan through the code, I think pmem is just a way for you to specify how much total memory the execution host should have, not how much your code is going to use, which is why it is not consumable. Suppose your code uses only a couple of gigabytes of memory itself, but repeatedly reads 10gb of data from files. It would be nice if it ran on a node with, say, >14gb of memory so those files could be cached. That's especially important if you run multiple MPI ranks on the same node, reading the same files: each rank needs a certain amount of working memory, but they share the file cache.
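If that reading is right, the use case would look something like this (made-up numbers and a hypothetical script name):

```
# Four ranks at ~2gb of working memory each, but ask for a host with at
# least 14gb so the shared 10gb of input files can stay in page cache.
qsub -l select=1:ncpus=4:mem=8gb -l pmem=14gb mpi_job.sh
```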

This is just a guess. Someone correct me if I’m wrong.

Thanks for the scan. Yeah, it's weird. It is also per-process, which is why the number I picked was more than ridiculous. Regular mem is proportionally shared, so pmem "sounds" consumable, but the big book makes it sound like a polite ask.

Not much to find there or on Google, though some university sites specify it without a mem= piece, which only adds to the puzzle. Deeper insight would definitely be appreciated, or a doc from somewhere (and it isn't in their docs that I have seen).

I got some code from a vendor, and they specify pmem in their wrapper but not mem, and it just seems… wrong, and a rule violation (to me).

I'll edit this in (after the original post):

Still not clear why this doesn't stop the scheduler when a max mem is set on the queue, given the definition above ("When pmem is enforced, the limit is set to the smaller of mem and pmem"), but from

https://info.nrao.edu/computing/guide/cluster-processing/appendix/memory-options#section-3

it seems pmem enables swapping more cleanly, whereas mem will kill or swap depending on your settings (that sounds lethal for diskless nodes):

“For both single-node and multi-node jobs, pmem is the maximum amount of memory expected to be used per processor, per node. If asking for multiple processors (via ppn) then the scheduler will multiply pmem by the number of processors requested and look for that much available memory. For example, if you use -l nodes=1,ppn=2,pmem=3gb then the scheduler will look for one node with 6GB of memory available. If you use -l nodes=2,ppn=4,pmem=3gb then the scheduler will look for two nodes, each with 12GB of memory available.

If the total amount of memory used by all the processes combined on any node in the job exceeds pmem*ppn, then those processes on that node will swap. Processes on a node can exceed pmem without swapping as long as the total stays under pmem*ppn.”
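Restated as actual Torque submissions (script name made up; note Torque wants nodes=N:ppn=M, with pmem comma-separated):

```
# One node, 2 procs: scheduler wants 2 * 3gb = 6gb free on that node
qsub -l nodes=1:ppn=2,pmem=3gb job.sh

# Two nodes, 4 procs each: 4 * 3gb = 12gb free on each node
qsub -l nodes=2:ppn=4,pmem=3gb job.sh

# Per the NRAO page, blowing past pmem*ppn on a node means those
# processes swap rather than get killed.
```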

Based on what you found, my guess about pmem is wrong. (Note that the NRAO page you reference is for Torque, rather than PBS. They might differ here.)

Let us know what you eventually figure out. I’m curious.
