PP-507: Add support for filtering nodes as per the job request

Hi,

As discussed on the design discussion page for PP-506, I have now separated out the design proposal for PP-507 to this page.

Please review the design proposal and provide comments.

Regards,
Arun Grover

Thanks for separating this from PP-506

We’d like to see nfilter applied at the chunk level in the select statement. We certainly have jobs that require 2+ different types of nodes in a single job, so a job-wide nfilter would not suffice.

A chunk-level nfilter leads to the next set of complications (opportunities?), something I’ve pondered a bit as we’ve considered how to move toward an nfilter-like request mechanism. Here are a couple of use cases I’m considering. I’ll use the resources_available resource "model" in my examples; at present, model can be one of [san, ivy, has, bro] (representing Sandy Bridge, Ivy Bridge, Haswell, and Broadwell).

uc1) Multi-chunk job that chooses different MPI rank distributions per chunk based on memory, an example request might look like:

qsub -l select=1:mpiprocs=1:model=has:nfilter="node['resources_available']['mem']>=512gb"+100:mpiprocs=12:nfilter="node['resources_available']['ncpus']>=12 and node['resources_available']['gigs_per_core']>=2gb"

uc2) Multi-chunk job that chooses different MPI rank distributions per chunk. The user doesn’t care which model is chosen, but they do want the same model to be chosen across all chunks. An example request might look like:

qsub -l select=1:mpiprocs=2:nfilter="node['resources_available']['model']=pbs.same and bigmem=true"+100:mpiprocs=16:nfilter="node['resources_available']['model']=pbs.same and node['resources_available']['ncpus']>=16"

In the above I’ve introduced a magic value pbs.same to accomplish the intent of the use case. I imagine another magic value pbs.any could satisfy a different use case where the user doesn’t care which model is provided by PBS. These values are illustrative; I’m not pushing for their use as presented here. The general notion is expressing an inter-chunk relationship. This is similar to the description of "like nodes" in the design of PP-506, but as I understand PP-506 a job-set would need to enumerate all requests that can satisfy the particular inter-chunk relationship, whereas here we constrain the matches a filter can come up with.

-Greg

@gmatthew Thank you for providing your inputs.

I understand that having the filter run on each chunk would probably help in the use cases you mentioned. Although, in the case of uc2, would having place=group= not help in getting the "pbs.same" effect?

I actually tried a small prototype that filters 1000 nodes for each job, with 1000 jobs submitted in the setup. It takes around 24 seconds for the scheduler to run a job-wide filter on all the jobs once per scheduling cycle. Having to do this on each chunk would bring the performance down further. I think having Python do the filtering may not be such a good option, although Python does allow users to do a lot more while filtering (like using the math module).
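
For reference, the job-wide filtering in the prototype is roughly equivalent to the sketch below (illustrative only; the function name and the eval-based approach are my simplification, not the actual prototype code):

    def filter_nodes(nodes, expression):
        matched = []
        for node in nodes:
            # Evaluate the expression as plain Python with 'node' in scope.
            # This per-node eval is where both the performance and the
            # security concerns come from.
            if eval(expression, {}, {"node": node}):
                matched.append(node)
        return matched

    # Example (size suffixes like '512gb' would need to be normalized to
    # plain numbers before the expression is evaluated):
    # filter_nodes(nodes, "node['resources_available']['ncpus'] >= 12")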

Do you see a use case where filtering on the basis of "resources_assigned" would help too? If it is only resources_available, then maybe we can do something outside of the scheduler.

Thanks,
Arun

If a magic value like pbs.same could be used as place=group=pbs.same, yes I could see that working for my use case. I wonder, though, how quickly we would want filtering that effectively acts as 2+ group-like resources, requiring something beyond the place=group= approach.

There are many interesting questions around the performance of nfilter based on design/use, but it’s probably too early to get into this in great detail. For example, the various bits of Python overhead (spinning up an interpreter, marshaling, etc.) may represent a sizable portion of that 24 seconds, or not. :slight_smile: Bhroam has done some great work implementing our design for scheduling faster with node equivalence; that may (or may not) reduce the performance loss if Python is involved. Does Python have to be involved, even if it ends up not being performant in this application?
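
(A rough way to separate the per-evaluation cost from the fixed overhead might be something like the following; purely illustrative, and not a claim about where the 24 seconds actually goes.)

    import time

    expression = "node['resources_available']['ncpus'] >= 12"
    node = {"resources_available": {"ncpus": 16}}

    # Compile once so repeated evaluations don't re-parse the expression.
    code = compile(expression, "<node_filter>", "eval")

    start = time.time()
    for _ in range(1000 * 1000):  # roughly 1000 jobs x 1000 nodes
        eval(code, {}, {"node": node})
    print("pure evaluation time: %.2f seconds" % (time.time() - start))

    # Whatever remains of the 24 seconds would be interpreter startup,
    # marshaling node data into Python, and other per-cycle overhead.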

I don’t see a use case involving resources_assigned at the moment for our site, but I could easily imagine resources_assigned being handy for other sites.

@gmatthew Just to touch base again on uc1 that you previously mentioned -
Is that use case the result of a one-off job that you intend to submit once in a while (maybe to benchmark), or is it something that is needed in every other job that intends to run on bigger-memory nodes in its first chunk and so on?

uc1 is something our users are doing today, and we see a general trend toward setting up MPI jobs in that way in order to avoid overloading the head node with both the generally higher/different workload of rank 0 and the workload of additional ranks.

Can you name it ndfilter instead of nfilter? nfilter sounds like numeric filter. :slight_smile:

ndfilter might sound more like node-filter.

Sure, I can make that change. @billnitzberg also suggested changing "nfilter", and asked to either change it to "node_filter" or just "filter".

So do you two have any preference between “filter”, “node_filter” or “ndfilter”?

I’d prefer node_filter. Second preference would be ndfilter.

@billnitzberg asked me to provide more use cases for this feature, based on what our users are doing now. Here’s a quick bit of background on how our users request resources, then I’ll get to the use cases. Users pick from one of roughly 7 primary types of nodes using resources_available.model. If they have a need for some specialized subset of a primary type, e.g. bigmem, they request that resource as well. In some cases we have different bigmem sizes, so a user might additionally specify resources_available.mem to request the minimum mem size needed. Then we have onesy/twosey special nodes that might have a different resources_available.model, etc.

uc3) Multi-chunk job that picks a higher memory node for the head node and cheaper (lower accounting charge rate) nodes for the rest:

qsub -l select=1:model=bro:mpiprocs=4+100:model=san:mpiprocs=16

uc4) Multi-chunk job that picks all nodes from the same type, but places fewer MPI ranks on the head node:

qsub -l select=1:model=has:mpiprocs=8+199:model=has:mpiprocs=24

uc5) Multi-chunk job that picks a special onesy node for the last node of the job, and a different type for all other nodes:

qsub -l select=299:model=ivy+1:model=fasthas

Some observations:

a) In general there will be several model types that a user considers "cheaper" as used in uc3, so a node_filter for that particular chunk could express the acceptable set.

b) For a request like uc4 we expect that users are concerned more with overall and per-core memory rather than the particular model.

c) We have thought about trying to reformulate select statements to more exactly describe the MPI/mem ratio needed and have PBS parcel out ranks to match. This would potentially recast uc3 to something like:

qsub -l select=1:model=bro:mpiprocs=4+1600:ncpus=1:mpiprocs=1:gigs_per_core=2gb

and PBS could find nodes that match, with possible constraints like pbs.same mentioned earlier in the discussion.

We have currently moved resources away from this project, so it is stalled for now.

Our current understanding of the situation is that the original design proposal does not impact the requirements posted by @gmatthew (thanks for posting), so we may continue our discussion.

My main concern is the use of '<' and '>' in the current proposal. Most shells use these characters to redirect I/O. Users will inevitably forget to quote or escape them, leading to potential loss of data. Could you propose some safer alternatives for expressing these conditionals? For example, '.gt.' instead of '>' and '.ge.' instead of '>='.

I share your concern, @mkaro. Since this is a new feature and nothing similar exists in PBS, would it help if we enforced the quoted syntax irrespective of which operator the user is trying to use? If you think that would not be a reasonable thing to do, then we can go with what you are proposing and use gt, ge, lt and le as operators. This would also mean that we will have to transform the expression to replace the operators, because we will end up running it as a Python expression.
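
If we go that route, the transformation could be a naive textual mapping done before evaluation; here is a minimal sketch (the operator spellings come from your suggestion, everything else is illustrative):

    # Map shell-safe operator spellings to the Python operators the
    # expression would ultimately be evaluated with.
    OPERATOR_MAP = {
        ".gt.": ">",
        ".ge.": ">=",
        ".lt.": "<",
        ".le.": "<=",
        ".eq.": "==",
        ".ne.": "!=",
    }

    def to_python_expression(filter_text):
        for word, op in OPERATOR_MAP.items():
            filter_text = filter_text.replace(word, op)
        return filter_text

    # e.g. "node['resources_available']['ncpus'] .ge. 12"
    #  ->  "node['resources_available']['ncpus'] >= 12"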

Also, during a recent discussion with @bhroam we found that we have overlooked one major issue in the approach: security. If we treat node_filter as a Python expression and the scheduler runs it in an interpreter with root privileges, then that is a problem.

Some of the things we could do are the following (@bhroam, please correct me if I’m wrong):

  • Run the scheduler as a non-root user. This way the expression will not be executed as root, but it will still be executed with the scheduler user's privileges, which means the fairshare usage file, resource group file, and sched_config would still be accessible to the user.
  • Have the scheduler impersonate the job user before running the expression. This is achievable, but it would have a huge impact on the scheduler's performance.

There could be other ways to address this problem as well, but these are the two that came out of the discussion.
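
To illustrate why the second option is costly, here is a rough sketch of what impersonating the job user for each evaluation might look like (illustrative only; this is not proposed implementation code):

    import os
    import pwd

    def evaluate_as_user(username, expression, node):
        """Fork, drop to the job user's uid/gid, evaluate, report back."""
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: drop privileges, then evaluate
            os.close(read_fd)
            entry = pwd.getpwnam(username)
            os.setgid(entry.pw_gid)
            os.setuid(entry.pw_uid)
            result = bool(eval(expression, {}, {"node": node}))
            os.write(write_fd, b"1" if result else b"0")
            os._exit(0)
        os.close(write_fd)  # parent: collect the result and reap the child
        result = os.read(read_fd, 1) == b"1"
        os.close(read_fd)
        os.waitpid(pid, 0)
        return result

    # A fork/setuid round trip like this per evaluation (or even per job)
    # is what makes the impersonation approach so expensive inside a
    # scheduling cycle.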

That’s a great point about security @arungrover. While it’s one thing for an admin to define a Python snippet in a PBS Pro configuration file that later gets evaluated, it’s a completely different story when users are able to pass Python code as part of their job submission. There are plenty of ways to crash the interpreter or (as you mentioned) obtain elevated privileges… https://stackoverflow.com/questions/35804961/python-eval-is-it-still-dangerous-if-i-disable-builtins-and-attribute-access

For this very significant reason, we should not use Python to evaluate anything a user has provided. Would you agree?

Hi everyone,

I have updated the design approach for this enhancement. I encourage everyone to have a look at the design document again and provide comments.
Please follow this link - https://pbspro.atlassian.net/wiki/spaces/PD/pages/51249415/STALLED+PP-507+To+support+jobs+to+be+able+to+filter+out+nodes+by+allowing+them+to+use+conditional+operators+in+resource+reque

Thanks,
Arun

Hi All,

I’ve updated the document again. Following are the updates I’ve made -
1 - The filter can only use 'resources_available' for filtering the nodes.
2 - Changed the syntax of the filter to support evaluating it as a Python expression.
3 - Changed the order in which the scheduler will apply filters. The order will now be determined by the scheduler to optimize for efficiency.

Please have a look again.

Thanks,
Arun

Thanks @arungrover. A few thoughts/questions:

  1. Does this statement in the EDD:

"While applying the filter, if scheduler encounters a resource in the filter that is not present in the 'resources' line of sched config then the scheduler will ignore that resource while filtering the nodes."

enable this feature to work seamlessly with the new node-bucket placement algorithm, or are there other considerations needed to achieve compatibility with this feature?

  2. There is a need for this feature (when used to get jobs onto specific hardware generations) to work in conjunction with existing placement set functionality to maintain network topology awareness when scheduling jobs. I see that you have partially addressed this in the caveats section, but based on my understanding the suggestion to use something like lnode_filter="color=Green or color=Red" misses one of the key benefits of placement sets, which is how the sets are sorted. The filters are applied in any order, so even if a submitter wanted to "or" the filters together from the smallest group to the biggest this would not work (and it is not a burden I'd want to put on submitters anyway). Am I missing something about this section?

  3. If my understanding of 2) is correct and we cannot make this feature work with existing placement set functionality, what about adding a new keyword "PBS_ANY" (or something) to say -lnode_filter="color=PBS_ANY", where it resolves to any single consistent value for the resource present on the nodes, and have the filter results returned in a sorted order (see the sketch after this list for what I have in mind)? This is probably wildly detrimental to performance and may be a completely terrible suggestion, I realize. Would doing this allow us to effectively replace placement sets as a whole? I think not, unless the filters that get separated by "or"s were evaluated left to right.

  4. Minor formatting: The third bullet point under caveats should be indented, I think.
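
Regarding 3) above, here is roughly what I had in mind for how PBS_ANY could be resolved and sorted (illustrative only; the names and behaviour are my own assumptions, not anything from the EDD):

    from collections import defaultdict

    def resolve_pbs_any(nodes, resource):
        """Group nodes by each distinct value of 'resource' and return the
        groups smallest-first, mimicking how placement sets are sorted."""
        groups = defaultdict(list)
        for node in nodes:
            value = node["resources_available"].get(resource)
            if value is not None:  # the resource must be set on the node
                groups[value].append(node)
        # Each group is a candidate "single consistent value"; trying the
        # smallest groups first mirrors placement set ordering.
        return sorted(groups.values(), key=len)

    # e.g. for -lnode_filter="color=PBS_ANY" the scheduler would try the
    # Green nodes as one group, the Red nodes as another, and so on,
    # rather than mixing colors within one job.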

Thanks!

For the sake of clarity, how will the Python expression be evaluated? As discussed previously, there are security concerns for using a Python interpreter to perform the evaluation. Will the syntax be “Python like” but the actual evaluation will be handled by code that is internal to PBS Pro?

Thanks @scc and @mkaro for reviewing the document; below are my comments -

This feature should work with the bucket placement algorithm the same way a placement set would. Each filter should actually get its own group of node buckets to look at.

Can you explain a bit more about the problem here? Are you concerned that the filters will have no weighting, thereby giving users no control over which filter is looked at first? Or is it that you want users to be able to create multiple node groups by specifying one filter and have PBS sort these node groups (small to big)?

If I understand correctly, you want something like "pbs.same" as mentioned by Greg in one of his comments above. I may be wrong, but to me PBS_ANY would mean any resource value is fine as long as the resource is set on the node. Having a keyword like pbs.same (or PBS_SAME) would give the same effect as node grouping without requiring the user to specify all the desired values.
I guess this can be done, but I wanted to keep the syntax free of special keywords, because the idea was to have a syntax that can be treated as a Python expression and potentially have it evaluated someplace else (maybe in a hook).

@mkaro You are right that there are security concerns with evaluating it using a Python interpreter. The idea was to keep the syntax Python-centric because, in the future, if the scheduler is run under user privileges it could potentially evaluate it using an interpreter, or the expression could be evaluated by one of the hooks running under user privileges. This is why a "Python-like" syntax will help us keep the same interface while retaining the flexibility to change the underlying implementation.

Thanks @arungrover, but I still have concerns. There are ways to restrict eval() in Python to a small set of commands, but we would have to be very restrictive. There is a short description of this here: http://lybniz2.sourceforge.net/safeeval.html
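
For reference, the approach described there boils down to something like this (a minimal sketch; the allowed-name set is illustrative):

    import math

    # Expose only a small, known-safe set of names and strip the builtins,
    # so the expression cannot reach open(), __import__(), etc. directly.
    ALLOWED_NAMES = {"abs": abs, "min": min, "max": max, "sqrt": math.sqrt}

    def restricted_eval(expression, node):
        return eval(expression,
                    {"__builtins__": None},
                    dict(ALLOWED_NAMES, node=node))

    # Even so, as the StackOverflow link earlier in this thread points out,
    # determined users have found ways to escape such sandboxes, so this
    # reduces the risk rather than eliminating it.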

Even if the scheduler runs as a non-root user, it is unlikely to be running as the user submitting a job, so I believe there is still a security concern. If I understand correctly, the initial implementation will NOT utilize a Python interpreter. Is that correct? We may consider changing this in the future if we find a secure way to evaluate the expression in an interpreter.