PP-506, PP-507: Add support for requesting resources with logical 'or' and conditional operators

The initial idea was that the scheduler would give preference to the first select spec, and the other select specifications would act more like a plan B for the job.
Based on this assumption, the scheduler was going to sort jobs using the first select specification only.

If we start sorting on each resource request of a job, we will end up looking at the same job multiple times. I think it will be more time consuming for the scheduler to figure out which resource request it can choose to run the job.

It will take more time, because there is more data to consider. That’s unavoidable. I don’t think the scheduler should give preference to any of the options the user presents. Users are simply expressing that their job is flexible when it comes to allocated resources. In exchange for being flexible with the requirements, the job is more likely to get scheduled for execution sooner than if it had presented only one resource request.

I think we need to look at this from the user’s perspective. What behavior is the most intuitive? They just want to get their job to run so they can do their work. If being flexible with resource requirements helps them do that, then they will make use of the feature. I think backfill is a good scenario to consider. If my job can look like a wide rectangle, a square, or a tall rectangle, then chances are I’ll be able to find a place for it sooner.

I think it would be best to discuss the syntax for specifying multiple resource specifications in a meeting.

I’d request everyone who is interested to join a GoToMeeting this coming Monday (Feb 6 @ 10:30 am PST).

Here are the meeting details:

  1. Please join my meeting, Feb 6, 2017 at 10:30 AM PST.
    https://global.gotomeeting.com/join/205581149

  2. Use your microphone and speakers (VoIP) - a headset is recommended. Or, call in using your telephone.

United States: +1 (571) 317-3117
Australia: +61 2 9091 7603
Brazil (toll-free): 0 800 047 4902
Canada (toll-free): 1 888 299 1889
Canada: +1 (647) 497-9373
China (toll-free): 4008 866143
France: +33 (0) 157 329 481
Germany: +49 (0) 692 5736 7208
Greece (toll-free): 00 800 4414 4282
India (toll-free): 000 800 852 1424
Israel (toll-free): 1 809 452 627
Italy: +39 0 291 29 46 27
Japan (toll-free): 0 120 242 200
Malaysia (toll-free): 1 800 81 6860
Mexico (toll-free): 01 800 083 5535
South Africa (toll-free): 0 800 555 451
South Korea (toll-free): 0806180880
Spain: +34 932 75 1230
Sweden: +46 (0) 775 757 471
Taiwan (toll-free): 0 800 666 846
United Kingdom: +44 (0) 20 3713 5011

Access Code: 205-581-149
Audio PIN: Shown after joining the meeting

Meeting ID: 205-581-149

I probably won’t have a chance to join the meeting, so I’ll drop a few thoughts in here.

  1. Approaching this from a high level, I consider that there’s a function f(job) that describes all the possible “resource requests” that would make that job useful to the user. I define a single “resource request” as everything a qsub might ask for (nodes, memory, walltime, the works).

  2. As most of our users currently understand PBS, they have to pick one resource request when they qsub. A notable exception to this is shrink-to-fit (STF) walltime. STF walltime lets PBS explore a continuous slice of possible resource requests. Specifying consumable resources can also represent a type of slice, since qsub’s request is treated as a minimum value in some cases.

  3. Most of what I see discussed in this topic improves the life of users by allowing them to convey more of f(job) to PBS by describing more than one discrete resource request (or slice in the case of STF walltime). It may be out of scope, but it’d be interesting to go further down the path of resource request slices. Some discussions with NAS have pondered STF vnodes, but it’s never reached high enough priority to get beyond discussion.

  4. In the near term, a feature like the --OR described by Bill would be useful to NAS; our users could be guided to ask for different numbers of different types of nodes to meet core count and memory needs. This is because we currently require that jobs request specific node types: users have to ask for Ivy Bridge or Haswell or …, they can’t just say “give me the next free thing, it doesn’t matter what it is”.

  5. In the mid-term/long-term our goal is to get users back to describing resource requests as attributes, e.g. a job needs X gigs of memory per rank, with N ranks, distributed in Y fashion, etc. Once we have that in hand I could easily see us having renewed interest in resource slices as described in 3.

PP-506 & PP-507 — Update from 2017-02-06 Public Design Discussion

Attendees -
@billnitzberg, @scc, @jon, @bhroam, @mkaro, @smgoosen, @arungrover

Consensus on guiding principles for this design:

  • C1 - Extend the resource selection options with a focus on flexibility. So, although many of the use cases are site-wide policies, and could be (better?) implemented via sched_config options, this design proposes using per-job resource requests, as that is more flexible overall.
  • C2 - Target optimizing job start times. It is understood that optimizing for other measures such as end time, cost, power, utilization, and reliability is important, and other parts of PBS Pro focus on these areas; they’re just not the focus of this feature.
  • C3 - Target SysAdmin use cases (90% of focus); secondarily, target power users (10%).
  • C4 - Target automation and tools when designing new language syntax (90% of focus) above the secondary consideration of human readability (10%). E.g., make parsing and tooling easier to code, and don’t worry about shell quoting.
  • C5 - As always, consider backward compatibility to protect existing investments by the PBS Pro community in tooling, e.g., hooks and tools that parse qstat output and accounting logs. So, it is better to offer an entirely new syntax that supports these new capabilities, and also continue to support the existing (-l, select, place) (unmodified) syntax for backward compatibility.

General Use Cases:

  • G1 - Start jobs sooner by providing additional allocation choices (with preferences to partly balance versus other competing goals)
  • G2 - Increase efficiency (better cost, speed, power use, reliability) by better matching job classes with preferred resources (e.g., XYZZY jobs run better on big memory nodes, ABCDE jobs run better on AMD cpus, but they all run everywhere, just not quite so well)
  • G3 - Prevent erroneous allocations (e.g., application requires Linux 2.6 or later)
  • G4 - Adjust request based on the allocation itself.

Specific example use cases:

  • U1 - Run on new hardware if available… if not, run on older hardware
  • U2 - Don’t use big memory nodes… unless there’s nothing else
  • U3 - Run low priority jobs on old hardware, if available… but, if they aren’t available, then use the new hardware. And, vice-versa with high-priority jobs.
  • U4 - XYZZY software runs better on big memory nodes, but is OK on small memory nodes, and depending on where XYZZY runs, it needs different numbers of licenses
  • U5 - Do what LSF does in terms of boolean & conditional resource requests — note: it was agreed that this was not well-defined enough, and would need a lot more definition to be truly useful.
  • U6 - My job needs Red Hat 7 or higher; or only runs on Linux
  • U7 - Assuming nodes are appropriately labeled, ensure my job gets whole nodes and fills them up (e.g., ensure the job gets 16 cores on 16 core nodes or 24 cores on 24 core nodes)
  • U8 - (Deleted)
  • U9 - Adjust execution time limits based on which type of nodes is allocated (for example, a job may take less walltime to run on new hardware compared to old hardware)

[Most of these use cases can be met even today with existing PBS; it’s just that it may not be able to satisfy all of them together.]

Concerns / Discussion:

  • CD1 - Filter idea: another workload manager breaks this problem into two steps: a filter step (to choose a subset of nodes) and an allocation step (to allocate resources from the chosen subset). A strategy like this might allow PBS Pro to provide good backward compatibility (by adding only a new “--filter” pseudo-resource, for example), but it would not support one of the major use cases (G4), so this direction was abandoned.

  • CD2 - There is a potential conflict between SysAdmin-defined policy (e.g., big jobs get top priority) and individual job “policy” (e.g., job X requests ncpus=10 || ncpus=1000). One way to address this “conflict” could be to treat each resource request separately (e.g., job X is prioritized by the scheduler as two separate jobs, one asking for ncpus=10 and another asking for ncpus=1000). More thought is needed here… (a rough sketch of this idea follows below)
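
A rough sketch of that “treat each request separately” idea (hypothetical data layout, not PBS code): for prioritization, the scheduler could expand each job into one pseudo-job per ORed resource request, sort the combined list as usual, and still only ever start one of them.

    # Hypothetical sketch: expand a job with N ORed resource requests into
    # N pseudo-jobs so each request is prioritized on its own merits.
    def expand_for_sorting(jobs):
        pseudo_jobs = []
        for job in jobs:
            for i, request in enumerate(job["resource_requests"]):
                pseudo_jobs.append({"id": "%s[%d]" % (job["id"], i),
                                    "request": request})
        return pseudo_jobs

    # Example: job X asking for ncpus=10 || ncpus=1000 becomes two entries,
    # X[0] with {"ncpus": 10} and X[1] with {"ncpus": 1000}.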

Next Steps:

  • Review alternative resource request language, e.g., OGF JSDL, OASIS, … then hold follow-up meeting

Attendees: @billnitzberg, @bhroam, @scc, @mkaro, @altair4, @jon, @arungrover, @bayucan
Here is the outcome of the discussion from the meeting:

Proposal

  • qsub -A "abcd" -l "select=3:ncpus=1:mem=2gb:rhel>=6:color!=blue" -lscratch=5gb -lplace=pack -lgtmpfs>2gb --OR
    -l "select=1:ncpus=3:mem=6gb:centos>=5" -lscratch=100gb -lgtmpfs>2gb -lplace=free job.scr

Syntax restrictions (in the interest of time) that still allow future extensions:

  • Between --OR sections, each key can only show up once in each select.
  • Keys for consumable resources can only accept '='.

Long meeting – thanks to all who attended!

Just to add a bit more info from the discussion:

  • No matter what syntax is used (extending select or JSON or ?), existing hooks will need to be modified to accept the new syntax. There are two ways to mitigate some of the pain:

    • A submit hook that rejects all jobs with any new syntax would allow admins to turn off the new syntax entirely, so existing hooks will work (and rogue users would not be able to cause havoc; a sketch of such a hook appears after this list), and
    • Have the implementation provide backward compatibility with the old syntax after the job is started. So, only submit and alter hooks would need to be updated to deal with the new syntax; other hooks (MOM, and anything run after start) could use the older syntax without issue. The new syntax would still be available, just not in a way that interferes with existing (non-submit, non-alter) hooks.
  • There’s still lots to define regarding the above proposed syntax, e.g.,

    • After I submit a job with the above, what does qstat -f look like? Today it has a single Resource_List.select and a single Resource_List.scratch, etc.
    • Are there other restrictions we should apply (to speed up development) that will not affect the current targeted use cases? For example, maybe we should not allow comparisons for job-wide resources (e.g., disallow gtmpfs>2gb).
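
To illustrate the first mitigation above (a submit hook that turns the new syntax off entirely), here is a minimal queuejob hook sketch. It assumes, hypothetically, that the new syntax surfaces as a Resource_List entry named "nfilter"; the real attribute names depend on whatever syntax we settle on.

    # Minimal sketch of a queuejob hook that disables the new syntax site-wide.
    # "nfilter" is a placeholder resource name, not a settled interface.
    import pbs

    e = pbs.event()
    j = e.job

    if j.Resource_List["nfilter"] is not None:
        e.reject("Jobs using the new multi-request/filter syntax are not accepted here")

    e.accept()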

02/21/2017

After a long discussion about how the job submission syntax might look and about issues/questions related to the interaction of this RFE with existing PBS features, the following proposals on the syntax came up:

1 - qsub -A "abcd" -lscratch=5gb -lplace=pack --OR -l "select=1:ncpus=3:mem=6gb:centos>=5" -lscratch=100gb -lplace=free job.scr

2 - qsub -A "abcd" -l "filter=rhel>=6:color!=blue" -l "select=3:ncpus=1:mem=2gb+2:ncpus=4:rhel=7" --OR …

3 - qsub -A "abcd" -l "filter=3:rhel>=6:color!=blue+2:color=blue" -l "select=3:ncpus=1:mem=2gb+2:ncpus=4:rhel=7" --OR …

4 - qsub -A "abcd" --SPEC -l "filter=rhel=~[^6\.[013456789]$|^7\.[0-9]$]:color!=blue" -l "select=3:ncpus=1:mem=2gb+2:ncpus=4:rhel=7" --OR
-l "select=1:ncpus=3:mem=6gb" -l "filter=centos>=5" -lscratch=100gb --END_SPEC -lplace=free job.scr

5 - qsub

#PBS -A "abcd"
#PBS --SPEC -l "filter=rhel=~[^6\.[013456789]$|^7\.[0-9]$]:color!=blue" -l "select=3:ncpus=1:mem=2gb+2:ncpus=4:rhel=7" --OR -l "select=1:ncpus=3:mem=6gb" -l "filter=centos>=5" -lscratch=100gb --END_SPEC
#PBS -lplace=free

6 - qsub
#PBS -X -l "filter=rhel=~[^6\.[013456789]$|^7\.[0-9]$]:color!=blue" -l "select=3:ncpus=1:mem=2gb+2:ncpus=4:rhel=7"
#PBS -X -l "select=1:ncpus=3:mem=6gb" -l "filter=centos>=5" -lscratch=100gb

7 - qsub
#PBS -lmulti_select="2:ncpus=3:mem=2gb||2:ncpus=4:mem=1gb"
#PBS -ljob_wide="walltime=720+place=scatter||walltime=480+place=pack"
#PBS -lfilter="centos>=5"

8 - qsub
#PBS -Wspec (count=16 ncpus=8 mem=16gb color=red) OR (count=8 ncpus=16 mem=32gb color=red)

9 - {-lselect=4:ncpus>=2:mem=[%ncpus% / 2]gb -lwalltime=[(3600 / %ncpus%) + 300]}
 
{
    # Job wide parameters
    {-joe -N myjob -A physics} AND
    # Resource options
    {
        # Red cluster requirements
        {
            # Small memory nodes
            {-lselect=16:ncpus=8:mem=16gb:color=red -lwalltime=600} OR
            # Big memory nodes
            {-lselect=8:ncpus=16:mem=32gb:color=red -lwalltime=600}
        } OR
        # Blue cluster requirements
        {
            # Small memory nodes
            {-lselect=16:ncpus=4:mem=8gb:color=blue -lwalltime=1800} OR
            # Big memory nodes
            {-lselect=8:ncpus=8:mem=16gb:color=blue -lwalltime=1800}
        }
    }
}
 
 
Discussion about the syntax:
- Option 1 can be broken down into two pieces, select and filter, so the way we specify the select specification does not have to change. This resulted in Option 2, where nodes can be filtered by specifying a filter with each job, and the select specification shows the resources to be assigned.
- Option 3 is an extension of Option 2 with a different filter for each chunk specified in the select spec. We may not need it right now, but it shows that the filter concept is extensible.
- Option 4 wraps all of the ORed selects and filters in a new attribute of sorts. It starts with --SPEC and ends with --END_SPEC (we may decide to name them something else).
- Options 5 to 8 show how we would probably specify these job submissions using PBS directives. Options 5 and 6 also make the filter a regular expression (maybe similar to that in bash) that would help in filtering nodes. Maybe we can make it a Python expression? We don't know as of now.
- Option 9 is another way of specifying job submission requests. It has the advantage of computing some of the resources based on what the job is selected to run with. It is readable and more machine-centric.
 
Questions about the syntax:
- Do we need a global filter, i.e., one filter that is applied to all of the ORed select specifications?
- Do we need to filter on job-wide resources as well?
  → There is no requirement to filter on job-wide resources as of today.
- When do we do the filtering? At job submission time or at job start-up time?
  → A node can change its resources (like color) at any time, so filtering should be done at the time of job scheduling (in other words, at job start-up time).
- How do we pass multiple select specs and filters to hooks?
  → They can be passed as a list of dictionaries, where each dictionary is one ORed select spec. Filters can also be passed as expressions that are evaluated against node resources to filter down the nodes (a sketch of a possible structure follows below).
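
A possible (hypothetical) shape of that hook-facing data, assuming the filter is carried along as an expression string; this is only a sketch of the structure, not the actual hook API:

    # Hypothetical structure handed to hooks: one dictionary per ORed select
    # spec, plus the filter expression that applies to it.
    resource_requests = [
        {"select": "3:ncpus=1:mem=2gb", "place": "pack",
         "scratch": "5gb", "filter": "rhel>=6 and color!='blue'"},
        {"select": "1:ncpus=3:mem=6gb", "place": "free",
         "scratch": "100gb", "filter": "centos>=5"},
    ]

    # A hook could then inspect each alternative independently, e.g.:
    for req in resource_requests:
        print(req["select"], req.get("filter"))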
 
Conclusions based on the discussions:
- With this change, hooks will need changes and there is no way around that.
- There is no reason to have a per-chunk-level filter.
- There is no need to support conditional operators in job-wide resources.
- We should try not to overload select and instead come up with something new to specify ORed specifications.
- OR can only be applied to resource requests or filters, not to other job parameters like account strings, the block attribute, etc.
 
Discussion about how this requirement interacts with other PBS features:
- Limits:
  - Queued limits (queued and running resources):
    → When a job gets queued, the max of each resource across all the ORed resource requests is checked against the queued limits, and the job is then accepted/rejected based on that (see the sketch after this list).
  - Queued threshold limits (only queued resources):
  - Run limits (hard and soft limits):
    → Run limits in their current form are applied before finding the node solution, and the scheduler may continue to do so if a job with multiple ORed resource requests appears multiple times in the jobs' political ordering.
- Job sort formula/Fairshare:
  → Fairshare is probably okay with this change because it is computed from a running job's resources used. The job_sort_formula computation can result in a job appearing N times in the list, where N is the number of ORed resource requests it was submitted with.
- Jobs inside reservations:
  → Jobs with ORed resource requests should be allowed to be submitted to a reservation even when one or more of the resource requests exceed the resources assigned to the reservation (at worst, the job may not run).
- Array jobs:
  → An array job with ORed resource requests can potentially end up with each subjob running with a different resource request.
- Reservations with the new resource request language:
  - Advance:
    → It is okay to have reservations requesting ORed resource specifications.
  - Standing:
    → Each occurrence of a standing reservation may get a different resource signature depending on what it was chosen to run with.
- Requeued jobs:
  → When a job with ORed resource requests is requeued, all of its resource requests are considered again. It may get rejected based on the state of the queue at that time.
- Routing queues:
- Checkpoint abort:
- Compatibility with existing hooks dealing with select and place:
- New resource request language/filter used with queue/server defaults or min/max_resources:
  → Maybe the filter can be used to specify max/min resources, and this filter could also be a Python expression that makes the decision based on the select spec it comes across.
- Calendaring/Top-N jobs/EST:
- Eligible time:
- Accounting logs:
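
To make the queued-limits rule above concrete, here is a small sketch (assumed data shapes, not server code) of taking the per-resource maximum across all ORed requests before checking the limits:

    # Sketch: combine ORed resource requests for the queued-limit check by
    # taking the maximum of each resource across all alternatives.
    def max_request(resource_requests):
        combined = {}
        for req in resource_requests:
            for name, value in req.items():
                combined[name] = max(combined.get(name, value), value)
        return combined

    # e.g. [{"ncpus": 10, "mem_gb": 20}, {"ncpus": 1000, "mem_gb": 4}]
    #  ->  {"ncpus": 1000, "mem_gb": 20}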
 
Open questions:
- When do we put a job with multiple resource requests on the calendar? What happens when a job gets onto the calendar using the first of its resource requests but could actually run with the second one? Do we put it on the calendar with the request that we know will run earlier compared to the other resource requests?
- What happens when different resource requests lead to different decisions about accruing eligible time?

02/28/2017

@billnitzberg came up with an interesting way of looking at the problem we are trying to solve with multiple ORed resource requests.
Here is what he brought up:

We are trying to solve the problem of running one job which can run with one of multiple specified resource requests. If that is the case, then instead of having multiple ORed resource requests we can probably expose a way to submit multiple jobs, have all of these jobs linked, and run only one of them. This approach would be well defined, and PBS would continue to handle the jobs the way it handles them now.
Doing this will also save us from a lot of machinery that we would otherwise have to add to deal with these special jobs, such as making them appear in different places while sorting jobs, etc.

Here is the syntax we could possibly use if we go this way.
There are probably two ways we can do this using already existing constructs in PBS (there might be more, but I could only think of two):
1 - Use syntax similar to what we use for array jobs and somehow specify that only one of the jobs will run.
2 - Use the way we make jobs dependent on each other and then make only one of them run.

03/03/2017

@subhasisb proposed:
Implementing this like job dependency is probably easier: add a new condition for the dependency between multiple jobs. But syntactically it does not sound like that clean an approach – you would have to submit multiple jobs with conditions between them, and somehow tell PBS that your “group” is now complete, i.e., has all the alternatives you wanted to specify.

Maybe we can do something easier. Add a new job attribute called “job group” (or call it job-alternatives, etc. – I am sure we can find a better name). So, if a job specifies a new group name (an arbitrary string as the value), then it automatically becomes part of that group (possibly a new one). We may not even need a way to close the group. Given our current requirement, we can say that you can add to a group only as long as no job from that group is already running. As long as no job from the same group is running yet, you can submit another job to the same group. This is easy to implement and satisfies all our requirements.
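
A tiny sketch of that joining rule (server-side logic only, with assumed job state letters), just to pin down the semantics:

    # Sketch of the proposed rule: a new job may join a job group only while
    # no member of that group has started running yet ("R"/"E" states assumed).
    def may_join_group(group_jobs):
        return all(job.state not in ("R", "E") for job in group_jobs)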

Advantages of this proposal:

  •      With the array jobs syntax, unless you change a lot, we would be required to submit everything in one shot – increasing complexity.

  •      With job dependency, it’s better – but it still has the semantics of holding the jobs till everything is set and then releasing the hold; besides, it would be more difficult to shoehorn this into dependency jobs.

  •      With the concept of a dynamic pool, you submit a job into the pool and can continue to submit more jobs with varying resource requests to the same pool – no holding, no end time. One can keep submitting as long as no job from the pool is running yet. Pitfall: somebody could use this “freeform” string name and submit to the wrong pool.

Here is a small comparative study of how each of the syntax proposals would affect or behave with some of the existing features/commands in PBS.

03/08/2017

@jon mentioned -

Maybe I see things differently, but I am not really enthused about the job-group idea as far as I understand it. I have the following concerns: for users to submit this, it requires that they
- place a hold on the job,
- submit multiple jobs, and
- release the hold.
Now, I know we are targeting this for machines, but some of it (10-20%) will be done by users (or their scripts), and I don’t know how their job submission scripts will do this without creating a job-group submission script that submits their jobs. Also, this approach will trigger multiple additional job submissions along with additional overhead on the server (i.e., job submission, queuejob hooks, file copy, etc.). The additional job scripts will also negatively affect the database size. As for deleting jobs, we would have multiple jobs to delete, which will require qdel rework to understand job-groups. Another concern is what happens if the user submits a short-running job without holding it and it finishes before they can submit the subsequent jobs? Will the job be rerun? And finally, how does the user define the job-group? Does the user have to come up with unique job-group names for every job-group they submit?
I would argue that we should try to do something that can easily be done in our standard job submission scripts that the user can submit once. I know this may be harder to implement, but the above does not seem user/admin friendly.

03/14/2017

After another round of meetings, there is some convergence on the design. The following is the outcome:

Consensus:
a. Schedule each individual resource request as a separate job (but only run one)
b. Add new "--filter" syntax to handle non-consumable conditionals, and make it apply job-wide
c. Each resource request should be individually addressable (have a handle / ID)
d. Design must support resource requests from the same “job pool” appearing in different queues (but doesn’t need to be implemented now)
e. Design should allow subsequent job submission to be added to the same “job pool”

No compelling use cases for:
i. Atomic submissions (all or nothing in terms of resource requests)

Some, but not full consensus:
x. For now, don’t worry about the potential impact on qsub performance, e.g., each resource request could incur a full “qsub time”

Top contenders for syntax:

  1. job_pool with some way of saying “all of them”
  2. 123{0}.bill.co

Should the design support job arrays?
a. job array of 10 elements on either red nodes or blue nodes
b. 10 red nodes OR job array of 20 jobs, each with 1 blue node

I have some serious concerns about the viability of the multiple job approach:

We already have sites with very large job volumes who find the recovery of queues and jobs to be burdensome. In some cases, the overhead of job recovery is bad enough to make server restarts extremely disruptive and failover practically useless. In some cases, the job volume is such that history cannot be enabled without running the risk of crashing the server (I know of two US sites where this has occurred). This approach is guaranteed to make these problems worse by clogging up the queues with phantom jobs which we’re eventually going to delete anyway. At best, it will just make the server less responsive.

I just don’t think it’s a scalable approach.

@sgombosi, you bring up an interesting point about server start-up.
You are right that multiple jobs would make server start-up take more time. Maybe we can do some optimization on how we load all the pool jobs (for example, if a pool job is running, load only that running job), or delete all the other pool jobs as soon as one starts running. I’m not sure; I do not have a definitive answer for the concern you have raised.

This functionality is expensive. It not only takes additional time in the server but also in the scheduler when it tries to run a job with all the specified resource requests (and that time is much more than the time the server will take to handle these jobs). There isn’t any requirement which says this, but I was expecting only a handful of such jobs.

If we don’t create multiple jobs then yes, the server will not take time to enqueue these jobs (at queue time or at startup), but internally the scheduler will need to create these phantom jobs every time a scheduling cycle runs, until we get the job running on one of the specified resource requests. The single-job approach also needs to define a lot of unknown semantics: how limits (queued and run) will be applied to each resource request, how default resources on queues are applied to each resource request, and it may also result in an array job having each of its subjobs running with a completely different resource request when submitted with this syntax.
Both approaches have their own ups and downs.

I guess my point is this:
We cannot possibly avoid additional overhead in the scheduler, because the scheduler just has more decisions to make (no matter how this is implemented). However, we can avoid (mostly) additional overhead in the server - and in my experience bogging down the server has far worse consequences than bogging down the scheduler (“split-brain” due to false failovers, total lack of responsiveness to external commands, and slow scheduling). I would take a long scheduling cycle over a brain-dead server any day of the week. A slow scheduler reduces system throughput - a slow server makes a system unusable. One is an annoying inconvenience, the other is potentially a catastrophe.

I get that queueing jobs in the server is going to affect its performance. But in the future we may have multiple servers catering to client requests, and I’m hoping queueing the jobs wouldn’t be as expensive as it sounds today.

Thanks for posting the updated v.10 design (https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=49865741).

I really like the direction this is going, and especially that backward compatibility may be more easily accommodated, e.g., the new filter resource is a restriction, so existing qsub hooks (many of which are admission control gates) are likely to behave correctly without modification, and that the select & place syntax is unchanged (so, again, no hooks code needs to be changed there). Obviously, some changes will be necessary if a site wants to support the new capabilities, but this design may lessen the “backward compatibility re-engineering load”.

A few comments:

In case the scheduler finds out that it cannot run such a job because of resource unavailability and tries to calendar the job so that resources can be reserved for it in the future, it will use only the first resource specification that it encounters in its sorted list of jobs to calendar the job.

What issue is this attempting to address? Unless there is a strong understanding of a known issue, it would be better to start by treating each job as a “regular job”: no caveats except the “only run one” behavior. (More caveats means more complexity, which means less resilience and less adoption – simpler is almost always better.) I would suggest dropping this (for now) and seeing what early adopters find to be the real issues (ideally, during a Beta). Then, if there is an issue, fix it, and fix it right.

If a running job which was initially submitted with multiple resource specifications gets requeued for any reason (like qrerun, node_fail_requeue, or preemption by requeue), the job will get reevaluated to run by looking at each of the multiple resource specifications it was initially submitted with.

If there is not a compelling use case for handling requeues, an alternative to this would be to change the semantics from “run only one to completion” to “start only one”, and once one job in a set is started, delete the rest. This would make it easier to define what happens for some operations (e.g., how to handle qmove to another server aka peer scheduling), and would also likely reduce implementation and test effort. Again, as in the above, one would want to adjust based on early adopter feedback.

Interface 1: New Job attribute called “job_set” - qsub option “-s”

  • Do we really need a single character option? Why not just use -Wjob_set= as the only interface?

  • If 103.mycluster.co is a job_set and 104, 105, and 106 are members, can I submit a job that has -Wjob_set=104?

When a job is requested with multiple select specifications, the PBS server will honor the queued limits set on the server/queue and run the job submission hook on each of the resource specifications. If one of the resource specifications is found to exceed the limits, then that resource specification will be ignored.

  • What is the output of such a command? Is it one job ID or many job IDs?

  • Not sure about ignoring a rejected request – shouldn’t the behavior be the same as if I submitted a job with qsub -Wjob_set= …, and wouldn’t that be to throw an error? What happens if all the requests are ignored?

Interface 3: Extend PBS to allow users to submit jobs with a node-filter (nfilter) resource

  • Suggest “filter” instead of “nfilter”.

To access a specific resource out of the resources_available and resources_assigned inputs, users must enclose each resource name within square brackets "[]", like this: "resources_available['ncpus']".

Q: Is this the same syntax used in PBS hooks and the Scheduler’s job_sort_formula? (Ideally, we should have only one syntax.) Sorry, I just can’t recall this one… If it is the same syntax, I suggest making that statement explicit.
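
For what it’s worth, if the nfilter expression ends up being evaluated with Python semantics (one of the possibilities floated earlier in this thread), a node-eligibility check could look roughly like the sketch below; the expression syntax and the use of eval here are assumptions, not documented behavior, and a real implementation would need a safer evaluator.

    # Rough sketch: evaluate an nfilter expression against one node's
    # resources_available, assuming Python-expression semantics.
    node = {"resources_available": {"ncpus": 16, "mem_gb": 64, "color": "red"}}
    nfilter = "resources_available['ncpus'] >= 8 and resources_available['color'] != 'blue'"

    eligible = eval(nfilter, {"__builtins__": {}}, node)   # True for this node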

Accounting logs…

  • It would be good to capture the whole workload trace data in the PBS accounting logs (while also minimizing the impact on existing accounting post processing tools). At least some way to represent that a job is part of a job set, and some way to capture that a non-run member of a set has been removed from the queue. (This probably requires more discussion.)

Wow, thanks!

Hi Steve,

Yes, the current implementation of loading jobs at startup and of history jobs is not scalable. That has little to do with this RFE; the server does not scale when there is a large number of jobs anyway. Sure, this change can add more jobs to the server, but it must be understood by users that using these alternate jobs does have an impact on the server. Whether we submit it as one job or not, if every job has an alternate there is a huge tax on the scheduler anyway, so end-to-end performance takes a toll regardless.

Currently, a large number of jobs at startup does not affect the server failover capability and does not contribute to the possibility of a split-brain (besides the comms between the two servers, we keep touching a shared file every few milliseconds, even while loading jobs from the database at startup). The worst that can happen is that the server runs out of memory or is unresponsive to commands for a long duration.

The point is, when we make job alternates an explicitly understood phenomenon, we transfer responsibility to the users/admins to restrict their usage. It’s not just the scheduler: with job alternates there is code inside the server that runs validations for limits and such anyway, and that will bog down the server somewhat. The additional overhead (beyond what would happen anyway) from the multiple job entries is negligible.

We have plans to implement the server as a stateless service (pending prioritization of that work), where we basically do not keep any job data in server memory (maybe a very small cache). When we have that, we can have millions of jobs in the history without much impact on server restart times.

Arun,

Thanks for posting the updated v.10 design (https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=498657411).

I have a few questions as well.

Users can submit jobs specifying the “-s” option during submission. This attribute can only take an already submitted job ID as a value.

So what would happen, if i do this:

qsub sleep.sh → 1.svr
qsub -s 1.svr sleep.sh → 2.svr
qsub -s 2.svr sleep.sh → 3.svr

Would the server recognize that, since 3.svr has a job_set ID of 2.svr, which is already part of job set 1.svr, it should make 3.svr part of the same job set?

Users can specify a node filter on node resources using conditional operators like "<", ">", "<=", ">=", "!=".

Since logical operations are supported, would we also support nested complex expressions? If not, that must be specified.

Interface 4: New job substate “JOB_SUBSTATE_RUNNING_SET” (95)

Do we need this new substate? How about using what array jobs use, like the Array Job BEGUN state, for the set? It’s always complex in the code when one adds a new job substate. If you add a state/flag for the “job set” object, that is fine. However, adding a job substate must be done very carefully. There is code all across the PBS codebase that takes specific actions based on the substate, and it could start failing due to the introduction of a new substate (unless all of those places are dealt with carefully).