PP-928: Reliable Job Startup

So increment_chunks() should just take care of that and document it. If the first chunk in the select spec has a count of 1, no change will be made to it. If its count is 2 or more, the increase will apply only to the chunks beyond the MS chunk (e.g. if the select is 3:ncpus=2, then increment_chunks(“50%”) would return 4:ncpus=2).

I see. I never even thought of it this way. Sure, I can make increment_chunks() behave this way. So given select=3:ncpus=2 as the first chunk, increment_chunks(“50%”) would leave “1:ncpus=2” alone, but apply the “50%” to the remaining “2:ncpus=2” so it becomes “3:ncpus=2”, and then put it all back together as “4:ncpus=2”. I’ll update the EDD and code.
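
A minimal sketch of the arithmetic described above; it is illustrative only (not the actual increment_chunks() implementation) and assumes every chunk in the select string begins with a count:

    import math

    def increment_chunks(select, percent):
        """Increase chunk counts by a percentage; the single MS chunk is left alone."""
        frac = float(percent.rstrip('%')) / 100.0
        out = []
        for i, chunk in enumerate(select.split('+')):
            count_str, _, rest = chunk.partition(':')
            count = int(count_str)
            if i == 0:
                # first chunk: keep the MS chunk as is, increase only the remaining count - 1 chunks
                count += math.ceil((count - 1) * frac) if count > 1 else 0
            else:
                count += math.ceil(count * frac)
            out.append('%d:%s' % (count, rest) if rest else str(count))
        return '+'.join(out)

    print(increment_chunks('3:ncpus=2', '50%'))   # prints 4:ncpus=2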

I’ve updated the design to incorporate comments from Bhroam and Greg.

  • The ‘tolerate_node_failures’ attribute’s type has been changed from boolean to a string with valid values: “all”, “job_start”, and “none”. “all” is for tolerating node failures at any point in the job run. “job_start” is for tolerating failures or errors only during job start.
  • The increment_chunks() select method has been updated to leave as is the first chunk, the one assigned to the primary mom.

The updated design is in:
Reliable Job Startup Design v13

I’ve updated the design, adding clarifications, and the latest is in (previous version is v13):

Reliable Job Startup Design v23

  • interface 1: Added the “from <node_host>” string to the message: “ignoring from <node_host> error as job is tolerant of node failures”

  • interface 4: Put in a restriction to the $job_launch_delay option on Windows as follows:
    “This option is currently not supported under Windows. NOTE: Allowing it would cause the primary mom to hang waiting on the job_launch_delay timeout, preventing other jobs from starting. This is because jobs are not pre-started in a forked child process, unlike in Linux/Unix systems.”

  • interface 5: Put in additional details on which sister hosts the primary mom considers unhealthy, whose vnodes are not chosen when a job is pruned via pbs.release_nodes(keep_select=X):

    • Any sister nodes that are able to join the job will be considered as healthy.
    • The success of a join job request may be the result of a check made by a remote execjob_begin hook. After successfully joining the job, the node may further check its status via a remote execjob_prologue hook. A reject by the remote prologue hook causes the primary mom to treat the sister node as a problem node and mark it as unhealthy. Unhealthy nodes are not selected when pruning a job’s request via the pbs.release_nodes(keep_select) call (see interface 8 below); a minimal sketch of such a prologue hook appears at the end of this post.
    • If there’s an execjob_prologue hook in place, the primary mom tracks the node hosts that have given an IM_ALL_OKAY acknowledgement for their execution of the execjob_prologue hook. Then, after the ‘job_launch_delay’ amount of time into job startup (interface 4), the primary mom starts reporting as failed those nodes that have not given their positive acknowledgement during prologue hook execution. This info is communicated to the child mom running on behalf of the job, so that vnodes from the failed hosts are not used when pruning the job (i.e. the pbs.release_nodes(keep_select=X) call).
    • If after some time, a node’s host comes back with an acknowledgement of successful prologue hook execution, the primary mom would add back the host to the healthy list.
  • interface 8: In regards to the pbs.event().job.release_nodes(keep_select=X) call:

    • This call makes sense only when the job is node failure tolerant (i.e. tolerate_node_failures=job_start or tolerate_node_failures=all), since that is when the lists of healthy and failed nodes are gathered and consulted by release_nodes() to determine which chunks should be kept and which should be freed.

    • Since the execjob_launch hook will also get called when spawning tasks via pbsdsh or tm_spawn, an execjob_launch hook that invokes release_nodes() should first check that ‘PBS_NODEFILE’ is in the pbs.event().env list. The presence of ‘PBS_NODEFILE’ in the environment ensures that the primary mom is executing on behalf of starting the top-level job, and not spawning a sister task. One can just add at the top of the hook:

      import pbs

      e = pbs.event()

      # PBS_NODEFILE is only present when the primary mom is starting the
      # top-level job; for tasks spawned via pbsdsh or tm_spawn it is absent,
      # so just accept and exit the hook in that case.
      if 'PBS_NODEFILE' not in e.env:
          e.accept()

      j = e.job
      pj = j.release_nodes(keep_select=...)   # fill in the desired select spec
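
A minimal sketch of the kind of remote execjob_prologue hook described under interface 5 above; the health check node_is_healthy() is a hypothetical, site-specific function, not part of PBS:

    import pbs

    e = pbs.event()

    def node_is_healthy():
        # hypothetical site-specific check (disk space, memory, required daemons, etc.)
        return True

    if not node_is_healthy():
        # Rejecting here leads the primary mom to treat this sister node as a
        # problem node, so its vnodes are not chosen by release_nodes(keep_select=...).
        e.reject("node failed local health check")

    # Only reached when the health check passes; the node stays on the healthy list.
    e.accept()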

Design document Reliable Job Startup Design (v26)
has been updated with the following info:

interface 1: The ‘tolerate_node_failures’ job option is currently not supported on Cray systems.
NOTE: This may work in a Cray system, as this would apply to tolerating node failures of cray_login nodes (i.e. service nodes), which are the ones that run pbs_mom, and these nodes are not part of an ALPS reservation.

interface 8: The pbs.event().job.release_nodes() call will fail (i.e. return a Python None) when the target job is assigned Cray X* series nodes (i.e. those nodes with “vntype=cray_” prefix as the resources_available.vntype value). The following DEBUG2 job-class, mom_logs message would be displayed:

“<job-id>;release_nodes(): not currently supported on Cray X* series nodes”
NOTE: The job’s assigned nodes/vnodes could be pruned to a smaller set via the release_nodes() call, and that set may include cray_compute nodes that are part of the initial ALPS reservation, but the ALPS reservation itself cannot currently be modified.
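
As a hedged illustration of guarding against this failure in an execjob_launch hook (the keep_select value and log text here are made up for the example):

    import pbs

    e = pbs.event()
    j = e.job

    pj = j.release_nodes(keep_select="1:ncpus=2")   # illustrative keep_select value
    if pj is None:
        # e.g. the unsupported Cray X* series case described above
        pbs.logmsg(pbs.LOG_DEBUG, "%s: release_nodes() did not prune the job" % (j.id,))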

Change from previous version (v26):
Reliable Job Startup Design (v27)

BEFORE:
interface 1: The ‘tolerate_node_failures’ job option is currently not supported on Cray systems.
Note: This may work in a Cray system, as this would apply to tolerating node failures of cray_login nodes (i.e. service nodes), which are the ones that run pbs_mom, and these nodes are not part of an ALPS reservation.
AFTER:
interface 1: The ‘tolerate_node_failures’ job option is currently not supported on Cray systems. If specified, the Cray primary mom would just ignore the setting.

Also, the following text has been removed from interface 8 relating to pbs.event().job.release_nodes(keep_select):

'The pbs.event().job.release_nodes() call will fail (i.e. return a Python None) when the target job is assigned Cray X* series nodes (i.e. those nodes with “vntype=cray_” prefix as the resources_available.vntype value). The following DEBUG2 job-class, mom_logs message would be displayed:

“<jobid>;release_nodes(): not currently supported on Cray X* series nodes”
NOTE: The job’s assigned nodes/vnodes could be pruned to a smaller set via release_nodes() call, and it may include cray_compute nodes that are part of the initial ALPS reservation, and the reservation cannot be modified right now.’

For the following item related to interface 8 (pbs.event().job.release_nodes(keep_select)):
“This call makes sense only when the job is node failure tolerant (i.e. tolerate_node_failures=job_start or tolerate_node_failures=all), since that is when the lists of healthy and failed nodes are gathered and consulted by release_nodes() to determine which chunks should be kept and which should be freed.”

The following info has been added: if release_nodes() is invoked and yet the job is not tolerant of node failures, the following message is displayed in mom_logs under DEBUG level:
“<jobid>: no nodes released as job does not tolerate node failures”

Another update to the design of Reliable Job Startup - ver 28
to make it work with pbs_cgroups:

  1. Additional note to Interface 8: pbs.event().job.release_nodes(keep_select) method:
    “If pbs_cgroups is enabled (PP-325 Support Cgroups), the cgroup already created for the job is also updated to match the job’s new resources. If the kernel rejects the update to the job’s cgroup resources, then the job will be aborted on the execution host side, and requeued/rerun on the server side.”

  2. New interface 9: new hook event execjob_configure
    used as an additional event allowing pbs_cgroups to respond to a change in the job’s resources as a result of a release_nodes() call.
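
As a rough sketch only (not the actual pbs_cgroups implementation), a hook body for this new event might look like the following; it assumes the pruned assignment is visible through the job’s exec_vnode attribute and merely logs where the real hook would resize the cgroup:

    import pbs

    e = pbs.event()
    j = e.job

    # By this point release_nodes() has already pruned the job's assignment,
    # so a cgroups hook could resize the job's cgroup to match the new resources.
    pbs.logmsg(pbs.LOG_DEBUG,
               "%s: resize cgroup to match exec_vnode=%s" % (j.id, str(j.exec_vnode)))
    e.accept()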

Looks good @bayucan.

I dislike the name execjob_configure. It makes it sound like we’re doing configuration on the job. We’re not really, we’re resizing the job.

I’ve polled around the office and come up with four suggestions:
  • execjob_resize - this one offers some ability to expand in the future if we ever implement the ability for a job to grow after it has started.
  • execjob_ramp_down (or rampdown) - based off of the feature that it is being called for.
  • execjob_release_nodes or execjob_relnodes - once again based off the command/method that is being called.
  • execjob_modify - we have a server hook called modifyjob; we could expand the scope here a little and call this hook when release_nodes() is called or when the server tells us to modify a job attribute. I find this one a little bit overloaded though.

What do you think?
Bhroam

@bhroam I can go with execjob_modify, potentially expanding this in the future to also include calling it when the server tells us to modify a job attribute. My second choice is execjob_resize, although that is more limited in scope. @mkaro: what do you think?

We already have a qalter command. How about execjob_alter? Might a user calling qalter ultimately trigger such an event? If so, I think it’s best we keep the name generic but consistent with existing nomenclature.

@mkaro: actually, the server hook event that responds to a qalter request is named ‘modifyjob’, which is why I like the ‘execjob_modify’ option. We don’t call it ‘qalterjob’. And yes, in the future, a user calling qalter might eventually end up executing this new ‘execjob_modify’ hook.

I’m fine with execjob_modify. What do you think @bhroam?

I’m fine with execjob_modify(), but I am worried that if we don’t implement the whole event now, it will be confusing. Until we do, we’ll have the events modifyjob() and execjob_modify(), which will do different things.

How hard is it to implement both, @bayucan?
Bhroam

I agree, that would be confusing. What is the difference in effort to implement the “whole event”?

@bhroam @mkaro: Implementing the whole event now requires that this hook be called when req_modifyjob() is called on the mom side, which happens when qalter is called and one of the following attributes is modified (based on the man page): mppnodes, mppt, cput, walltime, min_walltime, and max_walltime. This means that pbs_cgroups would also be called in this situation, and we don’t want to re-populate the job’s cgroups files since these modifiable attributes have nothing to do with cgroups. I’d rather not do this because it is scope creep for Reliable Job Startup; it doesn’t have anything to do with this feature.

If you’re worried that using execjob_modify would become confusing later on, perhaps we should just go with execjob_resize instead, so it applies only to this feature.

@bayucan
I find it strange that min_walltime and max_walltime are sent to the mom. They are resources the scheduler uses to set the walltime. The person who implemented the feature must have just copied the walltime block in the resource definition file. The mpp resources are long since deprecated and probably should be removed at some point. This leaves cput and walltime. These resources have nothing to do with cgroups. If we went the modify route, we’d have to inform the hook event what is being modified, so hooks can ignore anything they don’t care about.

The more I think about this, the simpler option is to call it execjob_resize and have it just for the release_nodes() call. While it seems specific to this problem, it is a very important event to have a hook for.

Bhroam

Given the context of this discussion, I agree that execjob_resize is a suitable name for the new hook event.

@bhroam @mkaro: Cool! execjob_resize it is. Thanks, guys.

I’ve updated Reliable Job Startup Design (v29), renaming ‘execjob_configure’ as ‘execjob_resize’.