I have posted a small design document to introduce a new “runone” job dependency, which lets users submit multiple interdependent jobs with different resource requests. PBS ensures that only one of the dependent jobs runs and that, once the running job ends, all the other dependent jobs are deleted.
This gives users the potential to submit the same job with different resource requests and have PBS select whichever resource request it can run sooner.
Please have a look at the design and provide comments.
I wouldn’t call it Interface 1. You are describing the whole feature.
Change your example where you submit J1 and J2 before submitting J3 with the dependency. While this will work, it is not how we want to tell users to submit them. If you submit J1 and J2 and then submit the dependent job J3, it’s possible that J1 and J2 start running before you submit J3 to tie them all together. You should submit J1, then J2 with a runone dependency on J1, and then J3 with a runone dependency on J1.
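To make the suggested submission order concrete, here is a minimal sketch that only builds the `qsub` argument lists for the jobs submitted after J1. It assumes the proposed dependency is spelled `depend=runone:<jobid>` in the style of the existing `-W depend=type:jobid` option; the exact spelling is up to the design document. The job ID `9.centos` and the script names are hypothetical, and nothing is actually submitted.

```python
def runone_submit_commands(first_job_id, scripts):
    """Build qsub argument lists for the jobs that follow the first one.

    Every later job names the FIRST job in its runone dependency, so each
    job is tied into the group at the moment it is submitted.
    The 'runone' dependency type is the proposed (assumed) syntax.
    """
    return [
        ["qsub", "-W", f"depend=runone:{first_job_id}", script]
        for script in scripts
    ]


# J1 is submitted first on its own; suppose qsub returned "9.centos" for it.
commands = runone_submit_commands("9.centos", ["j2.sh", "j3.sh"])
for command in commands:
    print(" ".join(command))
```

Because J2 and J3 carry the dependency from the start, there is no window in which two ungrouped jobs can both begin running.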
Can you use an ‘on’ dependency on J1 to hold it until all the dependent jobs are submitted?
State the reason why you hold the jobs instead of deleting them immediately. Currently the reader has to infer it from the later bullet about what happens when the job is requeued.
I wouldn’t use the abort record in the accounting log. I’d use the normal D record for delete. Just add another key-value pair on the line saying dependency=runone:9.centos or something like that. We can document that that means the job was deleted due to 9.centos completing. Your message is a little wordy for the accounting log. The accounting log should be more computer-parsing friendly.
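As a rough illustration of why a key-value pair is easier to consume than a sentence, the sketch below parses one accounting line of the form `date time;record_type;job_id;message`. The sample line is made up for this example, and the `dependency=runone:<jobid>` key is only the suggestion from this comment, not an existing PBS field.

```python
def parse_accounting_record(line):
    """Split an accounting line into its four ';'-separated fields and
    turn the space-separated key=value pairs of the message into a dict."""
    timestamp, record_type, job_id, message = line.split(";", 3)
    attrs = dict(pair.split("=", 1) for pair in message.split() if "=" in pair)
    return {"timestamp": timestamp, "type": record_type,
            "job_id": job_id, "attrs": attrs}


def runone_parent(record):
    """Return the job whose completion caused the delete, or None.
    Relies on the hypothetical 'dependency=runone:<jobid>' pair."""
    value = record["attrs"].get("dependency", "")
    kind, _, parent = value.partition(":")
    return parent if kind == "runone" and parent else None


# Hypothetical D record for job 10.centos, deleted because 9.centos completed.
sample = ("04/01/2019 10:15:30;D;10.centos;"
          "requestor=Scheduler@centos dependency=runone:9.centos")
record = parse_accounting_record(sample)
print(record["type"], record["job_id"], runone_parent(record))
```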
I wouldn’t reject a job if its runone dependency is already running. I’d accept it and immediately put a system hold on it. If the running job is requeued, we want the entire dependency chain available to rerun.
Say what happens if the admin releases the system hold on one of the jobs. I expect it’ll be the same as if an admin releases the system hold on any other dependent job. The hold will be released and the job is available to be run.
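The behavior discussed in these comments can be summarized with a toy state model. This is only a sketch of the semantics as described in this thread (hold siblings when one job starts, accept and hold late submissions, release holds on requeue, delete siblings when the running job ends, and let an admin release a hold). The class and method names are invented for illustration and do not correspond to PBS code.

```python
class RunoneGroup:
    """Toy model of one runone dependency group (not PBS code)."""

    def __init__(self):
        # job id -> "queued" | "running" | "held" | "finished" | "deleted"
        self.state = {}

    def submit(self, job):
        # Accept the job even if a sibling is already running,
        # and put a hold on it instead of rejecting it.
        running = any(s == "running" for s in self.state.values())
        self.state[job] = "held" if running else "queued"

    def start(self, job):
        if self.state[job] != "queued":
            raise ValueError(f"{job} is not eligible to run")
        self.state[job] = "running"
        for other, s in self.state.items():
            if other != job and s == "queued":
                self.state[other] = "held"

    def requeue(self, job):
        # The whole group becomes available to run again.
        self.state[job] = "queued"
        for other, s in self.state.items():
            if s == "held":
                self.state[other] = "queued"

    def finish(self, job):
        # Once the running job ends, the remaining siblings are deleted.
        self.state[job] = "finished"
        for other, s in self.state.items():
            if other != job and s in ("held", "queued"):
                self.state[other] = "deleted"

    def release_hold(self, job):
        # Admin release: the job is simply available to run again.
        if self.state[job] == "held":
            self.state[job] = "queued"


group = RunoneGroup()
group.submit("J1")
group.submit("J2")
group.start("J1")
group.submit("J3")  # accepted and held, not rejected
print(group.state)
```

Holding rather than deleting the siblings is what makes the requeue case work: if J1 is requeued, J2 and J3 are still there to compete for resources again.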
I am not sure we can do that, because an ‘on’ dependency is by definition released only after “count dependencies on other jobs have been satisfied”. A job submitted with “on=4” would therefore need 4 jobs to run before its dependency is released. Since this case is about the “runone” dependency, where only one job runs, the “on” dependency would never be met if count is more than 1.
Maybe we do not want users to use the “on” dependency at all, because the purpose of this change is to run the jobs sooner. If the first job they submit starts running right away, they don’t really need to submit more jobs. If I hold the job until the dependencies are met, then we are essentially delaying job startup.
Done
Actually this is not a new accounting log message. This is the exact same message and event that we use when we release the dependency and remove the dependent job. I am not sure changing this message is the right thing to do.
I agree, it would be better to accept the job. I’ve made this change
The way I understand the ‘on’ dependency is that it is a convenience feature (someone correct me if I am wrong). You submit one job with on=4. This will hold the job until you add 4 other dependent jobs to that one, and then it’ll remove all the holds at one time. There is no reason you can’t do the same thing with holds, but it is just annoying. The main use of it now is with the before dependencies. You submit one job with an ‘on’ dependency and then submit a number of jobs that run before it.
The way I was thinking was just the convenience of it all. You submit on=4 and you submit all 4 jobs before the dependencies are released from all of them. Since we are thinking of them as a unit, this would allow them to be submitted as a unit.
Once again, it is a convenience feature, so it is not necessary.
Interesting. Since this is the same record used for other dependencies, we should use it for this one.