New "runone" job dependency

arungrover · February 7, 2020, 10:16pm

Hi All,

I have posted a small design document introducing a new “runone” job dependency, which lets users submit multiple interdependent jobs with different resource requests. PBS ensures that only one of the dependent jobs runs, and that once the running job ends, all the other dependent jobs are deleted.

This lets users submit the same job with different resource requests and have PBS run whichever request it can start soonest.
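As a rough illustration of the behaviour described above, here is a minimal Python sketch of a runone group. It is a toy model, not PBS code: the class and state names are hypothetical, and holding the sibling jobs while one of them runs is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    HELD = "held"
    RUNNING = "running"
    FINISHED = "finished"
    DELETED = "deleted"


@dataclass
class RunOneGroup:
    """Toy model of jobs tied together by a runone dependency (hypothetical)."""
    jobs: dict = field(default_factory=dict)  # job id -> State

    def submit(self, job_id):
        # Assumption: a job that joins while a sibling is running is held.
        running = State.RUNNING in self.jobs.values()
        self.jobs[job_id] = State.HELD if running else State.QUEUED

    def start(self, job_id):
        if self.jobs[job_id] is not State.QUEUED:
            raise ValueError(f"{job_id} is not eligible to run")
        # Only one job of the group may run: hold every queued sibling.
        for other, state in self.jobs.items():
            if other != job_id and state is State.QUEUED:
                self.jobs[other] = State.HELD
        self.jobs[job_id] = State.RUNNING

    def finish(self, job_id):
        if self.jobs[job_id] is not State.RUNNING:
            raise ValueError(f"{job_id} is not running")
        self.jobs[job_id] = State.FINISHED
        # Once the running job ends, the held siblings are deleted.
        for other, state in self.jobs.items():
            if state is State.HELD:
                self.jobs[other] = State.DELETED


group = RunOneGroup()
for job_id in ("J1", "J2", "J3"):
    group.submit(job_id)
group.start("J2")   # the scheduler picks whichever request fits first
group.finish("J2")
print({job: state.value for job, state in group.jobs.items()})
```

In this model J2 stands for the resource request that happened to fit first; J1 and J3 never run and are removed when J2 ends.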

Please have a look at the design and provide comments.

Thanks,
Arun

bhroam · February 8, 2020, 1:11am

Hey @arungrover,
I have a few comments:

I wouldn’t say “Interface 1”. You are describing the whole feature.
Change your example where you submit J1 and J2 before submitting J3 with the dependency. While this will work, it is not how we want to tell users to submit them. If you submit J1 and J2 and then submit the dependent job J3, it’s possible that J1 and J2 start running before you submit J3 to tie them all together. You should submit J1, then J2 with a runone dependency on J1, and then J3 with a runone dependency on J1.
Can you use an ‘on’ dependency on J1 to hold it until all the dependent jobs are submitted?
State the reason why you hold the jobs instead of immediately deleting them. Currently the reader has to infer it from the later bullet about when you requeue the job.
I wouldn’t use the abort record in the accounting log. I’d use the normal D record for delete. Just add another key-value pair on the line saying dependency=runone:9.centos or something like that. We can document that that means the job was deleted due to 9.centos completing. Your message is a little wordy for the accounting log. The accounting log should be more computer-parsing friendly.
I wouldn’t reject a job if its runone dependency is already running. I’d accept it and immediately put a system hold on it. If the running job is requeued, we want the entire dependency chain available to rerun.
Say what happens if the admin releases the system hold on one of the jobs. I expect it’ll be the same as if an admin releases the system hold on any other dependent job. The hold will be released and the job is available to be run.
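The submission order recommended above (J1 first, then J2 and J3 each with a runone dependency on J1) could be scripted roughly as follows. This is only a sketch: `depend=runone:<jobid>` is assumed to follow the usual `qsub -W depend=type:jobid` form, which is the syntax under discussion in this thread, and the script names and job ids are hypothetical.

```python
import subprocess


def qsub_submit(cmd):
    """Run a qsub command and return the job id it prints on stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def submit_runone_group(scripts, submit=qsub_submit):
    """Submit the first script plainly, then tie every later one to it.

    Each later job is linked at submission time, so no two jobs of the
    group can start running before the dependency exists.
    """
    if not scripts:
        return []
    first_id = submit(["qsub", scripts[0]])
    job_ids = [first_id]
    for script in scripts[1:]:
        # Assumed syntax: runone dependency on the first job of the group.
        cmd = ["qsub", "-W", f"depend=runone:{first_id}", script]
        job_ids.append(submit(cmd))
    return job_ids


# Dry run with a stand-in for qsub that only records the commands.
recorded = []


def fake_submit(cmd):
    recorded.append(cmd)
    return f"{len(recorded)}.centos"  # hypothetical job id


ids = submit_runone_group(["j1.sh", "j2.sh", "j3.sh"], submit=fake_submit)
for cmd in recorded:
    print(" ".join(cmd))
```

Passing the submit function in as a parameter keeps the ordering logic testable without a PBS server; with the default `qsub_submit` the same call would invoke the real `qsub`.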

Bhroam

arungrover · February 10, 2020, 10:59pm

Thanks for reviewing the document Bhroam!

Done

I am not sure we can do that, because an ‘on’ dependency is by definition released after “count dependencies on other jobs have been satisfied”. A job submitted with “on=4” would therefore need 4 jobs to run before the dependency is released. Since a “runone” dependency lets only one of the jobs run, the “on” dependency would never be met (if count is more than 1).
Maybe we do not want users to use the “on” dependency at all, because the purpose of this change is to run the jobs sooner. If the first job they submit starts running, they don’t really have to submit more jobs. If I introduce a hold on the job until the dependencies are met, then we are essentially delaying the job startup.
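The counting argument can be written out as a small check. This is only a sketch under the reading of ‘on’ given in this reply (each satisfied dependency requires one of the other jobs to actually run); the function name and parameters are hypothetical.

```python
def on_dependency_can_release(count, group_size, runone):
    """Return True if an on=count dependency could ever be released.

    Assumption: each satisfied dependency needs one job of the group to
    run. With runone, at most one job of the group ever runs.
    """
    jobs_that_can_run = min(group_size, 1) if runone else group_size
    return jobs_that_can_run >= count


print(on_dependency_can_release(4, 4, runone=False))
print(on_dependency_can_release(4, 4, runone=True))
print(on_dependency_can_release(1, 4, runone=True))
```

Under this reading, a count above 1 combined with runone can never be satisfied, which is the concern raised above.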

Done

Actually this is not a new accounting log message. This is the exact same message and event that we use when we release the dependency and remove the dependent job. I am not sure changing this message is the right thing to do.

I agree, it would be better to accept the job. I’ve made this change.

I’ve added this change.

Please have a look at the document again.

Thanks,
Arun

bhroam · February 10, 2020, 11:18pm

The way I understand the ‘on’ dependency is that it is a convenience feature (someone correct me if I am wrong). You submit one job with on=4. This will hold the job until you add 4 other dependent jobs to that one, and then it’ll remove all the holds at one time. There is no reason you can’t do the same thing with holds, but it is just annoying. The main use of it now is with the before dependencies. You submit one job with an ‘on’ dependency and then submit a number of jobs that run before it.

I was thinking only of the convenience of it all. You submit on=4, and you submit all 4 jobs before the dependencies are released from all of them. Since we are thinking of them as a unit, this would allow them to be submitted as a unit.

Once again, it is a convenience feature, so it is not necessary.

Interesting. Since this is the same record used for other dependencies, we should use it for this one.

Bhroam
