Add support in PTL to speed up deletion of a large number of jobs

We have a couple of PTL performance tests that run a large number of jobs. The tearDown() of these tests, or the setUp() of subsequent tests, cleans up by deleting these jobs using qdel. This qdel operation takes very long for the reasons mentioned in PP-439, and consequently these tests time out.
Here are some possible solutions to this issue from the PTL side:

  1. PTL tests turn off scheduling before qdel. This is beneficial, but when most jobs are in the 'running' state, turning scheduling off before qdel alone did not give improvements.

  2. In the qdel operation, most of the time goes into server<=>MoM interactions while killing the job processes. One way to address this could be a custom PTL function, say cleanjobs_for_perf_tests(), that does the following for each job:

  1. Get the pid from the job's session_id attribute
  2. Kill the process (kill -9)
  3. Clean up this job's contents in the mom_priv directory
  4. Delete the job from the server using qdel -Wforce

This function should be used only in the tearDown() of tests that deal with a huge number of jobs. It should not be used in tests that exercise qdel or related features.
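To make this concrete, here is a rough sketch of what such a function might look like. The function name comes from the proposal above; the PTL calls used here (server.status(), du.run_cmd(), du.rm()) and the single-host assumption are mine, and error handling is omitted:

```python
import glob
import os

def cleanjobs_for_perf_tests(self):
    """Rough sketch of the proposed fast cleanup; not a final interface."""
    for job in self.server.status(JOB):
        # 1. Get the session id (the session leader's pid) from the
        #    job's session_id attribute.
        sid = job.get('session_id')
        if sid:
            # 2. kill -9 every process in the job's session.
            self.du.run_cmd(cmd=['pkill', '-9', '-s', sid], sudo=True)
        # 3. Clean up this job's files under mom_priv (assumes the test
        #    runs on the same host as the mom).
        jobs_dir = os.path.join(self.mom.pbs_conf['PBS_HOME'],
                                'mom_priv', 'jobs')
        for path in glob.glob(os.path.join(jobs_dir, job['id'] + '*')):
            self.du.rm(path=path, recursive=True, force=True, sudo=True)
        # 4. Force-delete the job from the server.
        self.du.run_cmd(cmd=['qdel', '-Wforce', job['id']])
```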

I experimented with this solution on one of the perf tests and it brought down the qdel time considerably.

Please give your suggestions/comments on this approach.

This might be a dumb idea, but since this is for PTL, and for cases where we want to revert the system to defaults: how about stopping PBS, deleting PBS_HOME, and starting it back up? This should delete all jobs and revert PBS to its default configuration, and it should take only a few seconds.

Thanks @lsubramanian. Your approach looks good to me.

Hey @lsubramanian
I like most of what you are saying, but I worry about step 2.3. You are not using a supported interface here. If at some point in the future we change how jobs are stored on the mom (e.g. we move them into the database), this step will fail.

You are also cleaning up something out from underneath mom. You don’t know how mom will react to this when it is doing its own cleanup (e.g. calling end hooks).

What I would suggest doing is to just skip 2.3. Once all the processes are dead, the mom is going to clean up the job in the right way. By the time the mom reports back to the server, we’ll probably have already done the qdel -Wforce and the server will tell the mom to dump the job. Even if we haven’t done the qdel, the server will tell the mom to start end of job processing. At some point during that, the qdel will happen and the server will tell the mom to dump the job then.

As a note, step 1 (turning scheduling off) is important for another reason. A scheduling cycle will start during the job deletion, and if you have a significant number of jobs, this cycle can take quite a while. Not only this, the server is restarted during revert_to_defaults(). This means the scheduler could be in cycle, connected to a dead server, at the start of the subsequent test. Since the server doesn't have an active connection to the scheduler, it won't know the scheduler is in cycle and will try to talk to it. It is just a mess. By turning off scheduling before you delete the jobs, this cycle won't happen.
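In PTL terms, that is just a pair of manager() calls around the deletion (a minimal sketch, assuming the standard Server.manager() interface):

```python
# Disable scheduling so no cycle starts while the jobs are being deleted.
self.server.manager(MGR_CMD_SET, SERVER, {'scheduling': 'False'})

# ... delete the jobs here ...

# Re-enable scheduling once the queue is empty.
self.server.manager(MGR_CMD_SET, SERVER, {'scheduling': 'True'})
```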

@agrawalravi90 Your approach is interesting. It moves all of the reverting to defaults out of PTL's hands and into pbs_habitat's hands. This is a script that is required for PBS to run properly, so it could work. Although, for the same reason I don't like step 2.3, I'm not sure how much I like this: you're doing a non-standard operation, and the server or mom might react badly to the home directory going away. It'll probably be fine, though. The daemon processes go down, we delete home, we use the init script to start back up, and the init script runs habitat to recreate home.

This does cause issues with our previously discussed project to have PTL run with the current state of the system, since this method can only revert back to the defaults. It's a bit of a huge-hammer approach.

Bhroam

Thanks @agrawalravi90, @bhroam and @anamika for adding your thoughts here!

@bhroam, I agree and understand that the 'manual cleanup of job processes' approach does not use a supported interface and is not a reliable long-term solution.
@agrawalravi90, I am also not fully convinced by the '$PBS_HOME dir deletion' approach, the downside being that we lose all the daemon logs. Even if one used the '--post-data-analysis' switch with pbs_benchpress, server and sched changes could be preserved only on test failures.
From the test execution point of view, I wonder whether losing the daemon logs is a good trade-off for this cleanup.

So the only thing I could conclude is: turn scheduling off before job deletion and continue using qdel
(not qdel -Wforce, as it only removes jobs from the server queue; job processes are not guaranteed to be removed from the system. If job processes remain, pbs_mom fails to restart in the setUp() of the test that immediately follows, causing test failures).

@lsubramanian I'm not opposed to killing the job processes; I was opposed to deleting the job files from the mom. If you kill all the processes, mom will notice the job has finished and clean up herself. This, combined with a qdel -Wforce, should stop normal end-of-job processing and speed things up. Something like the sketch below.
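A minimal sketch of this revised flow, i.e. the original proposal minus step 2.3 (the helper name and the exact PTL calls are assumptions, not a final interface):

```python
def kill_procs_and_force_delete(self):
    """Sketch: kill job processes, let mom clean up, then qdel -Wforce."""
    jobs = self.server.status(JOB)
    for job in jobs:
        sid = job.get('session_id')
        if sid:
            # Killing the whole session lets mom detect that the job has
            # finished and clean up mom_priv herself.
            self.du.run_cmd(cmd=['pkill', '-9', '-s', sid], sudo=True)
    # A single forced delete skips normal end-of-job processing.
    jids = [j['id'] for j in jobs]
    if jids:
        self.du.run_cmd(cmd=['qdel', '-Wforce'] + jids)
```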

Bhroam

@bhroam, thanks for clarifying this again. I had wrongly read that you wanted to skip both items 2 and 3 in the original post (instead of just 2.3). I get your point now. I experimented around this and, as you mentioned, on killing the processes mom cleaned up the job dirs herself. This improved performance.

@bhroam and @lsubramanian thanks for entertaining my crazy idea :slight_smile: I had suggested it for situations where we don't care about any past data or state and just want to revert PBS quickly (I do think this might be the fastest way to do it). I tried it out on my system and it didn't seem to cause any issues, but yeah, it is quite non-standard. I wonder if we could add something in PBS to "refresh" itself, but it's probably not something that any customer would want.

I do like the improvements you are suggesting overall. Thanks for taking this up!

Hi All,
Please take a look at the design in https://pbspro.atlassian.net/wiki/spaces/PD/pages/1049821187/Support+in+PTL+for+deletion+of+large+number+of+jobs.

Thanks Latha, the content looks good. I suggest using a design format like the following for both the cleanup_large_num_jobs and cleanup_jobs changes.

Interface: cleanup_large_num_jobs(job_ids=None, runas=None)
Visibility: Private
Change Control: Stable
Synopsis: Delete a large number of jobs. Will be called from cleanup_jobs if the number of jobs in the queue is more than 100.
Details:

  • This function will get the process ids of the running jobs and kill them manually. It will then delete the jobs from the server using 'qdel -Wforce'.

Thanks @anamika. I have made changes to the design as requested. Please take a look. Thanks!

Thanks @lsubramanian. I do not see anywhere that you have mentioned turning scheduling off before deleting the jobs and then turning it back on.

For the existing interface, the changes are:
cleanup_jobs(extend=None, runas=None)
Synopsis: Updated to handle deletion of a large number of jobs.
Details: The method is now updated to delete a large number of jobs.

  • If the number of jobs in the queue is 100 or fewer, it uses qdel; if there are more than 100, it calls _cleanup_large_num_jobs().
  • Also, scheduling will be turned off before job deletion and turned back on before exiting, as in the sketch below.
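Roughly, the updated flow could look like the sketch below (the 100-job threshold comes from this design; delete() and _cleanup_large_num_jobs() are the names from this thread, the rest is assumed):

```python
def cleanup_jobs(self, extend=None, runas=None):
    """Sketch of the updated flow only; not the final implementation."""
    # Turn scheduling off so no cycle starts mid-deletion.
    self.manager(MGR_CMD_SET, SERVER, {'scheduling': 'False'})
    try:
        jobs = self.status(JOB)
        if len(jobs) > 100:
            # Large workloads: kill job processes, then qdel -Wforce.
            self._cleanup_large_num_jobs(runas=runas)
        elif jobs:
            self.delete(id=[j['id'] for j in jobs],
                        extend=extend, runas=runas)
    finally:
        # Turn scheduling back on before exiting.
        self.manager(MGR_CMD_SET, SERVER, {'scheduling': 'True'})
```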

Similarly, update tearDown as well. There is no need to explain what is already in the function; you can point to the existing documentation at https://www.pbspro.org/ptldocs/index.html