PP-1287: do not purge moved job from history before the job is finished

Hello,

As requested in the PR, I have written the design doc for the issue.

My suggestion is to purge the M job only after the job has actually finished on the target server.

First, the current behavior does not make sense: the M job is removed before the job has actually finished. I understand that the job is still trackable through the tracking file, but we should be able to qstat both the M job and the job on the target server during the whole job life cycle. For example, we implemented a web application that informs users about their jobs via the API, but we are forced to ignore M jobs entirely because we cannot rely on the M job still being present in the system.
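To illustrate what I mean (the job ID and host names below are placeholders), both of these should keep working for the whole life cycle of the job:

qstat -x <moved jobid>                 # on the source server: the M record
qstat <moved jobid>@<target server>    # on the target server: the actual job

Today the first command stops working once the M record is purged, even though the job may still be running on the target server.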

Second, there is a bug with dependencies: as long as the job is still running, you should be able to submit a new dependent job with -W depend=afterok:<moved jobid>. If you try to add this dependency after job_history_duration has elapsed, the <moved jobid> has already been purged and is unknown to the dependency.
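As a concrete example (the script name is a placeholder), this submission should succeed as long as the job is still running somewhere:

qsub -W depend=afterok:<moved jobid> follow_up.sh

Once job_history_duration has elapsed on the source server, the same command is rejected because <moved jobid> is no longer known there, even though the job has not finished yet.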

One imperfection in this solution is that the M job is not stored at all when job_history_enable = False. I think the M jobs should remain part of the history, but I am not sure how to deal with that case.

Please, provide your comments.

Vasek

@vchlum Document looks good to me.
It may not need to go in the design doc, but I think that with this change in place, we should also add a check to purge the moved job at the point where we set its substate to finished, provided it has already exceeded the job history duration. Otherwise it may linger around until the next job history work task runs.
What do you think?

The history work task runs every two minutes, so I would not consider the extra check strictly necessary, but it does seem like a good idea. OK, I agree.

I also have a question concerning the PTL test. I am not able to start the PTL test with two servers. I suppose the correct syntax to start it is 'pbs_benchpress -p servers=<server1,server2> …', but only one server is available:

self.logger.info("servers: " + str(self.servers.keys()))

shows:

2018-07-31 13:27:31,087 INFO servers: ['took27']

Using 'pbs_benchpress -p servers=server1:server2 …', server2 is added as a node to server1.

Are multiple servers supported in PTL?

@vchlum Can you please try with self.servers.host_keys()? If you still face the problem, then please provide the exact pbs_benchpress command and its output (i.e. the -o file).

I agree with your assessment that the change is not strictly necessary, given that the history task runs every 2 minutes.

For your PTL test, I think you might have found a bug in the PTL framework. If you see both servers showing up as nodes, then please try it like this:
-p servers=s1:s2,moms=s1
This will make sure that the second server is not considered a node.

Thank you @arungrover and @hirenvadalia. The workaround '-p servers=s1:s2,moms=s1' works like a charm.
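For anyone who hits the same problem, the full invocation looks roughly like this (the host names and test name are placeholders, not my actual setup):

pbs_benchpress -p servers=hostA:hostB,moms=hostA -t <TestSuite.test_case> -o ptl.log

With servers=hostA:hostB alone, PTL treated the second server as a node of the first; adding moms=hostA avoids that.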