New job attribute to measure a job's total wait time

Hi,

I’m proposing a new job attribute which can measure a job’s total wait time:
https://pbspro.atlassian.net/wiki/spaces/PD/pages/900333574/New+job+attribute+to+measure+a+job+s+total+wait+time

Some background:
Today, there’s no good way to measure the total time that a job waited in the Q state throughout its lifetime. You can compute a job’s start time minus its queue time, but that is inaccurate if the job was preempted or qmove’d. Eligible time is also inaccurate: it stops accruing for jobs that are over their limits, and it is an optional feature that many sites might not turn on. So a new job attribute that measures the total time a job waited in the Q state would be useful for statistical and accounting purposes.
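
To make the gap concrete, here is a minimal Python sketch that compares start time minus queue time with the summed time in the Q state for a job that is preempted once. The event list, the function names, and the timestamps are all hypothetical, made up for illustration; this is not how PBS stores job history.

```python
# Hypothetical job history: (timestamp in seconds, new state) pairs.
# Q = queued, R = running, F = finished. Not a real PBS data format.
events = [(0, "Q"), (100, "R"), (400, "Q"), (900, "R"), (1500, "F")]


def naive_wait(events):
    """First start time minus queue time."""
    queued_at = events[0][0]
    first_start = next(t for t, state in events if state == "R")
    return first_start - queued_at


def total_wait(events):
    """Sum of every interval the job spent in the Q state."""
    total = 0
    # Pair each event with the next one to get the length of each interval.
    for (t, state), (t_next, _) in zip(events, events[1:]):
        if state == "Q":
            total += t_next - t
    return total


print(naive_wait(events))  # 100
print(total_wait(events))  # 600
```

The second stretch in Q (from 400 to 900) is invisible to the simple subtraction, which is the inaccuracy described above.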

Please review the design and provide feedback.

Thanks!
Ravi

Hey @agrawalravi90
Thanks for writing this up. My only comment is about time in the scheduler suspended state. If a job is preempted via suspension, shouldn’t that be considered as wait time? I wouldn’t consider time in the user suspended state as wait time. That’s more like being held.

Now that we’re talking about another state, you might want to state explicitly which other states do not count toward wait time (including the waiting state).

Bhroam

Hi @agrawalravi90, who will consume this time and how?

This will be a very useful measure to have available for accounting purposes. That said, a few thoughts/questions:

(1) How often is the wait_time accrual updated?
(2) It’s not clear what the right behavior is for preempted jobs – is that like held (not accruing wait_time) or is that like queued (accruing wait_time)?
(3) If a job is moved between servers, and we now have two job records, does the record on the target server inherit the original server’s record’s wait_time, or start from 0?
(4) If a job is requeued, the wait_time accrual should continue where it left off.
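
One way to frame questions (2) through (4) is as parameters of the accrual rule. A rough Python sketch follows; the state names (including `S_sched` for scheduler suspension), the `counted_states` set, and the `carried_over` argument are assumptions made for illustration and are not part of the proposed design.

```python
def accrue_wait(events, counted_states, carried_over=0):
    """Sum time spent in counted_states, starting from carried_over.

    events: hypothetical (timestamp, state) pairs in time order.
    counted_states: states that accrue wait time under the chosen policy.
    carried_over: wait time inherited from an earlier record, e.g. after
        a move between servers (0 means the new record starts fresh).
    """
    total = carried_over
    for (t, state), (t_next, _) in zip(events, events[1:]):
        if state in counted_states:
            total += t_next - t
    return total


# Hypothetical history: queued, run, suspended by the scheduler, resumed, done.
history = [(0, "Q"), (50, "R"), (200, "S_sched"), (500, "R"), (800, "F")]

print(accrue_wait(history, {"Q"}))                    # 50
print(accrue_wait(history, {"Q", "S_sched"}))         # 350
print(accrue_wait(history, {"Q"}, carried_over=120))  # 170
```

Because the total is a sum over intervals, a requeued job picks up where it left off as soon as it re-enters a counted state. Whether preemption counts, and whether a moved job inherits its earlier total, then come down to the choice of `counted_states` and `carried_over`.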

More generally, I see value in being able to pull out figures for three types of wait time:
(1) trivially, start time minus queue time
(2) eligible_time
(3) wait_time as proposed here

Depending on site and policy, any of the three could be of greater interest, so I’m glad we are getting attention on this.

Thanks for your feedback, @bhroam, @anamika and @mbonyak. I clearly didn’t think this feature through. I’ll circle back after reviewing my approach.