I wanted to start a brainstorming discussion on how to fix the problem highlighted in the title. I have been investigating the slowness of the scheduling cycle in PBS.
In my setup, a sched cycle takes around 3 minutes to run 50k jobs in real PBS. I found that sched was spending the majority of its time just waiting for an ACK from the server: 161 seconds out of a total of 172 seconds, which is about 94% of the cycle. So I thought of removing the wait (https://github.com/PBSPro/pbspro/pull/1597), but realized that the scheduler relies on this reply when a runjob hook rejects a job, so that it can free up that job's resources for other jobs. Otherwise, those resources might get booked by the same job every cycle and cause under-utilization.
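For reference, here is the arithmetic behind those figures as a small Python snippet. The per-job average is my own derived number, and it assumes the ACK wait is spread evenly across all 50k jobs, which may not hold in practice.

```python
# Figures quoted above from one measured scheduling cycle.
TOTAL_CYCLE_S = 172   # total cycle time in seconds
ACK_WAIT_S = 161      # time spent waiting for server ACKs, in seconds
JOBS = 50_000         # jobs run in that cycle

# Share of the cycle spent waiting for ACKs.
ack_fraction = ACK_WAIT_S / TOTAL_CYCLE_S

# What is left if the wait disappeared entirely (simple subtraction,
# not a measurement of a real cycle without the wait).
remaining_s = TOTAL_CYCLE_S - ACK_WAIT_S

# Assumption: the wait is spread evenly over all jobs.
avg_wait_ms_per_job = ACK_WAIT_S / JOBS * 1000

print(f"ACK wait share: {ack_fraction:.1%}")
print(f"Cycle time without the ACK wait: {remaining_s} s")
print(f"Average ACK wait per job: {avg_wait_ms_per_job:.2f} ms")
```

So even a few milliseconds of round trip per job adds up to minutes at this job count.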
So, I’m hoping that we can come up with a way to work around this. A few possible options:
1) Add a sched attribute that tells the scheduler whether or not to care about runjob hooks, so that it won’t wait for an ACK if it’s been told not to care about them.
2) Penalize such jobs so that the scheduler gives priority to other jobs and those resources get used.
3) Mark such jobs as Held for one or a few cycles so that those resources get consumed by other jobs. Since cycles will be much faster, such jobs might not have to wait that long.
4) Make runjob a scheduler hook instead of a server hook.
Option 4) might intuitively make the most sense, but it’s going to be a lot of work. Option 1) seems the safest of the rest, but it’ll restrict users who want a faster sched and also use a runjob hook. 2) and 3) both penalize the job, but 3) seems better: the penalty is just a delay in scheduling, which might not be that much if sched cycles take just 12 seconds instead of 3 minutes. Let me know your thoughts on these, or if you can think of other ways to solve this.
Also, I’d like to know some numbers on how users use runjob hooks. Do they, on average, reject more than 10% of the jobs that sched asks the server to run? More than 50%? Even a rough estimate will help test the solutions out.
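To show why the rejection rate matters, here is a rough toy model in Python of the "hold for a few cycles" idea. This is not PBS code: every name in it (`simulate`, `random_rejects`, `hold_cycles`, and so on) is made up for illustration. It assumes that each cycle offers a fixed number of free slots, that the hook rejects the same jobs every time, and that the scheduler learns about a rejection before the next cycle starts without blocking on the ACK.

```python
import random


def simulate(num_jobs, slots, rejected, hold_cycles, cycles):
    """Toy model: return how many jobs start over `cycles` cycles.

    rejected    -- set of job ids the (hypothetical) runjob hook always rejects
    hold_cycles -- cycles a rejected job is skipped; 0 means no hold at all
    """
    queue = list(range(num_jobs))  # priority order, highest first
    held_until = {}                # job id -> first cycle it is eligible again
    started = 0
    for cycle in range(cycles):
        # Jobs currently on hold are invisible to this cycle.
        eligible = [j for j in queue if held_until.get(j, 0) <= cycle]
        for job in eligible[:slots]:
            if job in rejected:
                # The slot booked for this job is wasted in this cycle.
                if hold_cycles > 0:
                    held_until[job] = cycle + 1 + hold_cycles
            else:
                queue.remove(job)
                started += 1
    return started


def random_rejects(num_jobs, rate, seed=0):
    """Pick a reproducible random set of always-rejected jobs."""
    rng = random.Random(seed)
    return {j for j in range(num_jobs) if rng.random() < rate}


# Compare no hold against a 3-cycle hold at two guessed rejection rates.
for rate in (0.10, 0.50):
    rejects = random_rejects(1000, rate)
    no_hold = simulate(1000, 50, rejects, hold_cycles=0, cycles=10)
    with_hold = simulate(1000, 50, rejects, hold_cycles=3, cycles=10)
    print(f"reject rate {rate:.0%}: no hold={no_hold}, 3-cycle hold={with_hold}")
```

In the extreme case where the two highest-priority jobs are always rejected and there are only two slots, the model starts nothing without a hold, because the same two jobs re-book both slots every cycle. Real numbers on rejection rates would tell us how close actual sites are to that case.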