Job state is running, but qstat -f comment = Not Running: Strict fifo order

Hi all!
I am new to OpenPBS and have run into a problem.
I searched the internet but didn’t find a solution.
I have a cluster with 30 nodes and two queues, named Big and Batch.
Sometimes when I check the status of running jobs, I see that all jobs from the Batch queue show the comment “Not Running: Strict fifo order”. But these jobs have already started and are using all available nodes, so new jobs can’t start. The only way to resolve this is to qdel these jobs; then new jobs start normally.

current sched_config:
round_robin: False all
by_queue: True prime
by_queue: True non_prime
strict_fifo: True ALL
fair_share: false ALL
help_starving_jobs false ALL
sort_queues true ALL
load_balancing: false ALL
log_filter: 256
dedicated_prefix: ded
max_starve: 24:00:00

Please help me to solve this issue!

First off, strict_fifo has long been deprecated. Please use strict_ordering instead.

The message you are seeing is because you have strict_fifo on without backfill. This will truly give you strict FIFO scheduling, but at a major cost in utilization. It basically means that when a job cannot run, scheduling stops: no other job can run until that job has run.
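To make the "scheduling stops" behavior concrete, here is a minimal toy model in Python. It is not the PBS scheduler, only a sketch: the function name `strict_fifo_cycle`, the job names, and the node counts are all made up for illustration.

```python
def strict_fifo_cycle(free_nodes, queue):
    """Toy model of one scheduling cycle with strict FIFO and no backfill.

    queue is a list of (job_name, nodes_needed) in submission order.
    Returns (started, blocked): the jobs started this cycle and the jobs
    left queued because an earlier job could not run.
    """
    started, blocked = [], []
    for i, (name, need) in enumerate(queue):
        if need <= free_nodes:
            free_nodes -= need
            started.append(name)
        else:
            # The first job that cannot run ends the cycle; every job
            # behind it stays queued, even if it would fit right now.
            blocked = [job for job, _ in queue[i:]]
            break
    return started, blocked


# Hypothetical 30-node cluster and queue contents.
queue = [("A", 20), ("B", 16), ("C", 4), ("D", 2)]
started, blocked = strict_fifo_cycle(30, queue)
print(started)  # ['A']
print(blocked)  # ['B', 'C', 'D']
```

Jobs C and D would fit in the 10 nodes left idle after A starts, but they wait behind B. That idle capacity is the utilization cost.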

You should see this message on all the jobs after the running one. If you are seeing this message on a running job, that is strange. Was the job run via qrun -H? If so, the scheduler doesn’t get involved with running the job. Otherwise, it should have a message about where and when the job was run.

I highly suggest you look into backfilling. This lets the scheduler calculate when the top jobs can run and fit smaller jobs in front of them without changing their start times. This really helps with lost utilization. To do this, you set backfill_depth on the sched object via qmgr (qmgr -c 's sched backfill_depth=N'). You set it to the number of top jobs. The higher you set this number, the better the scheduler will do at preserving the strict order, but the slower the cycles will become. With 30 nodes, you should be able to set it pretty high.
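The idea of backfilling around a top job can be sketched with a similar toy model. This is an illustration only, not how PBS implements it: it tracks a single top job (roughly a backfill depth of 1), and it is deliberately conservative in that a later job is backfilled only if it finishes before the top job's estimated start. The function name, job names, node counts, and walltimes are all hypothetical.

```python
def backfill_cycle(total_nodes, running, queue):
    """Toy model of strict ordering with backfill around one top job.

    running: list of (nodes, remaining_hours) for jobs already running.
    queue:   list of (job_name, nodes_needed, walltime_hours) in order.
    Returns (started, top_job, top_start_hours).
    """
    free = total_nodes - sum(nodes for nodes, _ in running)
    running = list(running)
    started = []

    # Start jobs in order for as long as they fit.
    i = 0
    while i < len(queue) and queue[i][1] <= free:
        name, need, wall = queue[i]
        free -= need
        running.append((need, wall))
        started.append(name)
        i += 1
    if i == len(queue):
        return started, None, None

    # The first job that does not fit becomes the top job. Estimate its
    # start time by releasing running jobs in order of completion.
    top_name, top_need, _ = queue[i]
    avail, top_start = free, None
    for nodes, remaining in sorted(running, key=lambda r: r[1]):
        avail += nodes
        if avail >= top_need:
            top_start = remaining
            break
    if top_start is None:
        # The top job is larger than the whole cluster in this model.
        return started, top_name, None

    # Backfill: a later job may start now only if it fits in the free
    # nodes and finishes before the top job is due to start.
    for name, need, wall in queue[i + 1:]:
        if need <= free and wall <= top_start:
            free -= need
            started.append(name)
    return started, top_name, top_start


# Hypothetical state: one 20-node job with 5 hours left on 30 nodes.
running = [(20, 5)]
queue = [("B", 16, 10), ("C", 4, 3), ("D", 2, 8)]
started, top_job, top_start = backfill_cycle(30, running, queue)
print(started, top_job, top_start)  # ['C'] B 5
```

Here C runs in the gap because it finishes within 3 hours, before B's estimated start at hour 5. D is held back by this conservative rule because its 8-hour walltime overlaps B's start; a real scheduler can be smarter about nodes the top job will not need.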


Please see section 4.9.20, “FIFO Scheduling”, in the PBS Professional Administrator’s Guide, especially the subsection “FIFO with Strict Ordering and Backfilling”.


Thanks for such detailed feedback!
It turns out our company is using Torque; I was confused by the caption “OpenPBS” in sched_config…
We are considering migrating to OpenPBS, as it is a more flexible batch system.