Running jobs become zombie jobs

After all nodes in a PBS cluster suddenly lost power and were restarted, jobs that had been in the R (running) state stayed in the R state once the cluster recovered. However, the compute nodes no longer had any processes associated with those jobs, which makes them zombie jobs. The mom_logs show a failure to read the .JB file, or no job information at all. How should I address this? These jobs occupy my node resources and prevent subsequent jobs from running. They should either continue running or exit with a failure, so that the resources are freed for the queued jobs.

If I don’t notice this and deal with it in time, the subsequent queued jobs remain unable to run.

Some content in mom_logs:
;0001;pbs_mom;Svr;pbs_mom;proc_get btime, fscanf failed. ERR : Inappropriate ioctl for device

;0001;pbs_mom;Svr;pbs_mom;job_recov_fs, error reading fixed portion of /var/spool/pbs/mom_priv/jobs/3001.vm002.JB

There are many layers of the OS/HPC/application/services stack that need to be loaded when the system boots after power is restored. Also, the PBS server and the PBS nodes should come back up in the same state and order as before so that the jobs in the queued state can run. When a node is rebooted, the processes attached to its jobs are lost; here, both the server and the nodes had a power interruption.

Checking mom_logs shows a failure to read the .JB file or no job information. How should I address this?

>> The PBS server had this job information in its database; when the power was interrupted, the state of the jobs was stored there. When the server came back up, the jobs were recovered from that data. On the compute nodes, however, the .JB files were corrupted, lost, or otherwise not accessible, so the jobs remain stuck on the server. You can remove them with qdel or qdel -Wforce. Otherwise the server keeps assuming the nodes are running these jobs, because the jobs were never cleaned up or exited properly, and those nodes might not be available to run other jobs.
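
If there are many such jobs, a small script can pull the affected job IDs out of the mom_logs instead of reading them by hand. The following is only a sketch: it assumes the error line looks exactly like the job_recov_fs line quoted above, and the function names (unrecovered_job_ids, force_delete_commands) are made up for illustration. It only builds the qdel -Wforce command lines; it does not run them.

```python
import re

# Matches the recovery failure quoted from mom_logs and captures the job ID
# from the .JB file name (for example "3001.vm002").
# Assumption: the message format is the same as in the excerpt above.
JB_ERROR = re.compile(
    r"job_recov_fs, error reading fixed portion of \S*/([^/\s]+)\.JB\b"
)


def unrecovered_job_ids(log_lines):
    """Return job IDs whose .JB file pbs_mom failed to read, in first-seen order."""
    seen = []
    for line in log_lines:
        match = JB_ERROR.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def force_delete_commands(job_ids):
    """Build one 'qdel -Wforce <jobid>' argument list per job (nothing is executed)."""
    return [["qdel", "-Wforce", job_id] for job_id in job_ids]
```

Feed it the lines of a mom_logs file on the affected node, review the resulting list, and then run the commands yourself (or pass each list to subprocess.run) on a host where qdel is available.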

Hope this helps

How do I resume the running jobs after all nodes are restarted?

These are the options:

  1. qrerun the job
  2. Delete the job and resubmit it
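
Either option can be scripted once you know which jobs are still shown as R. Here is a rough sketch that assumes the default column layout of qstat (job ID first, then state and queue as the last two columns); the helper names are hypothetical. It builds qrerun or qdel command lines but does not run them, and resubmitting after a delete still has to be done with your original job script. Note that qrerun only applies to jobs that are rerunnable.

```python
def running_job_ids(qstat_output):
    """Return the IDs of jobs whose state column is R in default qstat output."""
    job_ids = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Assumption: data rows end with "<state> <queue>" and start with the
        # job ID. Header and separator rows never have "R" in that position.
        if len(fields) >= 6 and fields[-2] == "R":
            job_ids.append(fields[0])
    return job_ids


def recovery_commands(job_ids, rerun=True):
    """Build 'qrerun <jobid>' (option 1) or 'qdel <jobid>' (option 2) argument lists."""
    tool = "qrerun" if rerun else "qdel"
    return [[tool, job_id] for job_id in job_ids]
```

After a full power loss every job still listed as R is suspect, but it is worth checking the list against the nodes before running anything, since a job that started after the reboot would also show up as R.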