Running jobs become zombie jobs

wakaka · September 30, 2025, 10:19am

After all nodes in a PBS cluster suddenly lost power and were restarted, jobs that had been in the R (running) state in the queue remained in the R state once the cluster recovered. However, the compute nodes no longer had any processes associated with those jobs, which left them as zombie jobs. Checking mom_logs shows either a failure to read the .JB file or no job information at all. How should I address this? These jobs occupy my node resources and prevent subsequent jobs from running. They should either continue running or exit with a failure, freeing resources for the queued jobs.

Unless I notice these jobs and clean them up by hand, which I cannot always do in time, the queued jobs behind them remain unrunnable.

Some content in mom_logs:
;0001;pbs_mom;Svr;pbs_mom;proc_get btime, fscanf failed. ERR : Inappropriate ioctl for device
…
;0001;pbs_mom;Svr;pbs_mom;job_recov_fs, error reading fixed portion of /var/spool/pbs/mom_priv/jobs/3001.vm002.JB
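To get a list of affected jobs out of mom_logs, something like the following Python sketch may help. It only relies on the job_recov_fs message text shown above, and it assumes the .JB file name without its extension is the job ID (here 3001.vm002). The function name unreadable_job_ids is made up for this illustration and is not part of PBS.

```python
import re

# Matches the job_recov_fs error quoted above and captures the file stem
# before ".JB". Assumption: that stem is the PBS job ID.
JB_ERROR = re.compile(
    r"job_recov_fs, error reading fixed portion of \S*/(?P<job_id>[^/\s]+)\.JB\b"
)


def unreadable_job_ids(log_lines):
    """Return job IDs whose .JB file pbs_mom could not read, in first-seen order."""
    seen = []
    for line in log_lines:
        match = JB_ERROR.search(line)
        if match and match.group("job_id") not in seen:
            seen.append(match.group("job_id"))
    return seen
```

Feeding it the lines of a mom_logs file (for example via open(path)) yields each affected job ID once, even if the error is logged repeatedly.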

adarsh · September 30, 2025, 6:30pm

There are many layers of the OS/HPC/application/services stack that need to be loaded when the system boots after power is restored. Also, the PBS server and the PBS nodes should come back up in the same state and order as before in order to run the jobs that are in the queued state. When a node is rebooted, the processes connected to its jobs are lost; here, both the server and the nodes had a power interruption.

Checking mom_logs shows a failure to read the .JB file or no job information. How should I address this?

>> The PBS server had this job information in its database, so the state of the jobs was already stored when the power was interrupted. When the server came back up, the jobs were recovered. On the compute nodes, however, the .JB files were corrupted, lost, or otherwise not accessible, so the jobs remain stuck on the server. You can qdel or qdel -Wforce these jobs. Otherwise the server will keep thinking the nodes are running them, because there was no proper cleanup or exit, and these nodes might not be available to run other jobs.
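If there are many stuck jobs, the qdel -Wforce step can be scripted. Below is a minimal Python sketch, not an official PBS tool: the helper names force_delete_commands and run_commands are invented here, and the job IDs have to come from your own check of which R-state jobs no longer have processes. It defaults to a dry run so nothing is deleted until you review the printed commands.

```python
import subprocess


def force_delete_commands(job_ids):
    """Build one 'qdel -Wforce <job_id>' argument list per stuck job."""
    return [["qdel", "-Wforce", job_id] for job_id in job_ids]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False.

    Returns the exit statuses of the commands that were actually run.
    """
    statuses = []
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            result = subprocess.run(cmd, capture_output=True, text=True)
            statuses.append(result.returncode)
    return statuses
```

Calling run_commands(force_delete_commands(ids)) only prints the commands; pass dry_run=False on a host where qdel is on the PATH to actually issue them.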

Hope this helps

wakaka · October 9, 2025, 1:34am

How do I resume the running jobs after all nodes are restarted?

adarsh · October 10, 2025, 7:04am

These are the options:

qrerun the job
delete the job and resubmit it
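The two options can be combined: try qrerun first and fall back to deleting the job so it can be resubmitted. The Python sketch below illustrates that idea under a few assumptions: qrerun returns a non-zero exit status when it cannot requeue a job, the fallback uses the qdel -Wforce mentioned earlier in this thread, and resubmission itself is left to you because the original job script is not known here. The function name requeue_or_delete and the run parameter are hypothetical.

```python
import subprocess


def _run(cmd):
    """Run a command and return its exit status."""
    return subprocess.run(cmd, capture_output=True, text=True).returncode


def requeue_or_delete(job_ids, run=_run):
    """Try qrerun on each job; force-delete the ones that cannot be requeued.

    Returns (requeued, to_resubmit): job IDs that went back to the queue,
    and job IDs that were deleted and still need a manual resubmit.
    """
    requeued, to_resubmit = [], []
    for job_id in job_ids:
        if run(["qrerun", job_id]) == 0:
            requeued.append(job_id)
        else:
            # Assumption: a failed qrerun means the job must be removed.
            run(["qdel", "-Wforce", job_id])
            to_resubmit.append(job_id)
    return requeued, to_resubmit
```

The run argument exists so the logic can be tried with a stand-in function before pointing it at a real cluster.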
