Can't use resources ngpus

When I submit a job with qsub, it stays in the Q state indefinitely, and I can’t find the cause.


The status (systemctl status pbs) of all PBS nodes shows no problems. The server_log shows:

pbsnodes -aSj:

I can’t unset resources_available.ngpus on node e006; it always shows 1/1.
Even when there are no jobs in PBS, I can’t delete the ngpus resource either. This is difficult. How can I fix my PBS setup?

From the above output, it seems that:

  1. The vnode state is not correct, which is why the job stays in the queue.
  2. You cannot delete the ngpus resource because some jobs may have requested it, including jobs stored in the job history.
  • So you need to delete all jobs (queued and running) that have requested ngpus.
  • Also delete the jobs in the history that requested ngpus, or reset the job history.

Then you can delete the ngpus resource.
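
The steps above could be sketched roughly as follows. This is only an illustration, not a verified procedure: the helper name ngpus_removal_plan is hypothetical, the node name e006 is taken from the question, and the first command deletes every job (including history jobs), not just the ones that requested ngpus. The function only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) a cleanup plan for one node.
# WARNING: the first printed command removes ALL jobs, history included.
ngpus_removal_plan() {
    node=$1
    cat <<EOF
qdel -x \$(qselect -x)
qmgr -c "unset node $node resources_available.ngpus"
qmgr -c "delete resource ngpus"
EOF
}
```

Calling ngpus_removal_plan e006 prints the three commands in order. If qselect -x returns no jobs, the qdel line can simply be skipped.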

Thanks for your reply, Mr. adarsh. All jobs in PBS have been deleted; I checked with qstat -a and it outputs nothing. The resource must still be held somewhere, but I can’t find out where. I can’t delete either the node or the resource.

I found that the node has an extra [0] entry, but I never configured anything like this.


If I delete e006[0] first, I can then delete e006.
I don’t know where the [0] came from.

To view jobs:

qstat -x     : all jobs
qstat -H     : only jobs that have completed, moved, or been deleted
qstat -xH    : includes both of the above
qstat -fx | grep -e "Job Id" -e ngpus    : job IDs together with any ngpus requests
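
To narrow the last command down to only the jobs that actually request ngpus, a small filter along these lines may help. It is a sketch that assumes qstat -fx starts each record with a "Job Id:" line and prints the request as "Resource_List.ngpus = N"; the function name jobs_with_ngpus is made up for this example.

```shell
#!/bin/sh
# Hypothetical filter: reads `qstat -fx` output on stdin and prints
# "<job id> <ngpus>" for every job that has a Resource_List.ngpus entry.
jobs_with_ngpus() {
    awk '
        /^Job Id:/ { id = $3 }                          # remember the current job
        /Resource_List\.ngpus[ \t]*=/ { print id, $NF } # report its ngpus request
    '
}
```

Usage would be: qstat -fx | jobs_with_ngpus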

To view nodes:

pbsnodes -av
pbsnodes -aSjv

Yes, if you have a vnode configuration (configv2 or via cgroups):

  • you need to delete the vnodes first (cnode[0], cnode[1]) and then the natural vnode, cnode
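
The deletion order described above could be generated with a sketch like the one below. It assumes that pbsnodes -av prints each vnode name as a record header starting in the first column, with the attributes indented beneath it; the function name node_delete_plan is hypothetical. It only prints qmgr commands, child vnodes first and the natural vnode last, so the list can be checked before running it.

```shell
#!/bin/sh
# Hypothetical helper: reads `pbsnodes -av` output on stdin and prints the
# qmgr commands to remove one host, child vnodes first, natural vnode last.
node_delete_plan() {
    host=$1
    awk -v host="$host" '
        # Record headers (vnode names) start in the first column.
        /^[^ \t]/ {
            name = $1
            if (name == host)
                natural = 1
            else if (index(name, host "[") == 1)
                print "qmgr -c \"delete node " name "\""
        }
        END { if (natural) print "qmgr -c \"delete node " host "\"" }
    '
}
```

For the node in this thread, pbsnodes -av | node_delete_plan e006 should list the command for e006[0] before the one for e006. After reviewing the output, it can be piped to sh.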