Adding GPU nodes

We have 4 compute nodes, each with one RTX2080 GPU installed.

I am trying to follow the procedure “Simple GPU Scheduling with Exclusive Node Access”.

When I run this command:
qmgr -c “create resource rtx2080 type=Boolean,flag=h”

it gives the following error:
qmgr obj=rtx2080 svr=default: Illegal attribute or resource value
qmgr: Error (15014) returned from server

Can anyone help?

Please try this command (it is better to type it by hand, since copy-paste sometimes causes issues):
qmgr -c "create resource rtx2080 type=boolean,flag=h"
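For anyone comparing the two commands: the one that failed, as displayed in this thread, has curly quotes (“ ”) and a capitalised `Boolean`, while the working one has straight quotes and lowercase `boolean`. The thread does not establish which of the two caused the error, and forum software may also have changed the quotes on display. As a rough sketch, a pasted command can be checked for curly quotes before running it. The helper name `has_smart_quotes` is made up for this example and is not part of PBS.

```shell
#!/bin/bash
# Hypothetical helper (not a PBS tool): succeeds if the argument
# contains a curly single or double quote character.
has_smart_quotes() {
  case "$1" in
    *“*|*”*|*‘*|*’*) return 0 ;;
    *) return 1 ;;
  esac
}

# The two commands as they appear in this thread.
pasted='qmgr -c “create resource rtx2080 type=Boolean,flag=h”'
typed='qmgr -c "create resource rtx2080 type=boolean,flag=h"'

for cmd in "$pasted" "$typed"; do
  if has_smart_quotes "$cmd"; then
    echo "curly quotes found, retype by hand: $cmd"
  else
    echo "plain quotes only: $cmd"
  fi
done
```

The shell does not treat curly quotes as quoting characters, so a command containing them is split into words differently from the straight-quoted version.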

Hi Adarsh,

Thanks, it worked. I then added the resource to a node named b14.
Below is the output of pbsnodes -a for b14, which shows resources_available.rtx2080 = True:

b14
Mom = b14.shukra.aero.iitb.ac.in
ntype = PBS
state = free
pcpus = 32
resources_available.arch = linux
resources_available.host = b14
resources_available.mem = 32948660kb
resources_available.ncpus = 32
resources_available.rtx2080 = True
resources_available.vnode = b14
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Jul 6 09:00:56 2022
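Since there are 4 GPU nodes, it may help to list every node that already has the flag instead of reading the output node by node. Below is a minimal sketch that filters `pbsnodes -a`-style text. The function name `nodes_with_resource` is hypothetical, and it assumes the layout shown above: a bare node name on its own line, followed by `name = value` attribute lines.

```shell
#!/bin/bash
# Hypothetical helper: reads pbsnodes-style text on stdin and prints the
# names of nodes where resources_available.<resource> is True.
nodes_with_resource() {
  awk -v res="resources_available.$1" '
    NF == 1 && $0 !~ /=/ { node = $1; next }            # bare name starts a node record
    $1 == res && $2 == "=" && $3 == "True" { print node }
  '
}

# Sample input modelled on the output above; b15 is an invented second
# node without the resource, added only for illustration.
sample='b14
     state = free
     resources_available.ncpus = 32
     resources_available.rtx2080 = True

b15
     state = free
     resources_available.ncpus = 32'

printf '%s\n' "$sample" | nodes_with_resource rtx2080

# On a real cluster the input would come from PBS instead:
#   pbsnodes -a | nodes_with_resource rtx2080
```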


Now, how do I submit a CUDA job to that node?
I tried it as written in the document, but it gives an error.

Hi @adarsh, I followed your steps from here: PBS Single exection host run job using cpu include gpu - #2 by adarsh

And it's working fine. I can submit a CUDA job by using:
#PBS -l select=1:ncpus=1:ngpus=1

But now a strange line appears in all output files:

/var/spool/pbs/mom_priv/jobs/187.b0.SC: line 14: =1: command not found

This started to appear only after I added the ngpus resource. I have restarted the PBS service.
Do I need to check something?

Thank you

Please share your job submission script, or at least line 14 of it.

Hi @adarsh

Below is the script:

#!/bin/bash

#PBS -N testing

#PBS -q big

#PBS -l select=1:ncpus=1:ngpus=1

#PBS -j oe

#PBS -V

#PBS -o log.out

cd $PBS_O_WORKDIR

cat $PBS_NODEFILE > ./pbsnodes

#$PROCS1=cat ./pbsnodes|wc -l

nvcc -o g.out hello-world.cu

$HOME/gpu/g.out

/bin/hostname


After commenting out the line $PROCS1=cat ./pbsnodes|wc -l, the error is gone. But earlier it used to work without any error, so I am not sure why this line causes problems now.

Could you please replace that line with this:
PROCS1=$(cat $PBS_NODEFILE | wc -l)

You can remove this line
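A likely explanation for the `=1: command not found` message, sketched below. It assumes the original line had backticks around `cat ./pbsnodes|wc -l` that the forum rendering dropped. The thread does not confirm this, but it would account for the `=1`. A leading `$` on the left-hand side makes the shell expand the (unset) variable instead of assigning to it, so what is left is the word `=1`, which the shell then tries to run as a command. The temporary file below is a stand-in for `$PBS_NODEFILE` so that the sketch runs outside PBS.

```shell
#!/bin/bash
# Stand-in for $PBS_NODEFILE (assumption: one allocated host, one line).
nodefile=$(mktemp)
printf 'b14\n' > "$nodefile"

# Broken form: PROCS1 is unset, so "$PROCS1=..." expands to "=1",
# and the shell tries to execute "=1" as a command.
unset PROCS1
broken_msg=$( { $PROCS1=$(wc -l < "$nodefile" | tr -d ' '); } 2>&1 ) || true
echo "broken form reports: $broken_msg"

# Fixed form: no "$" on the left-hand side of an assignment.
PROCS1=$(wc -l < "$nodefile" | tr -d ' ')
echo "PROCS1=$PROCS1"

rm -f "$nodefile"
```

Why the line ran without an error earlier is not established in this thread.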

Hi @adarsh, thanks a lot. It worked.
