PBS configuration for HA and resource enforcement

Hello,

How can we achieve the configuration below in PBS Pro?

–PBS HA (master1 and master2)
–checkpoint users' jobs
–email job-completion notices to the users
–job statistics
–job filters
–restart jobs if a node restarts or the service has any issues
–all jobs run in shared mode
–Resources (CPU/memory) should be enforced; a user's job should not be able to fork extra processes or grow beyond its request
–how to set up a GPU resource

Can we set up a job queue as below?

a) Large Queue

–Priority 100
–Include users/group
–Exclude users/group
–Max nodes: 5
–Max jobs: 4
–Max jobs per user: 2
–Max resources per node
–Checkpoint
–Wall clock limit: 1 day or 86400 seconds
–cpu limit
–gpu limit
–max slots/cpus/tasks
–Max idle time

Thanks for your inputs.

Please refer to this document: High-performance Computing (HPC) and Cloud Solutions | Altair, for the sections mentioned below.

Chapter – 9.2 Failover
PBS Professional 19.2 Administrator’s Guide AG-391
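
In short, failover is driven by a few /etc/pbs.conf entries on both server hosts, plus a PBS_HOME on shared storage; a minimal sketch using the hostnames from your post:

PBS_PRIMARY=master1
PBS_SECONDARY=master2
PBS_HOME=/var/spool/pbs    # must be on shared storage visible to both hosts, with working file locking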

Chapter – 9.3 Checkpoint and Restart
PBS Professional 19.2 Administrator’s Guide AG-413
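
For site-defined checkpointing, MoM can be pointed at checkpoint/restart scripts; a minimal sketch, assuming hypothetical script paths, added to PBS_HOME/mom_priv/config on each execution host (restart pbs_mom afterwards):

$action checkpoint 60 !/opt/scripts/ckpt.sh %jobid %sid %taskid %path      # invoked when PBS checkpoints a job
$action restart 60 !/opt/scripts/restart.sh %jobid %sid %taskid %path      # invoked when PBS restarts it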

qsub -m abe -- /bin/sleep 1000    # -m abe: mail on abort (a), begin (b), and end (e)
"man qsub" should be able to help.

Please try these commands (man qstat):
qstat -answ1          # all jobs, with node lists and comments, one entry per line
qstat -fx -F json     # full status, including finished jobs, as JSON
qstat -fx             # full status, including finished jobs

Check the qselect command (man qselect), e.g. to filter for running jobs:
qselect -s R          # print the IDs of jobs in state R (running)

qmgr -c "set server node_fail_requeue=1"   # requeue running jobs 1 second after their vnode is reported down

Refer: 9.6.2 Node Fail Requeue: Jobs on Failed Vnodes
PBS Professional 19.2 Administrator’s Guide AG-439

Use qsub's -l place option; a sketch follows the reference below.

Refer: 4.7 Specifying Job Placement
PBS Professional 19.2 User’s Guide UG-65
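
A sketch for the shared-mode requirement (the job script name is just an example):

qsub -l place=shared job.sh                                # request shared placement for a single job
qmgr -c "set queue large resources_default.place=shared"   # or make shared placement the default for the queue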

This can be done using the cgroups hook; please refer to the cgroups section in the guide above, and see the sketch below.
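
A minimal sketch, assuming the stock pbs_cgroups hook that ships with 19.2 (disabled by default); CPU/memory fencing details are tuned in the hook's JSON configuration:

qmgr -c "set hook pbs_cgroups enabled=true"   # turn on the cgroups hook so jobs are confined to their cpu/mem request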

Refer: 5.14.7 Using GPUs
AG-280 PBS Professional 19.2 Administrator’s Guide
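
A sketch of the basic GPU-scheduling setup from that section (the node name is just an example):

qmgr -c "create resource ngpus type=long, flag=nh"        # consumable, host-level GPU resource
qmgr -c "set node gpunode01 resources_available.ngpus=2"  # this node offers 2 GPUs
# then add ngpus to the "resources:" line in PBS_HOME/sched_priv/sched_config and HUP the scheduler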

################
Large Queue
################

qmgr -c "create queue large  queue_type=execution,enabled=true,started=true"
qmgr -c "set queue large priority = 100"
qmgr -c "set queue large  acl_user_enable=true"   # user can be replaced with user(s), group(s), host(s)
qmgr -c "set queue large  acl_user+=user01@* "    # user can be replaced with user(s), group(s), host(s)
qmgr -c "set queue large  resources_available.ncpus=12"  # limiting to 12 cpu cores for this queue
qmgr -c "set queue large  resources_available.ngpus=2"   # limiting to 2 gpus for this queue
qmgr -c "set queue large resources_avaialble.walltime=24:00:00 # limiting to 24 hours of walltime
qmgr -c "set queue large resources_default.walltime=24:00:00" # if walltime is no specified, it is set to 24 hours
qmgr -c "set queue large resources_max.walltime=24:00:00"  # max request cannot exceed 24 hours
qmgr –c "s q radioss max_run = '[o:PBS_ALL=4]'"  # Limit the total number of running jobs for all users to 4
qmgr –c "s q radioss max_run += '[o:PBS_GENERIC=2]'" # limit each individual user to 2 jobs

Refer: 5.15.1.9 How to Set Limits at Server and Queues
PBS Professional 19.2 Administrator’s Guide AG-293
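
To sanity-check the queue, a submission could look like this (script name and chunk sizes are just examples):

qsub -q large -l select=1:ncpus=4:ngpus=1 -l walltime=12:00:00 job.sh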

–Max nodes: 5
[Answer]: The scheduling of jobs is based on cores, not on the number of nodes.

–max slots/cpus/tasks
[Answer]: This can be controlled through the node configuration; see the sketch below.
If a compute node has 10 cores, jobs can utilize up to those 10 cores on it.
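
For example, capping a node at 10 schedulable cores (the node name is just an example):

qmgr -c "set node node01 resources_available.ncpus=10"   # jobs on node01 can consume at most 10 cores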

–Max idle time
[Answer]: If the nodes are not running any jobs, they are free to accept new ones.
I did not understand the requirement here for the queues; could you clarify?

Thank you very much, Adarsh. Your insights helped us kick-start our PBS deployment.

Hello,

Errors: PBS Pro DBlock error in the HA configuration.

Details:
A Lustre filesystem is mounted as /var/spool/pbs on the master-1 and master-2 nodes of the HPC cluster.
The primary PBS server is master-1 and the secondary is master-2. This worked well for some time, but after the servers were rebooted, the PBS service fails with a DBlock error. Our /etc/pbs.conf is below.

PBS_EXEC=/opt/pbs
PBS_SERVER=master
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_PRIMARY=master1
PBS_SECONDARY=master2

What is causing this error? Is this the right HA configuration?
The installation engineer suggests this could be due to Lustre not allowing PBS to run in HA; is that so?

Thank you.

Please note, I am just assuming here; more information would be visible in the PBS server logs.
$PBS_HOME should be mounted on a filesystem that has a working file-locking mechanism.
Lustre is fine, but the file locking must not be client-local.
For example, NFS file locking, or any globally visible file-locking mechanism, should be used.
I am not sure whether Lustre is a good fit for HA; it might have drawbacks under a large amount of job I/O.
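
One quick way to check whether locks are visible cluster-wide (paths are just examples; on Lustre this typically needs the flock mount option rather than localflock):

flock -n /var/spool/pbs/locktest -c "sleep 60" &                                 # on master1: take and hold the lock
flock -n /var/spool/pbs/locktest -c "echo lock acquired, locking is NOT global"  # on master2, while the above runs: should fail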

Also, always make sure there is some time gap when bringing the primary and secondary up or down during your tests; otherwise you might create a split-brain situation where each server thinks it has control. It all comes down to the file-locking mechanism.