I’ve configured a Torque PBS cluster with two machines, ip1 and ip2. ip1 acts as the server, with torque-server, torque-mom and torque-scheduler installed; ip2 is just a compute node with torque-mom. The configuration looks fine: pbsnodes on both machines returns:
cuda
     state = free
     np = 16
     ntype = cluster
     status = rectime=1519887342,varattr=,jobs=,state=free,netload=2829068930,gres=cuda:,loadave=0.50,ncpus=16,physmem=132036652kb,availmem=134818552kb,totmem=135943208kb,idletime=2822,nusers=2,nsessions=2,sessions=1363 4658,uname=Linux cuda 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29 20:22:11 UTC 2016 x86_64,opsys=linux

cuda2
     state = free
     np = 4
     ntype = cluster
     status = rectime=1519887335,varattr=,jobs=,state=free,netload=71522585,gres=,loadave=0.00,ncpus=4,physmem=16432464kb,availmem=18032520kb,totmem=18384204kb,idletime=2880,nusers=3,nsessions=15,sessions=1575 1584 1604 1646 1647 1648 1649 1650 1651 1653 1655 1703 1726 18189 18257,uname=Linux IU6-2 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64,opsys=linux
Only ip1 will be used by multiple users to run jobs. To keep Torque from using scp for file transfers, I’ve also set up an NFS server on ip1, mounted ip1’s /home folder at /mnt/home on ip2 and, following section 13.9.2.1 “Configuring the $usecp MoM Parameter” of https://pbsworks.com/documentation/support/PBSProAdminGuide12.pdf, added
$usecp ip1:/home/ /mnt/home/
to the file /var/spool/torque/mom_priv/config on ip2. Then I tried to run a simple script with qsub across both nodes:
#!/bin/bash
#PBS -l nodes=2
#PBS -k o
#PBS -j oe
$PBS_O_WORKDIR/test
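To make my expectation explicit, here is a minimal sketch of the path rewrite I assume the $usecp line above should perform on ip2. It only illustrates my understanding of the rule and is not Torque's actual code: usecp_map is a hypothetical helper, and the user name and file name in the example call are made up.

```shell
#!/bin/bash
# Illustration only: mimics the prefix rewrite I expect from
#   $usecp ip1:/home/ /mnt/home/
# usecp_map is a hypothetical helper, not part of Torque.
usecp_map() {
    local dest="$1"                 # destination in host:/path form
    local rule_host="ip1"           # host part of the $usecp rule
    local rule_prefix="/home/"      # remote path prefix of the rule
    local local_prefix="/mnt/home/" # where that prefix is mounted on ip2

    local host="${dest%%:*}"        # text before the first colon
    local path="${dest#*:}"         # text after the first colon

    if [ "$host" = "$rule_host" ] && [ "${path#"$rule_prefix"}" != "$path" ]; then
        # Rule matches: swap the remote prefix for the local mount point,
        # so a plain cp could be used instead of scp.
        echo "${local_prefix}${path#"$rule_prefix"}"
    else
        # No match: the destination is left as-is (a remote copy would be needed).
        echo "$dest"
    fi
}

# Hypothetical destination, for illustration
usecp_map "ip1:/home/user/out.txt"
```

With this rule I would expect a file destined for ip1:/home/user/out.txt to be written by ip2 to /mnt/home/user/out.txt over NFS, with no scp involved.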
In the qstat -f output I see:
//…
job_state = C
//…
exit_status = -1
//…
But there are no output files, and the MOM log on ip1 shows:
pbs_mom;Job;274.localhost;ERROR: received request 'ABORT_JOB' from ip2:1023 for job '274.localhost' (job does not exist locally)
What am I doing wrong?
Thanks in advance.