Hi,
I have installed PBS Pro 19.1.1 on master (PBS server) and node01.
On the master I am able to submit a job as the pbsdata user (pbsdata exists on the master machine) and I get the STDIN.o3 and STDIN.e3 files:
qsub -l select=1:ncpus=1:mem=100mb:host=master -- /bin/sleep 10
But when I submit the job to node01:
qsub -l select=1:ncpus=1:mem=100mb:host=node01 -- /bin/sleep 10
I am getting:
[pbsdata@master ~]$ qstat -ans
master:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
7.master        pbsdata  workq    STDIN         --    1   1  100mb   --  H   --
   job held, too many failed attempts to run
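On the PBS server, tracejob can also gather a job's log entries from the server and MoM logs in one place:

tracejob 7.master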
My logs:
cat /var/spool/pbs/mom_logs/20200304 | grep 7.master
03/04/2020 12:57:44;0028;pbs_mom;Job;7.master;No Password Entry for User pbsdata
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;kill_job
03/04/2020 12:57:44;0100;pbs_mom;Job;7.master;node01 cput= 0:00:00 mem=0kb
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;no active tasks
03/04/2020 12:57:44;0100;pbs_mom;Job;7.master;Obit sent
03/04/2020 12:57:44;0080;pbs_mom;Job;7.master;delete job request received
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;kill_job
By default, the job runs as the job owner (pbsdata here), but you can pass the -u option to qsub to run under a different user on the compute nodes:
-u user_list
        List of usernames. The job is run under a username from this list. Sets the job's
        User_List attribute to user_list. Default: job owner (username on the submit host).
        Format of user_list: user[@host][,user@host ...]
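For example, a hypothetical submission running the job as a different account (testuser must exist, and be authorized, on node01):

qsub -l select=1:ncpus=1:mem=100mb:host=node01 -u testuser -- /bin/sleep 10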
Thank you for your response, Adarsh.
The pbsdata user's home directory did not exist on node01; I created the home directory and set up passwordless SSH between the pbsdata users on master and node01.
Now I am able to submit jobs to node01.
Finally, can you explain whether a PBS Pro cluster needs to have the same user on all the VMs in the cluster?
For example:
Master (PBS server / head node / login node)
Node01 (compute node)
Node02 (compute node)
Here, should all the nodes have the same test user, with passwordless SSH between the test users?
In any cluster environment, the user needs seamless SSH access across the nodes of the cluster, with host key checking disabled or the host keys already approved. Basically, no password should be asked when SSH'ing from:
master / headnode to compute node(s)
compute node(s) to headnode
compute node(s) to compute node(s)
Usually, in a cluster environment the user home directory is shared (mounted) across all the compute nodes. Also, NIS / PBIS / other directory services might be used for storing client/server/user details.
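A minimal sketch of setting this up for one user, assuming the home directory is shared (mounted) across the nodes:

# run once as the user on the head node; the shared home makes the key visible everywhere
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
# optionally skip the host-key prompt inside the cluster
echo "StrictHostKeyChecking no" >> ~/.ssh/config && chmod 600 ~/.ssh/config
# verify: this must print the hostname without asking for a password
ssh node01 hostname

If home directories are not shared, running ssh-copy-id against each node achieves the same result.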
Please guide me on how to restrict cores based on user/group when users submit jobs.
For example:
Master - 24 cores
Node01 - 24 cores
Total: 48 cores
The IT group may use only 10 cores
The Bio group may use only 20 cores
The Chem group may use only 10 cores
How can I enforce this scenario through their default queues?
And if in some case I want a top-priority job to use 40 cores, how can I submit it?
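One way to sketch this with qmgr (the queue names itq/bioq/chemq and the Linux group names it/bio/chem are assumptions; adjust to your site):

# one execution queue per group, gated by a group ACL
qmgr -c "create queue itq queue_type=execution"
qmgr -c "set queue itq enabled = true"
qmgr -c "set queue itq started = true"
qmgr -c "set queue itq acl_group_enable = true"
qmgr -c "set queue itq acl_groups = it"
# cap the total cores the queue's running jobs may hold at once
qmgr -c "set queue itq max_run_res.ncpus = [q:PBS_ALL=10]"
# repeat for bioq (20 cores) and chemq (10 cores)

For the occasional top-priority 40-core run, one option is a separate uncapped queue with a higher priority attribute, submitted to with qsub -q <queue>.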
Hi @adarsh, I reinstalled our compute node and installed PBS.
Since then, all submitted jobs are held; qstat -answ and qstat -f both say
job held, too many failed attempts to run
I probed deeper, and below are some lines from the logs:
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;compute-0-1 cput= 0:00:00 mem=0kb
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;no active tasks
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;Obit sent
02/11/2021 15:26:02;0080;pbs_mom;Job;111852.master1.local;delete job request received
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job
02/11/2021 15:26:02;0028;pbs_mom;Job;111852.master1.local;No Password Entry for User = jkzabee
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;compute-0-1 cput= 0:00:00 mem=0kb
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;no active tasks
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;Obit sent
02/11/2021 15:26:02;0080;pbs_mom;Job;111852.master1.local;delete job request received
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job
Manually create the users with the same password on the compute nodes, and reference the mounted home directory in the /etc/passwd file. After doing this, make sure you can log in and land in the home directory at the prompt.
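A quick check, using the failing account from the logs above (jkzabee) as the example:

# on the compute node: does the account resolve, and does its home directory exist?
getent passwd jkzabee
su - jkzabee -c pwd
# once the account works, release the held job
qrls 111852.master1.local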
Hi @adarsh,
I have created the user account on both the login and compute nodes. Now I can SSH to the compute node without issues, but the job remains in the held state. It first gets queued, and qstat -f says: Not Running: Insufficient amount of resource: host
[quote=“adarsh, post:14, topic:2039”]
Using NIS or PBIS or LDAP
[/quote] Hi, we are not using any of those three; we create users with the conventional Linux useradd.
Please share the output of qstat -fx <job id> for that job and pbsnodes -av.
The job request could not be satisfied by any of the available nodes, hence the scheduler message.
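For instance, "Insufficient amount of resource: host" typically means the host= value in the select statement matches no node's resources_available.host, which you can cross-check with:

pbsnodes -av | grep -E "Mom|resources_available.host"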
Thanks for the help. I was finally able to get PBS to run the jobs.
What I found out was that the user accounts did not exist on the compute nodes. Manually creating the same users on the compute nodes did not work; what worked was copying /etc/passwd from the master node to all the compute nodes. I also reinstalled PBS on the compute nodes. Thanks once again.
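For anyone following along, a rough sketch of that copy step (the node names are assumptions, and this wholesale replacement is only safe when all nodes were installed from the same image; login passwords live in /etc/shadow, not /etc/passwd):

for node in compute-0-0 compute-0-1; do
    scp /etc/passwd /etc/group root@$node:/etc/
done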
Dear @vincent718, I am having a similar issue. So you copied the passwd file from the head node to the slave nodes, replacing the passwd file on the slave nodes?