Unable to submit a job to the compute node (node01)

Hi,
I have installed PBS Pro 19.1.1 on the master (PBS server) and on node01.

On the master I can submit a job and I get the STDIN.o3 and STDIN.e3 files as user pbsdata (the pbsdata user exists on the master machine):
qsub -l select=1:ncpus=1:mem=100mb:host=master -- /bin/sleep 10

But when I submit the job to node01:
qsub -l select=1:ncpus=1:mem=100mb:host=node01 -- /bin/sleep 10

I get:
[pbsdata@master ~]$ qstat -ans

master:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
7.master        pbsdata  workq    STDIN         --    1   1  100mb   --  H   --
   Job held, too many failed attempts to run

My MoM logs:
cat /var/spool/pbs/mom_logs/20200304 | grep 7.master
03/04/2020 12:57:44;0028;pbs_mom;Job;7.master;No Password Entry for User pbsdata
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;kill_job
03/04/2020 12:57:44;0100;pbs_mom;Job;7.master;node01 cput= 0:00:00 mem=0kb
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;no active tasks
03/04/2020 12:57:44;0100;pbs_mom;Job;7.master;Obit sent
03/04/2020 12:57:44;0080;pbs_mom;Job;7.master;delete job request received
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;kill_job

[pbsdata@master ~]$ pbsnodes -a
master
Mom = master
ntype = PBS
state = free
pcpus = 2
resources_available.arch = linux
resources_available.host = master
resources_available.mem = 2046864kb
resources_available.ncpus = 2
resources_available.vnode = master
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Mar 4 13:10:28 2020
last_used_time = Wed Mar 4 14:24:20 2020

node01
Mom = node01
ntype = PBS
state = free
pcpus = 2
resources_available.arch = linux
resources_available.host = node01
resources_available.mem = 2046864kb
resources_available.ncpus = 2
resources_available.vnode = node01
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Mar 4 13:10:28 2020
last_used_time = Wed Mar 4 13:21:27 2020

[pbsdata@master ~]$

[pbsdata@master ~]$ cat /etc/pbs.conf
PBS_EXEC=/share/apps/platform/pbs
PBS_HOME=/var/spool/pbs
PBS_SERVER=master
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_CORE_LIMIT=unlimited
PBS_RCP=/bin/false
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh

[root@node01 ~]# cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_SERVER=master
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_CORE_LIMIT=unlimited
PBS_RCP=/bin/false
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh

Please help me submit a job successfully as the pbsdata user (on the master machine) to node01.

Regards,
Zain

You have to have the pbsdata user on node01.

Henry Wu|吴光宇

+1 @wgy

Can you SSH into node01 as "pbsdata"? And does the home directory for pbsdata exist?
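
A quick way to check (a sketch; standard Linux tools, and /home/pbsdata is an assumed path):

    # on node01, as root
    getent passwd pbsdata          # should print a passwd entry; empty output means the user is unknown
    ls -ld /home/pbsdata           # verify the home directory exists

    # from the master, as pbsdata
    ssh pbsdata@node01 hostname    # should print "node01" without a password prompt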


+1 @wgy @adarsh

By default the job runs as the job owner (pbsdata in your case), but you can use the -u option to run under a different user on the compute nodes:

   -u user_list
           List  of usernames.  Job is run under a username from this list.  Sets job's User_List attribute to
           user_list.  Default: job owner (username on submit host.)  Format of user_list:

                  user[@host][,user@host ...]
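
For example (a sketch; "otheruser" is a hypothetical account that must exist on node01 and authorize the submitting user, as described below):

    qsub -u otheruser@node01 -l select=1:ncpus=1:mem=100mb:host=node01 -- /bin/sleep 10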

Please follow this link

~/.rhosts should be populated with the submitting host(s) and username(s) if userA wants to submit job(s) to run as (an)other user(s).
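
A sketch of what that might look like (host and user names are examples):

    # ~/.rhosts in otheruser's home directory on the execution host
    master pbsdata
    node01 pbsdata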

Thank you for your response, Adarsh.
The pbsdata user's home directory did not exist on node01. I have created the home directory and set up passwordless SSH between the pbsdata users on master and node01.

Now I am able to submit jobs to node01.

Finally, can you explain whether a PBS Pro cluster needs to have the same users on all the VMs in the cluster?
For example:
Master (PBS Server/ Head node/login node)
Node01 (computenode)
Node02 (computenode)

Here, should all the nodes have the test users, with passwordless SSH between those users?

Regards,
Zain

In any cluster environment, the user needs seamless SSH access across the nodes of the cluster, with host key checking disabled or the host keys already approved. Basically, no password should be asked when SSH'ing (a setup sketch follows the list):

  • master / headnode to compute node(s)
  • compute node(s) to headnode
  • compute node(s) to compute node(s)
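
A minimal sketch of setting that up for one user (assumes OpenSSH; generate the key once, then copy it to every node):

    # as pbsdata, on the master
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # create a passphrase-less key pair
    ssh-copy-id pbsdata@node01                 # append the public key to node01's authorized_keys

    # optionally relax host key prompts in ~/.ssh/config (a security trade-off)
    # Host *
    #     StrictHostKeyChecking no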

Usually, in a cluster environment the user home directories are shared (mounted across all the compute nodes). Also, NIS / PBIS / other directory services might be used for storing user account details.
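
For instance, an NFS-shared /home could be mounted on each compute node with an /etc/fstab entry like this (server name and export path are assumptions):

    master:/home    /home    nfs    defaults    0 0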

Thanks for clarifying SSH across the nodes.

Please guide me on restricting cores per user/group (or for all users) when they submit jobs.
For example:
Master: 24 cores
Node01: 24 cores
Total: 48 cores
IT group: may use only 10 cores
Bio group: may use only 20 cores
Chem group: may use only 10 cores
How can I enforce this scenario through their default queues?
Also, if in a top-priority case I want to use 40 cores, how can I submit that job?

Please guide me on configuring these scenarios.

Regards,
Zain

Please refer to this documentation:

and in particular section 5.15.1.9.ii, "Examples of Setting Server and Queue Limits".
This will cover all your use cases.

Are these (IT, Bio, Chem) Linux groups?
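
If they are, the limits could look roughly like this (a sketch using the server-level limit attributes described in that section; group names are from your example):

    qmgr -c "set server max_run_res.ncpus = [g:IT=10]"
    qmgr -c "set server max_run_res.ncpus += [g:Bio=20]"
    qmgr -c "set server max_run_res.ncpus += [g:Chem=10]"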

Thanks for sharing the link and the section, Adarsh.
I will check and get back to you.

Regards,
Zain

Hi @adarsh, I reinstalled our compute node and installed PBS.
Since then, all submitted jobs are held. Both qstat -answ and qstat -f say:
job held, too many failed attempts to run

I probed deeper and below are some lines from the logs:

02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;compute-0-1 cput= 0:00:00 mem=0kb
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;no active tasks
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;Obit sent
02/11/2021 15:26:02;0080;pbs_mom;Job;111852.master1.local;delete job request received
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job
02/11/2021 15:26:02;0028;pbs_mom;Job;111852.master1.local;No Password Entry for User = jkzabee
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;compute-0-1 cput= 0:00:00 mem=0kb
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;no active tasks
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;Obit sent
02/11/2021 15:26:02;0080;pbs_mom;Job;111852.master1.local;delete job request received
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job

Please advise. Thanks.

Please check the compute node for the following:

  1. whether the user's home directory exists
  2. whether you can log in as that user on that compute node without any issues (passwordless SSH working both ways: server to node, node to server)
  3. whether you can log in to that compute node as that user with a password

These hints should help you resolve it. Thank you for sharing.

Hi @adarsh.
When I SSH to the compute node, it asks for a password.

I also realized that the compute node has no record of the users. Can you advise how to update the compute node with the user accounts?

For the home directories, we use a shared directory which is mounted on all nodes.

  1. Manually create the users (with the same passwords) on the compute nodes, pointing them at the mounted home directory in the /etc/passwd file. After doing this, make sure you can log in and land in the home directory at the prompt (see the sketch after this list).

  2. Using NIS or PBIS or LDAP
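
A sketch of option 1 (names, IDs, and paths are examples; keeping the UID the same as on the head node matters so file ownership on the shared home matches):

    # on each compute node, as root
    useradd -u 1001 -M -d /home/jkzabee jkzabee   # -M: don't create the (already shared) home dir
    passwd jkzabee                                # set the same password as on the head node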

Hi @adarsh.
I have created the user account on both the login and compute nodes. Now I can SSH to the compute node without issues. But the job still ends up in the held state. It first gets queued, and qstat -f says: Not Running: Insufficient amount of resource: host

[quote=“adarsh, post:14, topic:2039”]
Using NIS or PBIS or LDAP
[/quote]
Hi, we are not using any of those three; we create users with the conventional Linux useradd.

Please share the qstat -fx <job id> and pbsnodes -av output.
The job request could not be satisfied by any of the available nodes, hence the scheduler message.

Hello Adarsh,

Thanks for the help. I was finally able to get PBS to run the jobs.
What I found was that the user accounts did not exist on the compute nodes. Manually creating the same users on the compute nodes did not work; what worked was to copy /etc/passwd from the master node to all the compute nodes. I also reinstalled PBS on the compute nodes. Thanks once again.


Dear @vincent718, I am having a similar issue. So you copied the passwd file from the head node to the slave nodes, replacing the passwd files on the slave nodes?

Yes, that is what I did. You should also copy the group file.
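
In command form, roughly (a sketch; the node name is an example, and whether /etc/shadow also needs copying for password logins is an assumption not covered above):

    scp /etc/passwd /etc/group root@compute-0-1:/etc/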
