Unable to submit the job on compute node(node01)

zainul1114 · March 4, 2020, 9:21am

Hi,
I have installed PBS Pro 19.1.1 on master (the PBS server) and on node01.

On master I can submit a job as the pbsdata user (which exists on the master machine), and I get the STDIN.o3 and STDIN.e3 files:
qsub -l select=1:ncpus=1:mem=100mb:host=master -- /bin/sleep 10

But when I submit the job to node01:
qsub -l select=1:ncpus=1:mem=100mb:host=node01 -- /bin/sleep 10

I get:
[pbsdata@master ~]$ qstat -ans

master:
                                                 Req'd  Req'd   Elap
Job ID    Username Queue  Jobname SessID NDS TSK Memory Time  S Time

7.master  pbsdata  workq  STDIN       --   1   1  100mb    -- H   --
    --
    job held, too many failed attempts to run

my logs:
cat /var/spool/pbs/mom_logs/20200304 |grep 7.master
03/04/2020 12:57:44;0028;pbs_mom;Job;7.master;No Password Entry for User pbsdata
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;kill_job
03/04/2020 12:57:44;0100;pbs_mom;Job;7.master;node01 cput= 0:00:00 mem=0kb
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;no active tasks
03/04/2020 12:57:44;0100;pbs_mom;Job;7.master;Obit sent
03/04/2020 12:57:44;0080;pbs_mom;Job;7.master;delete job request received
03/04/2020 12:57:44;0008;pbs_mom;Job;7.master;kill_job

[pbsdata@master ~]$ pbsnodes -a
master
Mom = master
ntype = PBS
state = free
pcpus = 2
resources_available.arch = linux
resources_available.host = master
resources_available.mem = 2046864kb
resources_available.ncpus = 2
resources_available.vnode = master
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Mar 4 13:10:28 2020
last_used_time = Wed Mar 4 14:24:20 2020

node01
Mom = node01
ntype = PBS
state = free
pcpus = 2
resources_available.arch = linux
resources_available.host = node01
resources_available.mem = 2046864kb
resources_available.ncpus = 2
resources_available.vnode = node01
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Mar 4 13:10:28 2020
last_used_time = Wed Mar 4 13:21:27 2020

[pbsdata@master ~]$

[pbsdata@master ~]$ cat /etc/pbs.conf
PBS_EXEC=/share/apps/platform/pbs
PBS_HOME=/var/spool/pbs
PBS_SERVER=master
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_CORE_LIMIT=unlimited
PBS_RCP=/bin/false
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh

[root@node01 ~]# cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_SERVER=master
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_CORE_LIMIT=unlimited
PBS_RCP=/bin/false
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=/usr/bin/ssh

Please help me submit jobs successfully from the pbsdata user (on the master machine) to node01.

Regards,
Zain

wgy · March 4, 2020, 2:08pm

You have to have the pbsdata user on node01.

Henry Wu|吴光宇

adarsh · March 4, 2020, 4:28pm

+1 @wgy

Can you log in to node01 over ssh as "pbsdata"? And does the home directory for pbsdata exist?
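
The two checks above can be sketched as a small shell function to run on the compute node. This is only a sketch: the function name check_user is made up for illustration, and it assumes getent is available so that accounts served by NIS or LDAP are found as well as those in /etc/passwd.

```shell
#!/bin/sh
# Report whether a user has a passwd entry and a home directory on this host.
# Run it on the compute node (node01 in this thread).
# check_user is a hypothetical helper name, not part of PBS.
check_user() {
    user="$1"
    # getent consults /etc/passwd and any NIS/LDAP source listed in nsswitch.conf
    entry=$(getent passwd "$user") || { echo "no passwd entry for $user"; return 1; }
    # Field 6 of a passwd line is the home directory
    home=$(printf '%s\n' "$entry" | cut -d: -f6)
    [ -d "$home" ] || { echo "home directory $home is missing"; return 2; }
    echo "ok: $user -> $home"
}

# Example: check_user pbsdata
```

A return code of 1 corresponds to the "No Password Entry for User" message in the MoM log; a return code of 2 means the account exists but its home directory does not.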

cherryppju · March 4, 2020, 8:28pm

+1 @wgy @adarsh

By default the job runs as the job owner (pbsdata here), but you can specify the -u option to run it as a different user on the compute nodes.

   -u user_list

           List  of usernames.  Job is run under a username from this list.  Sets job's User_List attribute to
           user_list.  Default: job owner (username on submit host.)  Format of user_list:

                  user[@host][,user@host ...]

adarsh · March 4, 2020, 10:15pm

Please follow this link

~/.rhosts should be populated with the host(s) and other username(s) if userA wants to submit job(s) as other user(s).

zainul1114 · March 5, 2020, 4:40am

Thank you for your response, Adarsh.
The pbsdata user's home directory did not exist on node01. I have created the home directory and set up passwordless ssh between the pbsdata users on master and node01.

Now I am able to submit jobs to node01.

Finally, can you explain whether a PBS Pro cluster needs to have the same user on all the VMs in the cluster?
For example:
Master (PBS server / head node / login node)
Node01 (compute node)
Node02 (compute node)

Here, should all the nodes have the test user, with passwordless ssh between the test users?

Regards,
Zain

adarsh · March 5, 2020, 11:53am

In any cluster environment, the user needs seamless SSH access across the nodes of the cluster, with host key checking (StrictHostKeyChecking) disabled or the host keys already approved. Basically, no password should be asked for when ssh'ing from:

master / headnode to compute node(s)
compute node(s) to headnode
compute node(s) to compute node(s)
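
A minimal sketch for testing the three directions listed above: run it as the job user on each host in turn, passing the other hosts as arguments. The function name check_ssh is hypothetical; BatchMode=yes makes ssh fail instead of prompting for a password, so a prompt shows up as FAILED.

```shell
#!/bin/sh
# Try a non-interactive login from this host to each host given as an argument.
# check_ssh is a hypothetical helper name used for illustration.
check_ssh() {
    for host in "$@"; do
        # BatchMode=yes: never prompt for a password, fail instead
        # ConnectTimeout=5: give up after 5 seconds on unreachable hosts
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Example, run as pbsdata on master: check_ssh node01
# Example, run as pbsdata on node01: check_ssh master
```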

Usually, in a cluster environment the user home directory is shared (mounted) across all the compute nodes. Also, NIS, PBIS or similar services are often used to store client/server/user details.

zainul1114 · March 5, 2020, 1:26pm

Thanks for clarifying ssh across the nodes.

Please guide me on restricting the number of cores per user or group (or for all users) when they submit jobs.
For example:
Master - 24 cores
Node01 - 24 cores
Total - 48 cores
The IT group may use only 10 cores
The Bio group may use only 20 cores
The Chem group may use only 10 cores
How can I enforce these limits with their default queues?
And if I want to use 40 cores (top priority), how can I submit the job?

Please guide me on configuring these scenarios.

Regards,
Zain

adarsh · March 6, 2020, 9:02am

Please refer to this documentation:

and this section: 5.15.1.9.ii Examples of Setting Server and Queue Limits
This will cover all your use cases.

Are these (IT, Bio, Chem) Linux groups?
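
As a rough sketch of what such a limit could look like for the numbers asked about, assuming IT, Bio and Chem are Linux groups and that the server-level max_run_res.ncpus attribute with [g:group=N] entries is the right mechanism (please verify the exact syntax against the section referenced above). The helper name group_limit_directives is made up; it only prints a qmgr directive, which can then be reviewed and piped to qmgr by a PBS manager.

```shell
#!/bin/sh
# Print a qmgr directive that caps the ncpus used by running jobs per group.
# Each argument has the form group=cores.
# group_limit_directives is a hypothetical helper name.
group_limit_directives() {
    limits=""
    for pair in "$@"; do
        group=${pair%%=*}     # text before the first "="
        cores=${pair#*=}      # text after the first "="
        # Join entries with commas: [g:IT=10],[g:Bio=20],...
        limits="${limits:+$limits,}[g:$group=$cores]"
    done
    printf 'set server max_run_res.ncpus = "%s"\n' "$limits"
}

# Review the output first, then apply it as a PBS manager:
#   group_limit_directives IT=10 Bio=20 Chem=10 | qmgr
```

The same directive with "set queue <queue name>" in place of "set server" would apply the limit to a single queue rather than to the whole server.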

zainul1114 · March 6, 2020, 9:16am

Thanks for sharing the link and the section, Adarsh.
I will check and get back to you.

Regards,
Zain

vincent718 · February 11, 2021, 2:59pm

Hi @adarsh, I reinstalled our compute node and installed PBS on it.
Since then, all submitted jobs are held. Both qstat -answ and qstat -f say:
job held, too many failed attempts to run

I dug deeper; below are some lines from the logs:

02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;compute-0-1 cput= 0:00:00 mem=0kb
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;no active tasks
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;Obit sent
02/11/2021 15:26:02;0080;pbs_mom;Job;111852.master1.local;delete job request received
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job
02/11/2021 15:26:02;0028;pbs_mom;Job;111852.master1.local;No Password Entry for User = jkzabee
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;compute-0-1 cput= 0:00:00 mem=0kb
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;no active tasks
02/11/2021 15:26:02;0100;pbs_mom;Job;111852.master1.local;Obit sent
02/11/2021 15:26:02;0080;pbs_mom;Job;111852.master1.local;delete job request received
02/11/2021 15:26:02;0008;pbs_mom;Job;111852.master1.local;kill_job

Please advise. Thanks.

adarsh · February 11, 2021, 7:08pm

Please check the following on the compute node:

whether the user's home directory exists
whether you can log in as that user on that compute node without any issues (passwordless ssh working both ways: server to node, and node to server)
whether you can log in to that compute node as that user with a password

These hints should help resolve it. Thank you for sharing.

vincent718 · February 11, 2021, 9:57pm

Hi @adarsh.
When I ssh to the compute node, it asks for a password.

I also realized that the compute node has no records of the users. Can you advise how to update the compute node with the user accounts?

For the home directory we use a shared directory which is mounted on all nodes.

adarsh · February 12, 2021, 8:40am

Either manually create the users with the same password on the compute nodes and reference the mounted home directory in the /etc/passwd file (after doing this, make sure you can log in and land in the home directory at the prompt),
or use NIS, PBIS or LDAP.
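
For the manual route, one possible sketch is to turn the user's passwd line from the master into a matching useradd command for the compute node. Keeping the same UID and GID is an assumption added here, not something stated above, but with a shared home directory it is what makes file ownership line up. The helper name useradd_from_passwd is hypothetical, and the command it prints assumes the primary group already exists on the node.

```shell
#!/bin/sh
# Turn one passwd(5) line (as printed by `getent passwd USER` on the master)
# into a useradd command that recreates the account on a compute node with
# the same UID, primary GID, home directory and login shell.
# -M skips home directory creation because the home is a shared mount.
# useradd_from_passwd is a hypothetical helper name.
useradd_from_passwd() {
    printf '%s\n' "$1" | awk -F: '{
        printf "useradd -u %s -g %s -d %s -s %s -M %s\n", $3, $4, $6, $7, $1
    }'
}

# Example (on the master), printing the command to run as root on the node:
#   useradd_from_passwd "$(getent passwd jkzabee)"
```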

vincent718 · February 12, 2021, 9:56am

Hi @adarsh.
I have created the user account on both the login and compute nodes. Now I can ssh to the compute node without issues, but jobs still end up in the held state. A job first gets queued, and qstat -f says: Not Running: Insufficient amount of resource: host

adarsh wrote:
Using NIS or PBIS or LDAP

We are not using any of those three; we create users with the conventional Linux useradd.

adarsh · February 12, 2021, 4:41pm

Please share the output of qstat -fx <job id> for that job and of pbsnodes -av.
The job request could not be satisfied by any of the available nodes, hence the scheduler message.

vincent718 · February 12, 2021, 10:25pm

Hello Adarsh,

Thanks for the help. I was able to finally get PBS to run the jobs.
What I found out was that the user accounts did not exist on the compute nodes. Manually creating the same user on the compute nodes did not work. What worked was copying /etc/passwd from the master node to all the compute nodes. I also reinstalled PBS on the compute nodes. Thanks once again.

watzinki · March 14, 2022, 7:50pm

Dear @vincent718, I am having a similar issue. So you copied the passwd file from the head node to the slave nodes, replacing the passwd files on the slave nodes?

vincent718 · March 18, 2022, 9:42am

Yes. That is what I did. You should also copy the group file.
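
Before overwriting anything, it may help to see which accounts actually differ. The sketch below lists usernames that are present in the master's passwd file but absent from a copy of the node's passwd file. The helper name missing_users and the file paths in the example are hypothetical.

```shell
#!/bin/sh
# Print usernames that appear in the first passwd file (master)
# but not in the second one (compute node).
# missing_users is a hypothetical helper name.
missing_users() {
    # awk reads the node file first and remembers its usernames,
    # then prints every username from the master file it has not seen.
    awk -F: 'FILENAME == ARGV[1] { seen[$1] = 1; next }
             !($1 in seen)       { print $1 }' "$2" "$1"
}

# Example, after fetching the node's file to a scratch location:
#   scp node01:/etc/passwd /tmp/passwd.node01
#   missing_users /etc/passwd /tmp/passwd.node01
```

The same function works on /etc/group, since the group name is also the first colon-separated field.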
