SSSD integration with PBS Pro

Hi guys,
I've installed the SSSD service to authenticate against a Windows AD server for user account management. It allows me to create an HPC group and put HPC users in it, and I can ssh to the headnode; a /home/user@domain folder is created as the user's home directory. But when I switch to the AD user account, it won't let me run jobs. It would be great if anyone could give me some help. Thanks. Below is the error information.

qsub: Bad UID for job execution
qsub: Bad UID for job execution
qsub: Bad UID for job execution
qsub: Bad UID for job execution
qsub: Bad UID for job execution
qsub: Bad UID for job execution
qsub: Bad UID for job execution
qsub: Bad UID for job execution
qsub: Bad UID for job execution
qsub: Bad UID for job execution

I am running 10 jobs at the same time as a test. My job script is:
#!/bin/bash

#PBS -l nodes=1:ppn=4,walltime=600

# run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"

for ((i=100000; i<200000; i++)); do
    echo "$i" >> test.txt
done

exit 0
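
I submit it ten times in a row, roughly like this (the script name here is just an example):

for i in $(seq 1 10); do
    qsub test.sh
done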

Hi Joey,
You need to add users to acl_roots list to allow users to submit jobs.

To add a user to acl_roots, you can use the following command:
qmgr -c "set server acl_roots+=username"
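
If it helps, you can double-check the current value afterwards in the printed server configuration:

qmgr -c "print server" | grep acl_roots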

Regards
Dilip


Hmm, when I was getting this error:

qsub: Bad UID for job execution

It was because I was starting the job as root on the server. Try to do it as a non-root user, and remember that users on the execution nodes have to have the same name as the user on the server that submits the jobs.
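
A quick way to check is something like this (hostnames and the username are placeholders for your own server, nodes, and test account):

for host in headnode node01 node02 node03; do
    ssh "$host" id testuser
done

The username (and ideally the UID) should come back the same on every host.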

Cheers

See section 14.7.4 of the PBS Pro Administrator’s Guide located here: http://www.pbsworks.com/SupportGT.aspx?d=PBS-Professional,-Documentation

Hi dilip-krishnan,
Thanks for your reply. I tried to add the test@domain.local account with the qmgr commands qmgr -c "set server acl_roots+=test" and qmgr -c "set server acl_roots+=test@domain.local" on the headnode, then switched to the test user to run the job. Unfortunately it still says: qsub: Bad UID for job execution. Any idea what might be wrong? Thanks

Hi Joey,
If your PBS server is installed on a different machine than the one you are submitting jobs from, the same user should exist on all the MoM nodes and the server node. Also, please set on the server:
qmgr -c "set server flatuid=true"
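
To confirm the change took effect:

qmgr -c "print server" | grep flatuid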

Regards
Dilip

Hi dilip-krishnan,
Thanks for your reply. We have four servers in total: one headnode, which we submit jobs from, and three compute nodes. I am trying to use the System Security Services Daemon (SSSD) to authenticate users against our AD so that they can ssh to the headnode with AD credentials instead of us maintaining a local /etc/passwd. Authentication works fine on the headnode and the compute nodes; I can ssh to either with AD credentials. But when I submit a job with an AD account, i.e. test@domain.local, it complains: qsub: Bad UID for job execution.
I tried to add the test@domain.local account with both qmgr -c "set server acl_roots+=test" and qmgr -c "set server acl_roots+=test@domain.local" and still get the same error. I also tried qmgr -c "set server flatuid=true"; this time it is a different error: qsub: Unauthorized Request. Please help. Thanks

Hi Joey,
Please share an excerpt from the server log for the unauthorized request. If the logs don't have much information, please increase the log level and repeat the scenario.
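
For example, log_events = 2047 turns on all event classes (remember to set it back to its previous value afterwards):

qmgr -c "set server log_events = 2047"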

Also, not relevant to the question, but are you able to ssh from the headnode to a compute node using the AD user account, i.e. first log in to the headnode with the AD user account and then ssh to a compute node, and vice versa?

Regards
Dilip

Hi Dilip,
Sorry for the late reply; I was stuck on another project. Below is the server_log after I ran the job with the AD account again. Basically, all the jobs were rejected by the server. I checked the reference for code=15023 (Missing userID, username, or GID). I am not sure if PBS supports user accounts authenticated against AD through the SSSD service. BTW, the AD account can do passwordless ssh between the headnode and the compute nodes.

07/27/2016 09:05:00;0040;Server@cciavmlhpct1;Svr;cciavmlhpct1;Scheduler sent command 3
07/27/2016 09:05:00;0040;Server@cciavmlhpct1;Svr;cciavmlhpct1;Scheduler sent command 0
07/27/2016 09:05:00;0100;Server@cciavmlhpct1;Req;;Type 21 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:05:00;0100;Server@cciavmlhpct1;Req;;Type 81 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:05:00;0100;Server@cciavmlhpct1;Req;;Type 71 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:05:00;0100;Server@cciavmlhpct1;Req;;Type 58 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:05:00;0100;Server@cciavmlhpct1;Req;;Type 20 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:05:00;0100;Server@cciavmlhpct1;Req;;Type 51 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:15:00;0040;Server@cciavmlhpct1;Svr;cciavmlhpct1;Scheduler sent command 3
07/27/2016 09:15:00;0040;Server@cciavmlhpct1;Svr;cciavmlhpct1;Scheduler sent command 0
07/27/2016 09:15:00;0100;Server@cciavmlhpct1;Req;;Type 21 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:15:00;0100;Server@cciavmlhpct1;Req;;Type 81 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:15:00;0100;Server@cciavmlhpct1;Req;;Type 71 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:15:00;0100;Server@cciavmlhpct1;Req;;Type 58 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:15:00;0100;Server@cciavmlhpct1;Req;;Type 20 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:15:00;0100;Server@cciavmlhpct1;Req;;Type 51 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:25:00;0040;Server@cciavmlhpct1;Svr;cciavmlhpct1;Scheduler sent command 3
07/27/2016 09:25:00;0040;Server@cciavmlhpct1;Svr;cciavmlhpct1;Scheduler sent command 0
07/27/2016 09:25:00;0100;Server@cciavmlhpct1;Req;;Type 21 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:25:00;0100;Server@cciavmlhpct1;Req;;Type 81 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:25:00;0100;Server@cciavmlhpct1;Req;;Type 71 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:25:00;0100;Server@cciavmlhpct1;Req;;Type 58 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:25:00;0100;Server@cciavmlhpct1;Req;;Type 20 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:25:00;0100;Server@cciavmlhpct1;Req;;Type 51 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:35:00;0040;Server@cciavmlhpct1;Svr;cciavmlhpct1;Scheduler sent command 3
07/27/2016 09:35:00;0040;Server@cciavmlhpct1;Svr;cciavmlhpct1;Scheduler sent command 0
07/27/2016 09:35:00;0100;Server@cciavmlhpct1;Req;;Type 21 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:35:01;0100;Server@cciavmlhpct1;Req;;Type 81 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:35:01;0100;Server@cciavmlhpct1;Req;;Type 71 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:35:01;0100;Server@cciavmlhpct1;Req;;Type 58 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:35:01;0100;Server@cciavmlhpct1;Req;;Type 20 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:35:01;0100;Server@cciavmlhpct1;Req;;Type 51 request received from Scheduler@cciavmlhpct1, sock=16
07/27/2016 09:44:32;0100;Server@cciavmlhpct1;Req;;Type 0 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:32;0100;Server@cciavmlhpct1;Req;;Type 49 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:32;0100;Server@cciavmlhpct1;Req;;Type 21 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:32;0100;Server@cciavmlhpct1;Req;;Type 1 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0080;Server@cciavmlhpct1;Req;req_reject;Reject reply code=15023, aux=0, type=1, from jcao@cciamr.local@cciavmlhpct1
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 0 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 49 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 21 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 1 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0080;Server@cciavmlhpct1;Req;req_reject;Reject reply code=15023, aux=0, type=1, from jcao@cciamr.local@cciavmlhpct1
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 0 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 49 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 21 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 1 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0080;Server@cciavmlhpct1;Req;req_reject;Reject reply code=15023, aux=0, type=1, from jcao@cciamr.local@cciavmlhpct1
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 0 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 49 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 21 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 1 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0080;Server@cciavmlhpct1;Req;req_reject;Reject reply code=15023, aux=0, type=1, from jcao@cciamr.local@cciavmlhpct1
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 0 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 49 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 21 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 1 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0080;Server@cciavmlhpct1;Req;req_reject;Reject reply code=15023, aux=0, type=1, from jcao@cciamr.local@cciavmlhpct1
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 0 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 49 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 21 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 1 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0080;Server@cciavmlhpct1;Req;req_reject;Reject reply code=15023, aux=0, type=1, from jcao@cciamr.local@cciavmlhpct1
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 0 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 49 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 21 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 1 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0080;Server@cciavmlhpct1;Req;req_reject;Reject reply code=15023, aux=0, type=1, from jcao@cciamr.local@cciavmlhpct1
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 0 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 49 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 21 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 1 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0080;Server@cciavmlhpct1;Req;req_reject;Reject reply code=15023, aux=0, type=1, from jcao@cciamr.local@cciavmlhpct1
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 0 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 49 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 21 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 1 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0080;Server@cciavmlhpct1;Req;req_reject;Reject reply code=15023, aux=0, type=1, from jcao@cciamr.local@cciavmlhpct1
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 0 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 49 request received from jcao@cciamr.local@cciavmlhpct1, sock=19
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 21 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0100;Server@cciavmlhpct1;Req;;Type 1 request received from jcao@cciamr.local@cciavmlhpct1, sock=16
07/27/2016 09:44:33;0080;Server@cciavmlhpct1;Req;req_reject;Reject reply code=15023, aux=0, type=1, from jcao@cciamr.local@cciavmlhpct1
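
For what it's worth, the rejection lines can be filtered out of the day's server log with a simple grep (the path assumes the default PBS_HOME of /var/spool/pbs):

grep "code=15023" /var/spool/pbs/server_logs/20160727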

Hi Joey,
You are right that PBS doesn't currently support SSSD. Looking at the log, the user submitting the job is also not in the correct format: instead of user@hostname, the PBS server is getting user@hostname@something. Internally, PBS treats everything after the @ as the hostname. This could be one of the reasons why PBS is not able to authorize the qsub request.
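
If you want to confirm which name form SSSD is handing out, something like this on the headnode, run as the AD user, should show it:

id -un
getent passwd jcao@cciamr.local

If these print user@domain, the owner string PBS builds becomes user@domain@submithost, which matches the log above.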

We are using PBS Pro (version 13.1) with SSSD and it works for us. This looks more like some kind of misconfiguration; it could be a username or hostname with an @ in it.

Taco

Hi Taco,
This is how we configured SSSD. Please advise where it went wrong. Many thanks.
First, install the realmd package:

1. yum install -y realmd

2. realm discover cciamr.local

cciamr.local
type: kerberos
realm-name: CCIAMR.LOCAL
domain-name: cciamr.local
configured: no
server-software: active-directory
client-software: sssd
required-package: oddjob
required-package: oddjob-mkhomedir
required-package: sssd
required-package: adcli
required-package: samba-common
3. yum install -y oddjob oddjob-mkhomedir sssd adcli samba-common
4. realm join --user=jcao.da cciamr.local
5. Add the default domain suffix to the SSSD configuration file:
vim /etc/sssd/sssd.conf
domains = cciamr.local
config_file_version = 2
services = nss, pam
#default_domain_suffix = cciamr.local
[domain/cciamr.local]
ad_domain = cciamr.local
krb5_realm = CCIAMR.LOCAL
realmd_tags = manages-system joined-with-samba
cache_credentials = True
id_provider = ad
krb5_store_password_if_offline = True
default_shell = /bin/bash
ldap_id_mapping = True
use_fully_qualified_names = True
fallback_homedir = /home/%u@%d
access_provider = simple
6. service sssd restart
7. realm permit -g hpc@cciamr.local
8. realm discover cciamr.local
cciamr.local
type: kerberos
realm-name: CCIAMR.LOCAL
domain-name: cciamr.local
configured: kerberos-member
server-software: active-directory
client-software: sssd
required-package: oddjob
required-package: oddjob-mkhomedir
required-package: sssd
required-package: adcli
required-package: samba-common
login-formats: %U@cciamr.local
login-policy: allow-permitted-logins
permitted-logins:
permitted-groups: hpc@cciamr.local

BTW, I can ssh to the headnode with an AD user account in the hpc group, and from the headnode to a compute node too.

Hi Joey,
Not sure, but I guess setting login-formats to '%U' and use_fully_qualified_names = False could work. But that is just a rough guess.

Regards
Dilip

We don't use AD as the backend but LDAP; that should not make much difference.
Dilip's suggestions seem like something you should try.

Also, the fallback_homedir is odd for Unix-style home directories, but that is another guess. And I see %U and %u both used; not sure if that matters, but it is good to look into.

Another question is: what do you use to log in, user@domain or just user?

Let us know how this works out,

Taco

Hi guys,
Thanks for your replies. Before any change to sssd.conf, I sshed to the server with username instead of username@domain and got /home/user@domain as the home directory. Now I have removed the default_domain_suffix = cciamr.local line, set login-formats to '%u', and set use_fully_qualified_names = False in sssd.conf, then restarted the service. I ssh with the plain username and get a /home/user directory. Now the job is working. Thanks for your help, guys :blush:

Good that you have things working now; this one was kind of complicated to figure out.
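
For future readers, the relevant sssd.conf lines probably ended up roughly like this (pieced together from the thread, not a verbatim config; the fallback_homedir change is an assumption based on the reported /home/user directory):

[sssd]
domains = cciamr.local
config_file_version = 2
services = nss, pam
# default_domain_suffix removed

[domain/cciamr.local]
...
use_fully_qualified_names = False
fallback_homedir = /home/%u   # assumed, to match the /home/user result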

Taco

Hello Guys,
I have a similar problem, only in my case the job status says 'H'.

Following the information above, I created a test queue, but the job status still says 'H'. Below is my queue configuration:

qmgr -c 'p q test'
#
# Create queues and set their attributes.
#
#
# Create and define queue test
#
create queue test
set queue test queue_type = Execution
set queue test acl_user_enable = True
set queue test acl_users = dummyuser
set queue test acl_users += users
set queue test acl_group_enable = True
set queue test acl_groups = domain
set queue test acl_groups += users
set queue test enabled = True
set queue test started = True

By the way, our cluster uses ROCKS 7. Could that also be a factor?

Hello @vincent718,

Please share the output of "qstat -f [jobid]" and "tracejob [jobid]" for one of the held jobs.

Thanks!

Tracejob
tracejob: Couldn't find Job Id 109871.master1.local in logs of past 1 day

qstat -f [jobid]
Job Id: 109871.master1.local
Job_Name = test.sh
Job_Owner = ztest@login.local
job_state = H
queue = workq
server = master1.local
Checkpoint = u
ctime = Wed Feb 26 09:29:07 2020
Error_Path = login.local:/home/ztest/error.txt
Hold_Types = s
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Feb 26 09:29:13 2020
Output_Path = login.local:/home/ztest/out.txt
Priority = 0
qtime = Wed Feb 26 09:29:07 2020
Rerunable = True
Resource_List.mem = 2097152kb
Resource_List.mpiprocs = 2
Resource_List.ncpus = 2
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=2
Resource_List.place = scatter
Resource_List.select = 1:ncpus=2:mem=2097152KB:mpiprocs=2
stime = Wed Feb 26 09:29:13 2020
substate = 20
Variable_List = PBS_O_HOME=/home/ztest,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=ztest,
PBS_O_PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibut
ils/bin:/opt/pbs/bin:/opt/apps/htop/2.0.2:/home/ztest/.local/bin:/home/
ztest/bin,PBS_O_MAIL=/var/spool/mail/ztest,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/home/ztest,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=login.local
comment = job held, too many failed attempts to run
run_count = 21
Exit_status = -10
Submit_arguments = test.sh
project = _pbs_project_default

The tracejob output would be helpful for diagnosing this. From the qstat output above: the scheduler found resources for the job, the server sent the job to the MoM for execution, and the MoM refused to run it. After 20 failed attempts, a hold is placed on the job. Please take a look at the MoM logs to figure out why it is refusing to run the job.
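
For example, something along these lines (job id taken from the output above; paths assume the default PBS_HOME of /var/spool/pbs):

# widen tracejob's search window beyond the default 1 day
tracejob -n 3 109871

# on the execution node, check that day's MoM log for the job
grep 109871 /var/spool/pbs/mom_logs/20200226

Once the underlying cause is fixed, the system hold can be released with qrls (this usually requires manager privilege):

qrls -h s 109871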