I am new at my company and recently tried to make some changes to our PBS Pro V12.0.1 (commercial) installation.
We have an SGI cluster to run Fluent and CFD++ and, as I was told, PBS recently stopped working.
Every time I submitted a new job to any queue (we have workq, cfd and fluent) we got the answer:
qsub: Bad UID for job execution.
At this time, qmgr and qstat were working.
I tried to solve this by changing some settings in qmgr, using set server acl_host, set server acl_user and so on. The problem did not change, but we could still use qstat and qmgr.
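Since the changes described here were made directly in qmgr, it can help to save the server configuration before touching any ACL attribute. The following is a minimal sketch, not a procedure taken from this thread: the function names and the snapshot path are hypothetical, and it assumes qmgr is on the PATH and can still reach the server.

```shell
#!/bin/sh
# Sketch only: save the current server and queue definitions before
# changing ACL attributes. Function names are hypothetical.

# snapshot_pbs_config FILE: write the output of "print server" to FILE.
snapshot_pbs_config() {
    out="$1"
    if ! command -v qmgr >/dev/null 2>&1; then
        echo "qmgr not found in PATH" >&2
        return 1
    fi
    # "print server" emits the qmgr commands that describe the configuration.
    qmgr -c "print server" > "$out"
}

# list_acl_settings FILE: show only the ACL-related "set server" lines
# from a previously saved snapshot.
list_acl_settings() {
    grep -E '^set server acl_' "$1"
}
```

A possible use would be `snapshot_pbs_config /tmp/pbs_server.qmgr && list_acl_settings /tmp/pbs_server.qmgr` (the path is an assumption). With such a snapshot there is at least a record of which attributes were set before each change.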
Then we tried set server acl_hosts_enable in qmgr, and after entering it we lost our connection to PBS.
That is, I can still use pbs_probe and pbs_mom, and the server is running, but we can no longer use qstat or qmgr. We get this output:
$: qstat
pbs_iff: error returned: 15031
No Permission.
qstat: cannot connect to server host (errno+15007)
$: qmgr
pbs_iff: error returned: 15031
No Permission.
qstat: cannot connect to server host (errno+15007)
I even tried reconnecting to the server, without success:
I suspect that PBS_EXEC/sbin/pbs_iff is no longer SUID or the network filesystem on which the file lives is mounted in such a way that it prevents SUID programs from obtaining root access. There are two binaries in sbin that should be SUID… pbs_iff and pbs_rcp.
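The two checks suggested above (SUID bit present, filesystem not mounted nosuid) could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, PBS_EXEC is assumed to be set in the environment (for example by sourcing /etc/pbs.conf), and the mount check assumes a Linux system with findmnt available.

```shell
#!/bin/sh
# has_suid FILE: succeed if FILE exists and has the set-user-ID bit.
has_suid() {
    [ -u "$1" ]
}

# has_nosuid OPTIONS: succeed if a comma-separated mount option string
# contains the nosuid option.
has_nosuid() {
    case ",$1," in
        *,nosuid,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Run the checks only when PBS_EXEC is set (assumption: it comes from pbs.conf).
if [ -n "${PBS_EXEC:-}" ]; then
    for f in pbs_iff pbs_rcp; do
        if has_suid "$PBS_EXEC/sbin/$f"; then
            echo "$f: SUID bit set"
        else
            echo "$f: SUID bit NOT set (or file missing)"
        fi
    done
    if command -v findmnt >/dev/null 2>&1; then
        opts=$(findmnt -n -o OPTIONS --target "$PBS_EXEC/sbin")
        if has_nosuid "$opts"; then
            echo "warning: $PBS_EXEC/sbin is on a filesystem mounted nosuid"
        fi
    fi
fi
```

Testing the bit with `[ -u ... ]` avoids reading the permission string by eye, which is easy to get wrong.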
If that doesn’t help, you may want to look up the flatuid setting for the PBS server in the admin guide.
Since you’re running commercial PBS, please feel free to contact Altair customer support if these suggestions don’t resolve your issue.
Yes, I checked whether pbs_iff is SUID using pbs_probe, and no problems concerning the PBS infrastructure were found. Both pbs_iff and pbs_rcp are -rwxr-xr-x (octal 4755).
I contacted them and they have no idea what is going on. I am close to reinstalling it from scratch, so that at least I know how it is configured.
Anyhow, do you know the name of the file that qmgr accesses? I mean, where are those directives saved? I thought that if I could delete the last line I entered, it might solve the problem.
Yes, sorry for the typo, and the octal is 4755 for both.
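For reference, a permission string of -rwxr-xr-x corresponds to octal 755, while 4755 is displayed as -rwsr-xr-x (the s in the owner field is the SUID bit). Printing both forms side by side removes the ambiguity. A small sketch, assuming GNU coreutils stat (the `-c` format option is not portable to every Unix) and using a hypothetical helper name:

```shell
#!/bin/sh
# show_mode FILE: print the symbolic and octal permissions of FILE.
# Relies on GNU stat's -c option.
show_mode() {
    stat -c '%A %a' "$1"
}

# Demonstration on a scratch file rather than on the real PBS binaries.
demo=$(mktemp)
chmod 0755 "$demo"
show_mode "$demo"    # prints: -rwxr-xr-x 755
chmod 4755 "$demo"
show_mode "$demo"    # prints: -rwsr-xr-x 4755
rm -f "$demo"
```

On the cluster the same helper could be pointed at `$PBS_EXEC/sbin/pbs_iff` and `$PBS_EXEC/sbin/pbs_rcp`.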
I just tried to check the flatuid, and the only way I can change it is through PBS_HOME/mom_priv/config.
There we had the line $restrict_user_maxsysid 499
I deleted this line and restarted PBS. The problem persists.
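When removing a line from a configuration file like this, keeping a backup makes the change easy to undo if it turns out not to be the cause. A minimal sketch of that idea, with a hypothetical helper name; it is not a PBS tool, just a generic shell edit:

```shell
#!/bin/sh
# remove_config_line FILE TEXT: copy FILE to FILE.bak, then rewrite FILE
# without the lines that contain TEXT (matched as a fixed string, so the
# leading "$" of a directive is not treated as a regex anchor).
remove_config_line() {
    file="$1"
    text="$2"
    cp -p "$file" "$file.bak" || return 1
    # grep exits 1 when it outputs no lines; that is not an error here.
    grep -vF -- "$text" "$file.bak" > "$file" || [ $? -eq 1 ]
}
```

For the edit described in the post this might be called as `remove_config_line "$PBS_HOME/mom_priv/config" '$restrict_user_maxsysid'` (assuming PBS_HOME is set in the environment), followed by the PBS restart. Copying `config.bak` back restores the original file.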
Thanks for all your help. I have no idea what type of installation was made; I will check that.
The thing is, although we already could not submit cases because of the Bad UID error, it was only after entering acl_hosts_enable=True in qmgr that it stopped working.
So, although I am a dummy here, I don’t believe the installation or the firewall is the problem. The problem, in my limited understanding, is that setting (acl_hosts_enable) and how to recover from it.
Do you know how I can access qmgr without going through the terminal? I mean, is there a file where qmgr saves its directives, which I could change back?
Hello @Alexandre. There are no dummies here, just those with less experience than others. There is no supported way to avoid the qmgr command line. Being a commercial PBS Pro user, I highly suggest you contact your customer service representative to get you back online as quickly as possible. It’s their job to support our commercial customers. Otherwise, we’ll just keep trying things until it eventually works. I suspect that’s not the most efficient path. Thanks!
Thanks a lot. I’ve contacted support and sent the log files from the server, mom and sched folders. I also went through them to look for the changes, but found nothing that I could reverse.
Anyhow, thanks for your kind help. As soon as I receive support’s reply and fix the issue, I will post an update here to share how to solve this problem.