Is it possible to run control commands with root@compute_node?

I just installed a fresh copy of PBS Pro 14.1.2. I want to use LBNL nhc, which runs on the compute nodes, checks system sanity, and runs pbsnodes -o XXX to mark a node offline if a check fails. However, I get “Error marking node node2 - Unauthorized Request” and “Reject reply code=15007, aux=0, type=9, from root@node2.localdomain” when running pbsnodes -o node2 as root@node2. I’ve tried adding node2 or node2.localdomain to /etc/hosts.equiv, as well as setting up password-less ssh access from node2 to the main node, but neither helps. Any ideas?

Could you please set the following and retry:
qmgr -c "set server managers+=root@*" # use root@* with caution
qmgr -c "set server flatuid=true"

I see. Is there any way to limit the access to specified host groups (other than setting root@each_node as a manager)?
As I understand it, a setting like this means any individual Linux user can operate on the PBS server with a root identity that is actually their own. A firewall won’t help, because a user can always use port forwarding to bypass it.

In Administrator Book 14.2, Section 7.3.13.1, I read this:

The value of flatuid also affects whether .rhosts and host.equiv are checked. If flatuid is True, .rhosts and host.equiv are not queried, and for any users at host2, only UserA is treated as UserA@host1. If flatuid is False, .rhosts and host.equiv are queried.

Any chance to achieve this by setting host.equiv?

It is better to add each of the compute nodes as a manager:
for i in $(pbsnodes -av | grep '^[a-zA-Z]'); do qmgr -c "set server managers+=root@$i.localdomain"; done

If you do not want to set flatuid to true, then .rhosts and hosts.equiv should work.

You can set the flatuid to true and check the server attributes acl_host_enable and acl_hosts

If you do not want to set flatuid to true, then .rhosts and hosts.equiv should work.

I have written node2 and node2.localdomain into hosts.equiv, but I still get Error marking node node2 - Unauthorized Request.

You can set the flatuid to true and check the server attributes acl_host_enable and acl_hosts

Since I have around 80 compute nodes to manage, I would prefer not to add each node to acl_hosts. Rather, I’m looking for some ‘external’ list, such as the hosts.equiv file, to which I can add all the nodes.

  1. When creating nodes on the PBS server, use the name that the ‘hostname’ command returns on the respective compute node:

qmgr -c "create node node1"
or
qmgr -c "create node node1 Mom=node1.localdomain"

  2. Are there multiple network adapters on the head node or the compute nodes?

  3. Use pbs_hostn -v
    At the server, run the pbs_hostn command with the name of each host (compute node) in the complex. It should complain if hostname resolution is not working correctly. See the PBS Pro admin guide, section 2.16, pbs_hostn.

  4. Please share your hosts.equiv file

main:~ # pbs_hostn -v node0

primary name: node0 (from gethostbyname())
aliases:           main
aliases:           node-mgmt
     address length:  4 bytes
     address:          10.10.10.85   (1426721290 dec)  name:  node0

main:~ # pbs_hostn -v node1

primary name: node1.localdomain (from gethostbyname())
aliases:           node1
aliases:           node1-eth0.localdomain
aliases:           node1-eth0
     address length:  4 bytes
     address:           10.10.10.1   (17435146 dec)  name:  node1.localdomain

main:~ # pbs_hostn -v node2

primary name: node2.localdomain (from gethostbyname())
aliases:           node2
aliases:           node2-eth0.localdomain
aliases:           node2-eth0
     address length:  4 bytes
     address:           10.10.10.2   (34212362 dec)  name:  node2.localdomain

main:~ # cat /etc/hosts.equiv

node2.localdomain
node2

main:~ # ip -4 a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq portid 0894ef5e946c state UP group default qlen 1000
    inet 10.10.10.85/16 brd 10.10.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.0.254/24 brd 10.0.0.255 scope global eth0
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq portid 000000000317 state UP group default qlen 1000
    inet XXX.XXX.XXX.XXX/23 brd XXX.XXX.XXX.XXX scope global eth1
       valid_lft forever preferred_lft forever
4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 100
    inet 10.8.0.5/24 brd 10.8.0.255 scope global tap0
       valid_lft forever preferred_lft forever

(eth1 is external interface, and tap0 is VPN interface)

main:~ # qmgr -c "p n node2"

#
# Create nodes and set their properties.
#
#
# Create and define node node2
#
create node node2 Mom=node2.localdomain
set node node2 state = free
set node node2 resources_available.arch = linux
set node node2 resources_available.host = node2
set node node2 resources_available.mem = 97607848kb
set node node2 resources_available.ncpus = 24
set node node2 resources_available.vnode = node2
set node node2 resv_enable = True
set node node2 sharing = default_shared

main:~ # grep node2 /etc/hosts

10.10.10.2              node2.localdomain node2 node2-eth0.localdomain node2-eth0

Thank you for this information.

  • try adding the aliases to the hosts.equiv file
    – please check whether the node name resolves and reverse-resolves to the same name.
  • try opening up permissions to all in the hosts.equiv file, then slowly restrict them once you get it working
    Reference: http://man7.org/linux/man-pages/man5/hosts.equiv.5.html
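
One way to check the forward/reverse resolution point above is to compare the canonical name from a forward lookup with the one from a reverse lookup on the resulting address. This is only a minimal sketch, assuming getent is available on the host. The helper names (address_of, canonical_of, check_roundtrip) and the LOOKUP variable are made up for illustration and are not part of PBS Pro.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of PBS Pro.
# LOOKUP defaults to "getent hosts" and can be overridden (e.g. for testing).
LOOKUP=${LOOKUP:-getent hosts}

# Print the address (first field) of the first lookup result.
address_of() {
    $LOOKUP "$1" | awk 'NR == 1 { print $1 }'
}

# Print the canonical name (second field) of the first lookup result.
canonical_of() {
    $LOOKUP "$1" | awk 'NR == 1 { print $2 }'
}

# Succeed only if name -> address -> name gives the same canonical name.
check_roundtrip() {
    addr=$(address_of "$1")
    fwd=$(canonical_of "$1")
    [ -n "$addr" ] || { echo "$1: does not resolve"; return 1; }
    rev=$(canonical_of "$addr")
    if [ "$fwd" = "$rev" ]; then
        echo "$1: ok ($fwd, $addr)"
    else
        echo "$1: forward=$fwd reverse=$rev"
        return 1
    fi
}
```

Running check_roundtrip node22 on both the head node and the compute node should show whether the two hosts agree on the canonical name.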

Thank you for your time. Here’s the result:
2. hosts.equiv is already set to 0644.
1. Set hosts.equiv to the following content (I’ve changed target from node2 to node22)

node22.localdomain
node22
node22-eth0.localdomain
node22-eth0

Then, as root, I ran ssh node22 pbsnodes -o node22, which returned: Error marking node node22 - Unauthorized Request

Thank you for your patience

  • could you please check that /etc/hosts is fully populated with all the aliases and is the same on the head node and across all the compute nodes?

  • can we first try to get it working with security opened up, to confirm that it works with flatuid set to true?

  • if it works with flatuid set to true, we will then unset this server attribute

  • set hosts.equiv to the content below:
    +

  • could you please let us know whether pbsnodes -o node22 works when run from the server?
    Also, please try qmgr -c "set node node22 state = offline " # this is better practice than using pbsnodes -o

  • in the worst case, would it be possible to sanitise /etc/hosts so that it contains only the canonical names, and try again?
    Please check $PBS_HOME/comm_logs for any issues.

I can reproduce your scenario. I can successfully execute the commands below if and only if I set server managers+=root@* or add root@FQDN for each of the nodes.

  1. pbsnodes -o
  2. qmgr -c "set node state=offline"

Even with hosts.equiv populated, I could not get it to work.

Just to be sure: do you mean you can’t succeed even with managers+=root@ and flatuid=false and hosts.equiv populated with the proper hostnames?

Shall I wait or submit a bug report? Or try something more?

It works when managers+=root@ is set and when flatuid=false (or unset)
It works when managers+=root@ is set and when flatuid=true
It does not work when managers+= is not set, whether flatuid is set to true or false

Note:
The server’s flatuid attribute affects both when users can operate on jobs and whether users without accounts on the server host can submit jobs.

So it seems there is no way to keep security while allowing compute nodes to run qmgr set?

Hello, it was mentioned toward the beginning of the thread that you must set the managers attribute to explicitly list each allowed account@host. This is an entirely separate mechanism from flatuid, which uses ruserok() (which consults hosts.equiv/.rhosts) to handle JOB authorization/submission/control capabilities. If you want accounts other than the root account on the server host to be able to control node/queue/etc. information, they must be added as managers (or operators); there is no other way to do it.
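
For a complex with many nodes, the per-node manager entries can be generated from a plain list of node FQDNs instead of being typed by hand. The sketch below is only an illustration. The function name manager_cmds and the file name nodes.txt are hypothetical, and it prints the qmgr commands rather than running them so that they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: read one node FQDN per line on stdin and print the
# qmgr command that would add root@<fqdn> to the server managers list.
# Blank lines and lines starting with '#' are skipped.
manager_cmds() {
    while IFS= read -r host; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        printf 'qmgr -c "set server managers+=root@%s"\n' "$host"
    done
}

# Example (nodes.txt is an assumed file name):
#   manager_cmds < nodes.txt          # review the generated commands
#   manager_cmds < nodes.txt | sh     # then apply them on the server host
```

This keeps the list of allowed hosts in an external file while leaving flatuid at its default value.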

@runapp, is adding all of the nodes to the managers list and leaving flatuid as the default value false (secure, using ruserok() calls) not working for you?