Sharing default_excl

Hello, I want to have a set of nodes with sharing=default_excl attribute.
It looks like it is not possible to simply “create” the node with that attribute (I got the error
“Cannot set attribute, read only or insufficient permission sharing”).
Following the thread “MOM sharing config”, I created a file
/var/spool/pbs/mom_priv/config.d/exclconf
with the following contents (this is on node hpc07.fpg.local)
$configversion 2
hpc07.fpg.local: sharing=force_exclhost

Then I tried to create a node with the command (within qmgr)
create node hpc07.fpg.local
It works, but the sharing is still “default_shared”.
PBS is version 20.0.1
Should I delete the config file in /var/spool/pbs/mom_priv?
Any other solution to set sharing=default_excl?
As a related question: is there a way to create a queue whose “default” behaviour is to have the nodes in
exclusive mode?
Thanks in advance!
Massimo

I modified the config2 file as follows
$configversion 2
hpc07.fpg.local: ntype=PBS
hpc07[0].fpg.local: state = free
hpc07[0].fpg.local: sharing = default_shared

I stopped the pbs_mom daemon and restarted it, but I still get the error
pbs_mom: System error: (15010) in new_vnid, add_mom_data hpc07[0].fpg.local

I understand that default_shared is the “default” and, as a matter of fact, if I remove the line
hpc07[0].fpg.local: sharing = default_shared
there is no error.
My next try has been to replace “sharing” with “shoring”, that is
hpc07[0].fpg.local: shoring = default_shared
this also works “fine” (meaning that there is no error). My understanding is that PBS “ignores”
unknown attributes (shoring is obviously not defined). So the puzzle is, why is there the
System error: (15010) in new_vnid
using the “regular” sharing=default_shared?
Does anybody know an alternative method to set the (v)node in exclusive mode?
Is there a way to know the list of legal attributes that it is possible to set for a node?
Thanks in advance and best regards,
Massimo

Could you please try this:

  1. make sure that running the hostname command on hpc07 returns the short hostname hpc07
  2. delete vnodes of hpc07 and the natural node hpc07
  3. delete the config2 file
  4. restart the pbs mom services
  5. create a config2 file $PBS_HOME/mom_priv/config.d/hpc07 with the below contents
    hpc07: sharing=default_excl
  6. restart the mom services
  7. qmgr: create node hpc07

Then the sharing should be set to default_excl ( pbsnodes -av )
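
For anyone scripting step 5, here is a minimal sketch in Python that only renders the text of such a file and rejects a misspelled sharing value before it reaches pbs_mom. The function name render_config2 and the LEGAL_SHARING tuple are my own names, not part of PBS; the six values are the ones I know from the PBS documentation for the sharing attribute. As far as I know, the documented way to install a version 2 file is `pbs_mom -s insert <name> <file>` rather than writing into config.d by hand, but please verify that against the admin guide for your version.

```python
# Sketch only: builds the file text, it does not talk to PBS.
# LEGAL_SHARING and render_config2 are hypothetical helper names.
LEGAL_SHARING = (
    "default_shared",
    "default_excl",
    "default_exclhost",
    "ignore_excl",
    "force_excl",
    "force_exclhost",
)


def render_config2(vnode: str, sharing: str) -> str:
    """Return the contents of a version 2 MoM config file for one vnode."""
    if sharing not in LEGAL_SHARING:
        raise ValueError(f"unknown sharing value: {sharing!r}")
    # Same layout as step 5 above: header line, then "<vnode>: sharing=<value>".
    return f"$configversion 2\n{vnode}: sharing={sharing}\n"


if __name__ == "__main__":
    print(render_config2("hpc07", "default_excl"), end="")
```

The output can be saved to a temporary file and then handed to pbs_mom on the compute node.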

Hope this helps

Thanks for your reply. The hostname (as reported by the command hostname) was
hpc07.fpg.local
I modified it to hpc07 both with the command and by changing the /etc/hostname file
Then I created the file
$PBS_HOME/mom_priv/config.d/hpc07
with the contents
$configversion 2
hpc07: sharing = default_shared
(just to check) but I still get the same error (after restarting the pbs_mom daemon).
pbs_mom: System error: (15010) in new_vnid, add_mom_data hpc07 failed
So I modified the file contents to
$configversion 2
hpc07: status = free
and this time starting the pbs_mom daemon does not report any error.
It looks like there is something “wrong” with the “sharing” attribute.
Any “trick” to by-pass it?
Thanks again and best regards,
Massimo

There should not be any space while assigning the attribute
for example:

$configversion 2
hpc07:sharing=default_excl

Thanks. I tried with
$configversion 2
hpc07:sharing=default_excl
(no space) but I got the same error
pbs_mom: System error: (15010) in new_vnid, add_mom_data hpc07 failed
In a previous reply, you mentioned
“delete vnodes of hpc07 and the natural node hpc07”
I deleted the hpc07 node using the “delete node” command through qmgr.
Did this delete both the vnode and “natural” node hpc07?
As a matter of fact, if I run the command
pbsnodes -av
there is nothing with hpc07.
My feeling is that the problem is, somehow, related to the sharing attribute.
I can also consider starting everything from scratch. Is there any way to create queues and nodes
with “exclusive” access as default?
Thanks again,
Massimo

You have to individually remove the vnodes first and then the natural vnode
qmgr: d n hpc07[0]
qmgr: d n hpc07
Update the config2 file, restart pbs mom services
qmgr: c n hpc07
pbsnodes -av
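
If there are several vnodes, the deletion order above can be generated rather than typed. This is a rough sketch, assuming child vnodes are named like hpc07[0], hpc07[1] and so on; deletion_commands is a hypothetical helper that only builds the qmgr command strings and does not run them.

```python
def deletion_commands(nodes, natural):
    """Return qmgr commands that delete child vnodes before the natural vnode."""
    # Child vnodes are assumed to look like "<natural>[<index>]".
    children = sorted(
        n for n in nodes if n != natural and n.startswith(natural + "[")
    )
    commands = [f'qmgr -c "delete node {name}"' for name in children]
    if natural in nodes:
        # The natural vnode always goes last.
        commands.append(f'qmgr -c "delete node {natural}"')
    return commands


if __name__ == "__main__":
    for command in deletion_commands(["hpc07", "hpc07[0]"], "hpc07"):
        print(command)
```

The printed commands are meant to be run as root on the PBS server host.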

No, there is no way except using the config2 file

Thanks a lot. I fear I am doing something really dumb.
If I run the
pbsnodes -av
command on both the server (hpc01) node and the client node (where I want the vnode) there is
no hpc07 node. If I run the command
qmgr: d n hpc07[0]
qmgr: d n hpc07
on the server node I get an "Unknown node" message, and this makes sense to me.
Now, it is my understanding that I need to create the config2 file in the /var/spool/pbs/mom_priv/config.d of the client node (hpc07), right?
And it is my understanding that I should restart the pbs mom services (daemon /opt/pbs/bin/qmgr) always on that node (hpc07), right?
If this is the case, I still get the same error. What is driving me crazy is that it is apparently “sharing” that causes trouble. If I insert a different attribute in the config2 file (like “state” or even something that does not exist like “foo”) there is no
System error: (15010) in new_vnid, add_mom_data hpc07 failed
error. Why does “sharing” trigger that error?
Thanks again for your patience!
Massimo

Please note,

  1. the “qmgr” commands should be run as a “root” user on the PBS Server host and not on the compute nodes.
  2. The config2 files should be created by the root user on the Compute Nodes (PBS MOM nodes) at this location $PBS_HOME/mom_priv/config.d/

Steps to check and retry:

  1. Take a pbs_snapshot -o /root/ , so that your configuration is saved
  2. qmgr : delete node @default # this will delete all the nodes on the system
  3. edit your /etc/hosts ( on the pbs server host and compute node(s) )
    xxxx.xxx.xxx.xxx hpc07.fpg.local hpc07 # add hpc07
  4. do the config2 setup in config.d directory on the compute nodes for default_excl
  5. qmgr : c n hpc07
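
To double-check step 3 on each host, a small Python sketch like the following can confirm that one /etc/hosts line carries both the full and the short name. resolves_both_names is a hypothetical helper, and the address in the example is a placeholder from the documentation range, not the real one.

```python
def resolves_both_names(hosts_text, fqdn, short):
    """Return True if a single hosts line lists both the FQDN and the short name."""
    for raw_line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names.
        fields = raw_line.split("#", 1)[0].split()
        names = fields[1:]
        if fqdn in names and short in names:
            return True
    return False


if __name__ == "__main__":
    # Placeholder address; substitute the real contents of /etc/hosts.
    sample = "127.0.0.1 localhost\n192.0.2.7 hpc07.fpg.local hpc07 # add hpc07\n"
    print(resolves_both_names(sample, "hpc07.fpg.local", "hpc07"))
```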

You do not have to set it to default_shared; when nothing is set, the default after deploying PBS Pro is default_shared.

I am not sure whether it is valid to set the status attribute in the config2 file.
Avoid doing that, as it might cause issues.

Please note: you can leave the default setting of default_shared, which is the default setting of PBS Pro after installation. You can instead request exclusivity per job:

qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000
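
If jobs are submitted from a script, the same request can be built as an argument list, which avoids the dash being mangled by copy and paste. A minimal sketch, where exclusive_qsub_args is a hypothetical helper that only builds the command and does not submit anything:

```python
import shlex


def exclusive_qsub_args(ncpus, command):
    """Build a qsub argument list for one chunk with exclusive placement."""
    return [
        "qsub",
        "-l", f"select=1:ncpus={ncpus}",
        "-l", "place=excl",
        "--",  # two plain hyphens separate qsub options from the executable
        *command,
    ]


if __name__ == "__main__":
    args = exclusive_qsub_args(1, ["/bin/sleep", "1000"])
    print(shlex.join(args))
```

The list can be passed to subprocess.run on a host where qsub is available.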

Thanks for all your help. In the end, I opted for a radical approach. I downloaded the source of openpbs, compiled, installed, and started everything from scratch (but first I saved the old configuration…).
Now the situation is much better. There is still the error when using “sharing=default_excl” in the config2
files, but using -lplace=excl in the submission script works as expected. Now that I have the source I can try to understand why there is that error when the pbs_mom daemon tries to add that attribute to the AVL tree. Thanks again,
Massimo

Same problem here (pbs_version = 20.0.1).

[root@adano23 ~]# pbs_mom -s show sharing_config
$configversion 2
adano23[0]: sharing = default_excl
[root@adano23 ~]# pbsnodes -v adano23[0] | grep sharing
sharing = default_shared
[root@adano23 ~]# pbsnodes -v adano23 | grep sharing
sharing = default_shared

I already tried to delete the “natural vnode” (adano23) and the vnode (adano23[0]), and to create them again. No success… Is it a bug?
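
To compare what the server reports across many vnodes without grepping one at a time, the pbsnodes output can be parsed. This is only a sketch, assuming the usual layout where each vnode name starts in the first column and its attributes follow as indented "name = value" lines; sharing_by_vnode is a hypothetical helper and the sample text is abbreviated and illustrative.

```python
def sharing_by_vnode(pbsnodes_output):
    """Map each vnode name to its reported sharing value."""
    result = {}
    current = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new vnode record.
            current = line.strip()
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "sharing":
                result[current] = value.strip()
    return result


if __name__ == "__main__":
    # Abbreviated, illustrative sample; real output has many more attributes.
    sample = (
        "adano23\n"
        "     state = free\n"
        "     sharing = default_shared\n"
        "\n"
        "adano23[0]\n"
        "     sharing = default_shared\n"
    )
    print(sharing_by_vnode(sample))
```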

You would need to delete the vnodes first and the natural node at the end.
Also, we have to make sure no jobs are running on them.