MOM sharing config

Hi,

I am configuring a cluster where all nodes will be job-exclusive: each node will run at most one job, and each job may span several nodes.

So, each MOM/node needs the config:
sharing = force_exclhost

After some digging it appears that the “right place” (the only place) this can be defined is in a so-called “version 2 config file” (under mom_priv/config.d/).

I can make this work, if I create a config file (called a “script” in the manuals) with content like:

$configversion 2
node01: sharing = force_exclhost
node02: sharing = force_exclhost
node03: sharing = force_exclhost

All nodes in the cluster are named in the file, and the file is copied to all nodes in the cluster.
In principle, it is sufficient to have just the line for the local node name, but that means that the nodes will have individual config, and I would rather avoid that.
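For illustration only, here is a minimal sketch of generating that shared file from a list of node names, so the names are maintained in one place. The function name `render_v2_config` and the sample node names are hypothetical; the file format is just what is shown above (a `$configversion 2` header followed by one line per node).

```python
def render_v2_config(nodes, sharing="force_exclhost"):
    """Return the text of a version 2 MoM config file naming every node."""
    lines = ["$configversion 2"]
    # One "<node>: sharing = <value>" line per node, as in the example above.
    lines += [f"{name}: sharing = {sharing}" for name in nodes]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical node names matching the example in this post.
    print(render_v2_config([f"node{i:02d}" for i in range(1, 4)]), end="")
```

The output would still have to be copied to every node, so this only moves the bookkeeping to the node list; it does not remove it.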

Is there a way to make this config without having to explicitly name every node (either in a single file with all the node names - or in a node-centric file, which then differs from node to node)?

What I am looking for is a way to glob node names, or to leave them out entirely, in MOM version 2 config files.

Thanks,

Bjarne

Hi,

If you are running a CPUSET mom on an SGI Altix machine, then you have to use a version 2 config file to change the sharing option of each vnode that the CPUSET mom reports. (Note that it is illegal to mix sharing modes across vnodes on the same host.)

Also, for a cluster, if you need to change the sharing attribute, you need to add a version 2 config file for each mom, naming the vnode and setting it to force_exclhost. Finally, you will have to run pbs_mom -s insert vnodefile input_file on each mom.

HTH
Subhasis

We are not running on any fancy machine. Just simple compute nodes, so one vnode per physical node as you describe.

However, it is not possible to directly update with qmgr:

[bjb@bifrost1 ~]$ sudo qmgr -c "set node dn101 sharing=force_exclhost"
qmgr obj=dn101 svr=default: Cannot set attribute, read only or insufficient permission sharing
qmgr: Error (15003) returned from server

This was actually what I tried to do earlier on. However, it failed, and I turned to the manual (RTFM, I know), and it states that there are 4 ways to configure the MOM - and not all methods will work for all parameters: qmgr, pbsnodes, as well as v1 (mom_priv/config) & v2 (mom_priv/config.d/*) config files.

I read in the Admin Guide (v13.1), p. AG-56:

Use pbs_mom -s insert to set the sharing vnode attribute

I understand from this line that this particular setting cannot be set with qmgr, and that the “right thing” is to add files to mom_priv/config.d as per

dn101: sharing = force_exclhost

Really, “pbs_mom -s insert” just adds a “version 2 config file” (in config.d) for the mom. The file is actually only used once the mom is restarted (also, I think that “pbs_mom -s insert” does not error-check the file; it just copies it in). Personally, I opt to just create the correct files at node deployment and not use the pbs_mom interface.
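As a rough sketch of what creating the file at deployment could look like: the same script can be shipped to every node and derive the node name locally. This assumes the vnode name equals the short host name, and the function names and the target path are hypothetical (the path is passed in rather than hard-coded, since it depends on where PBS_HOME lives).

```python
import socket
from pathlib import Path


def local_v2_config(hostname=None, sharing="force_exclhost"):
    """Build the v2 config text for a single host.

    Assumption: the vnode name is the short host name (domain part dropped).
    """
    name = (hostname or socket.gethostname()).split(".")[0]
    return f"$configversion 2\n{name}: sharing = {sharing}\n"


def write_config(path, hostname=None):
    """Write the generated config to `path` and return the text written."""
    text = local_v2_config(hostname)
    Path(path).write_text(text)
    return text
```

The resulting files still differ from node to node, but the script that produces them is identical everywhere, which is the part that needs maintaining.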

So, I am pretty comfortable with the way to get the config in there (not by qmgr).
My question is just whether there is any kind of shorthand in the notation for v2 config files that would allow a single line in the config file to apply to more than one node (i.e. globbing or all nodes), for settings which must go in a version 2 file but will be constant for the entire cluster (or a large number of nodes).

Thanks,

/Bjarne

PS: Why there is an API (pbs_mom -s) to create, list and cat files in mom_priv/config.d is something which I don’t quite understand yet, but as long as I don’t really have to use it, I can live with it.

@buchmann,

Yes, you are right - you need to create the v2 config file for each mom in your case. I understand this is a bit cumbersome. There is no current way to do this in a single config file for all nodes.

The reason we created an api/interface is in anticipation of the information moving to the database. This is something we need to still complete…

HTH

Regards,
Subhasis

Thanks Subhasis,

[edited]

Just to clarify:

This actually seems correct. What I tried to do:

seems actually to define multiple vnodes on the single host - which is definitely not what I want.
Thus it seems like there is presently no way around having a node-centric version 2 MOM config file, i.e. differing config files on each node, although the config is actually exactly the same.

(There is nothing broken in the present implementation, it is just a lot of bookkeeping.)

/Bjarne

I would request that you create an enhancement ticket so that other community members can implement this missing functionality of changing the sharing attribute on normal nodes via qmgr.

Regards,
Subhasis

https://pbspro.atlassian.net/browse/PP-416


I don’t know exactly how Open Source PBSPro differs from commercial PBSPro, but on commercial PBSPro these kinds of version 2 config files are usually things I create in a modified pbs_habitat script – the script is called when PBS_HOME is missing for a MoM so it’s quite easy to make sure that it creates exactly the v1 and v2 config files you need by tinkering with the script. Since the script runs when you know the hostname, it’s easy to add the correct line.

In node images I usually do not have PBS_HOME at all. I make sure that PBS_EXEC exists (often it’s not in the image, but the image has a link to a shared filesystem directory) and tinker with the pbs_habitat script to coax it into making the PBS_HOME I need.

I am blissfully unaware of pbs_habitat. I have found a single mention of the file in the Ref Guide (v13.1, Table 14-1, page RG-477), but that is simply a file listing and does not say anything about what the file is for. It is not mentioned at all in the Admin Guide (v13.1). It is mentioned in the Install Guide as part of the SGI specifics (around page IG-61) as well as in the “Upgrade” section - not stuff I expected to read.

I will probably have to look further into that, but really I would prefer not to run too many custom-made scripts for setting up a single node. We will have several different services which must run on both frontend and nodes - e.g. Ganglia and Nagios interactions.

Our installation is either kickstart based (install from scratch, with due modifications along the way) or golden-image based, as you describe:

But in that case, I will need to

  1. Explicitly remove the PBS_HOME directory after taking an image from a golden node.
  2. Keep one or more scripts (pbs_habitat or something else) up-to-date to automatically generate the necessary config.
  3. Make sure to run the correct post-install scripts.

Personally, I would probably opt to keep PBS_HOME in the image, and only edit the files that I know need updating (such as replacing the golden node name with the local host name in specified config files).
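A minimal sketch of that replacement step, assuming the golden node's name appears as a whole word in the files to be fixed and that the local node name is the short host name. The helper names `rename_node` and `update_files` are hypothetical, and which files to pass in is left to the caller.

```python
import re
import socket
from pathlib import Path


def rename_node(text, golden, local):
    """Replace whole-word occurrences of the golden node name in `text`.

    Names that merely contain the golden name (e.g. dn1010 vs dn101)
    are left untouched.
    """
    pattern = re.compile(rf"(?<![\w-]){re.escape(golden)}(?![\w-])")
    # A function replacement avoids any escape processing of `local`.
    return pattern.sub(lambda match: local, text)


def update_files(paths, golden, local=None):
    """Rewrite each file in `paths` in place, only if its content changes."""
    local = local or socket.gethostname().split(".")[0]
    for path in map(Path, paths):
        old = path.read_text()
        new = rename_node(old, golden, local)
        if new != old:
            path.write_text(new)
```

For example, `rename_node("dn101: sharing = force_exclhost\n", "dn101", "dn205")` turns the golden node's line into the one for dn205.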

Anyway, thanks a lot for the input.

Best,

/Bjarne