How to aggregate heterogeneous entire clusters into “one big cluster” for cross-cluster job submission

smilezy · September 26, 2018, 12:07pm

How can I aggregate entire heterogeneous clusters into “one big cluster” for cross-cluster job submission, as mentioned on PBS Works?
Thanks for your attention.

adarsh · September 26, 2018, 12:15pm

Please check the following options:

Peer Scheduling, if you would like to do it at the PBS level using PBS Pro OSS.
See section 4.9.31, "Peer Scheduling", in https://www.pbsworks.com/pdfs/PBS18.2_BigBook.pdf
PBS Access Suite (commercial product): a portal for submitting jobs across multiple PBS Pro clusters.
Connect all the compute nodes from the respective clusters to one PBS Pro server, and manage the jobs with queues and policies.

Note: you cannot mix compute nodes running Windows and Linux under one PBS Server. A PBS Server hosted on Linux serves Linux compute nodes, and a PBS Server hosted on Windows serves Windows compute nodes.

If my answer does not address your question, please explain your requirement in a bit more detail.

smilezy · September 27, 2018, 2:06am

Hi adarsh,
Thank you for your reply and attention. This is the answer I wanted, and I will try it. Thank you very much.

smilezy · September 27, 2018, 3:01am

Hello adarsh,
I have another question about the PBS database. PBS Pro's job submission records, scheduling status, and other information are stored in the PostgreSQL database. However, I found that the database is not updated immediately each time a job is submitted, and sometimes the job information is not saved in the job_attr table at all. I want to develop a cluster management system that includes job management, resource management, and so on. How can I get job information in real time during development?
Thank you for your attention.

adarsh · September 27, 2018, 7:47am

The information displayed by qmgr -c "p s", pbsnodes -av, qstat -fx, and qstat -anws1 is served from memory.
Any attribute you change on any object (for example, from qmgr) is stored in the database immediately.
Node states are not stored in the database, because they are transient in nature.

You can get real-time job information by running qstat -fx, tracejob, or pbs_dtj.
You can also use the PBS libraries to develop code that suits your requirements; please follow the guide on the Altair site (High-performance Computing (HPC) and Cloud Solutions | Altair).
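As a minimal sketch of the "run qstat and parse the output" route, assuming the PBS client commands are on the PATH of the machine running the back end, the Python below calls qstat -fx and turns its default text output into one dictionary per job. The parser assumes the usual layout (a "Job Id:" line, indented "name = value" attribute lines, and tab-indented continuation lines for long wrapped values); parse_qstat_f and query_jobs are illustrative names, not part of PBS. The same logic ports directly to a Java back end.

```python
import subprocess


def parse_qstat_f(text):
    """Parse default `qstat -f` / `qstat -fx` text output into
    {job_id: {attribute_name: value}}."""
    jobs = {}
    current = None   # attribute dict of the job being parsed
    last_key = None  # attribute that a continuation line extends
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            job_id = line.split(":", 1)[1].strip()
            current = jobs.setdefault(job_id, {})
            last_key = None
        elif current is None or not line.strip():
            continue  # skip anything before the first job, and blank lines
        elif line.startswith("\t") and last_key is not None:
            # Wrapped long value: drop the single leading tab and append.
            current[last_key] += line[1:]
        elif " = " in line:
            key, value = line.lstrip().split(" = ", 1)
            current[key] = value
            last_key = key
    return jobs


def query_jobs(job_ids=()):
    """Run `qstat -fx` (current and finished jobs) and parse the result.
    Requires the PBS client commands on PATH."""
    cmd = ["qstat", "-fx"] + list(job_ids)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_qstat_f(result.stdout)
```

parse_qstat_f can be unit-tested on saved output without a PBS server; only query_jobs needs a live installation.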

Most cluster management systems already have PBS Pro OSS integrated. Could you please explain what you would like to achieve, so that other experienced community members can contribute their suggestions?

smilezy · September 28, 2018, 2:35am

Thank you for your reply. We want to develop a job management system based on PBS, with monitoring functions similar to Compute Manager. We use Java to implement the back-end logic, but we don't know how to get job information such as the job ID, name, state, and so on.

Is there any way to get real-time job information other than running qstat -fx, tracejob, or pbs_dtj and then parsing the query results?

We are currently getting job information from the PBS PostgreSQL database, but we found that not all submitted jobs are saved to the "job", "job_attr", and "job_scr" tables. In most cases the job attributes in particular are not saved, so we cannot get complete job information from the PBS database. Under what conditions are job attributes saved in "job_attr"? If a job is not saved to the PBS PostgreSQL database, how is it kept in memory instead? And why is there a delay: why are jobs submitted through the command line not saved to the database immediately?

adarsh · September 28, 2018, 7:50am

Thank you for the information.

The goal of the PBS datastore is not to store all of the cluster's information (historical and real-time). Connecting to the PBS datastore to get cluster information is not a good idea: the datastore was never intended to support that kind of activity (integration with a cluster management system). Connecting to the PBS datastore while the PBS Server service is running is not recommended, as it might have unknown consequences for the integrity of the datastore.

Instead, use the PBS Pro API calls to achieve this in an optimised way.
Use the PBS libraries to interact with the system from within your cluster manager (using C, Python, or Java).
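The PBS libraries are built around the IFL C API (pbs_connect, pbs_statjob, pbs_statfree, pbs_disconnect), which returns status as a linked list of struct batch_status entries, each holding a linked list of struct attrl. Below is a hedged sketch of calling it from Python through ctypes. It assumes a shared library at /opt/pbs/lib/libpbs.so (both the path and the presence of a shared build are assumptions; adjust for your installation), and the struct layouts mirror pbs_ifl.h as commonly published, so check them against the header shipped with your version. From Java, the same calls can be reached through JNI or JNA.

```python
import ctypes


class Attrl(ctypes.Structure):
    pass


# struct attrl { struct attrl *next; char *name; char *resource;
#                char *value; enum batch_op op; }   (see pbs_ifl.h)
Attrl._fields_ = [
    ("next", ctypes.POINTER(Attrl)),
    ("name", ctypes.c_char_p),
    ("resource", ctypes.c_char_p),
    ("value", ctypes.c_char_p),
    ("op", ctypes.c_int),
]


class BatchStatus(ctypes.Structure):
    pass


# struct batch_status { struct batch_status *next; char *name;
#                       struct attrl *attribs; char *text; }
BatchStatus._fields_ = [
    ("next", ctypes.POINTER(BatchStatus)),
    ("name", ctypes.c_char_p),
    ("attribs", ctypes.POINTER(Attrl)),
    ("text", ctypes.c_char_p),
]


def batch_status_to_dict(head):
    """Walk a batch_status list into {object_name: {attr_name: value}}."""
    result = {}
    node = head
    while node:  # a NULL ctypes pointer is falsy
        attrs = {}
        attr = node.contents.attribs
        while attr:
            a = attr.contents
            key = a.name.decode()
            if a.resource:  # e.g. Resource_List + ncpus -> Resource_List.ncpus
                key += "." + a.resource.decode()
            attrs[key] = a.value.decode() if a.value else ""
            attr = a.next
        result[node.contents.name.decode()] = attrs
        node = node.contents.next
    return result


def stat_all_jobs(libpath="/opt/pbs/lib/libpbs.so", server=None):
    """Status every job on `server` (None = default server) via IFL."""
    lib = ctypes.CDLL(libpath)  # assumed location of a shared libpbs
    lib.pbs_connect.argtypes = [ctypes.c_char_p]
    lib.pbs_connect.restype = ctypes.c_int
    lib.pbs_statjob.argtypes = [ctypes.c_int, ctypes.c_char_p,
                                ctypes.POINTER(Attrl), ctypes.c_char_p]
    lib.pbs_statjob.restype = ctypes.POINTER(BatchStatus)
    lib.pbs_statfree.argtypes = [ctypes.POINTER(BatchStatus)]
    lib.pbs_statfree.restype = None
    lib.pbs_disconnect.argtypes = [ctypes.c_int]

    conn = lib.pbs_connect(server.encode() if server else None)
    if conn < 0:
        raise RuntimeError("pbs_connect failed")
    try:
        # NULL job id and NULL attribute list: all jobs, all attributes.
        status = lib.pbs_statjob(conn, None, None, None)
        if not status:
            return {}  # no jobs, or an error (inspect pbs_errno in real code)
        try:
            return batch_status_to_dict(status)
        finally:
            lib.pbs_statfree(status)
    finally:
        lib.pbs_disconnect(conn)
```

batch_status_to_dict only walks memory, so it can be tested without a server; stat_all_jobs needs a running PBS Server and the library.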

smilezy · September 29, 2018, 1:58am

Thank you very much. If we want to collect load information for all execution hosts (not vnodes) and show it in a pie chart on the management system's homepage, do you have any advice? Does PBS provide a way to get execution host load information? Thanks again.

adarsh · September 29, 2018, 7:27am

Then you can use an execution host periodic hook that collects the load information of the compute nodes at regular intervals. Health check scripts can be implemented in the same way.

The exechost_periodic hook provides a framework within which any kind of information can be collected periodically.
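A minimal sketch of such a hook, assuming Linux execution hosts and a hypothetical host-level custom resource named hostload that has been created beforehand (for example with qmgr -c "create resource hostload type=float, flag=h"). The hook reads the 1-minute load average and publishes it on the host's natural vnode, where pbsnodes -av or the IFL status calls can pick it up for the homepage chart. It would be installed roughly with qmgr -c "create hook load_hook", qmgr -c "set hook load_hook event = exechost_periodic", qmgr -c "set hook load_hook freq = 60" and qmgr -c "import hook load_hook application/x-python default load_hook.py". The code stays Python 2/3 compatible because older PBS Pro releases run hooks under Python 2; check the hook API calls against the Hooks Guide for your version.

```python
def read_loadavg(path="/proc/loadavg"):
    """Return the 1-, 5- and 15-minute load averages from a Linux
    /proc/loadavg-style file, e.g. "0.52 0.41 0.33 1/523 12345"."""
    with open(path) as f:
        fields = f.read().split()
    return float(fields[0]), float(fields[1]), float(fields[2])


try:
    import pbs  # only importable inside the PBS hook environment
except ImportError:
    pbs = None

if pbs is not None and hasattr(pbs, "event"):
    e = pbs.event()
    if e.type == pbs.EXECHOST_PERIODIC:
        try:
            load1 = read_loadavg()[0]
            host = pbs.get_local_nodename()  # this host's natural vnode
            # "hostload" is a hypothetical custom resource; it must already
            # exist on the server (e.g. created as type=float, flag=h).
            e.vnode_list[host] = pbs.vnode(host)
            e.vnode_list[host].resources_available["hostload"] = load1
        except Exception as err:
            # Log the failure; the next periodic run will try again.
            pbs.logmsg(pbs.LOG_DEBUG, "load hook failed: %s" % err)
    e.accept()
```

Keeping the /proc parsing in a plain function means it can be tested outside PBS; the same pattern extends to health-check values by publishing additional custom resources.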

Other option:

Ganglia

Thank you for these queries
