Migrating PBS Pro from 14.2 to 18.2.3

Hello,

We have recently completed a new installation of PBS Pro 18.2.3. However, our production installation is PBS Pro 14.2 on a separate host, and it currently runs independently.

What we’d like to do is migrate all of the job history and configurations to this new install at 18.2.3.

I have saved off the mom_config, the sched_configs, and qmgr -c ‘p s’ output.

What needs to be done to import the job history so that a qstat of a job number that ran on 14.2 also works on the PBS Pro 18.2.3 instance?

Thanks,
Siji

  • Job history is stored in the PBS datastore (a PostgreSQL database)
  • For the 18.2.3 server to come up with the job history and the node/queue configuration:
    • you have to make sure /etc/hosts, /etc/pbs.conf, and the hostnames are exactly the same on both servers (see the quick check below).
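For example (a rough check only; "pbs18" is a placeholder for your 18.2.3 host), you can confirm the two hosts' files match before starting:

    # run from the 14.2 host; process substitution requires bash
    diff /etc/pbs.conf <(ssh pbs18 cat /etc/pbs.conf)
    diff /etc/hosts <(ssh pbs18 cat /etc/hosts)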

Solution 1:
1. Stop the PBS Services of 14.2
2. Take a backup of $PBS_HOME and $PBS_EXEC
3. Upgrade this production server to 18.2.3

From here (both cases are sketched below):

Case 1:

                   - Take a pg_dump of the datastore on the upgraded server
                   - import this pg_dump onto your 18.2.3 server
                   - (make sure you always back up $PBS_HOME with the PBS services stopped, so that you can revert)

OR

        Case 2:

                 - copy $PBS_HOME, with permissions intact, to the 18.2.3 server (make sure you back up its $PBS_HOME after stopping the services, so that you can revert)
                 - after copying $PBS_HOME, make sure the permissions are intact
                 - start the services
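A rough sketch of both cases, assuming the default paths ($PBS_HOME = /var/spool/pbs, $PBS_EXEC = /opt/pbs) and a placeholder hostname "pbs18" for the 18.2.3 server. The datastore port, database name, and data service user shown are also assumptions, so adjust them for your site:

    # Case 1: dump the PBS datastore (PBS ships its own PostgreSQL under $PBS_HOME/datastore)
    /etc/init.d/pbs stop
    /opt/pbs/sbin/pbs_dataservice start      # bring up only the data service
    pg_dump -p 15007 -U postgres pbs_datastore > /tmp/pbs_datastore.sql   # use a pg_dump that matches the datastore's PostgreSQL version
    /opt/pbs/sbin/pbs_dataservice stop

    # Case 2: copy $PBS_HOME wholesale, preserving ownership and permissions
    /etc/init.d/pbs stop
    tar -C / -cpf - var/spool/pbs | ssh pbs18 'tar -C / -xpf -'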

Please note that you have to do the above only if you want the job history migrated.

  1. You can get the job history from the accounting logs, but not in the formatted way that qstat -fx / qstat -f displays it. Also, it is not recommended to keep job history indefinitely; job history is usually limited to a couple of weeks, or (rarely) a couple of months.

  2. Alternatively, you can run qstat -fx on the 14.2 server and save the output to a web page, so that users can check old jobs directly for a time period, and use the new system with only the configuration migrated (see the example after these steps):
    a. qmgr < output_of_qmgr_p_s_of_14.2.txt
    b. copy the sched_config file to $PBS_HOME/sched_priv/sched_config
    c. update the $PBS_HOME/mom_priv/config
    You are all set.
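For example, on the 14.2 server (the job id, file name, and web root below are only illustrations; substitute the real value of $PBS_HOME from /etc/pbs.conf):

    grep ';E;1234\.' $PBS_HOME/server_priv/accounting/*    # raw end-of-job accounting records for job 1234
    qstat -fx > /var/www/html/pbs_jobs_14.2.txt             # full finished-job output, published for users to browse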

Hope this helps.

Adarsh,

Thanks for your response!

First, I want to clarify that 18.2.3 is a new install on a separate server cluster (totally isolated from the existing 14.2 installation).

That said, I have a few follow-up questions before I test the procedure:

  1. Do we need to stop all compute jobs and the MoMs on the compute nodes when you say "Stop the PBS Services of 14.2"?

  2. Assuming we performed the operations from Case 2 on the new 18.2.3 PBS Pro server for testing, and then we did it again on the same 18.2.3 PBS Pro server for the final cutover, would there be any problems? i.e. repeated job numbers or other unforeseen issues?

Thanks,
Siji

Thank you! Pleasure!

18.2.3 is new server
14.2 is existing server

  1. Do you want to upgrade the existing 14.2 server, or keep it as it is?
    1a. Upgrade the existing 14.2 server to 18.2.3, or keep it at 14.2?
    1b. The reasons for upgrading your existing server from 14.2 to 18.2.3:
      • to make sure the job history and configuration are all saved and intact
      • so we can then just copy the $PBS_HOME to the new 18.2.3 server

Only the services of the PBS Pro Server / Sched / Comm need to be stopped (on the server host).
There is no need to stop the PBS MoM services (on the compute nodes).

Case 2 would not carry over the job history and job count from the 14.2 server. It only carries over the queue, node, sched, and server configuration to 18.2.3. When you submit a job, the job id will start from 0 (or, if you have already tested some sample jobs, it will continue from that job id onwards).

With Case 2, you would 100% not have repeated job ids, as you are using a fresh PBS datastore.

Thank you

Adarsh,

Thanks for the additional direction. To answer your question: we will be shutting down the 14.2 server, and 18.2.3 will be our sole PBS server once the migration is complete.

Since we’d prefer to keep job history, can you tell us how we would do this cleanly?

Essentially, we’d want to transfer all of the job histories and PBS configurations from the 14.2 server to the 18.2.3 server. Also, we’d want to stop jobs on the 14.2 server at job number XYZ and startup new jobs on the 18.2.3 server at job number XYZ + 1.

Since you will be shutting down this 14.2 server,

  • /etc/init.d/pbs stop
  • take a backup, preserving permissions, of $PBS_HOME, $PBS_EXEC, /etc/pbs.conf, and /etc/init.d/pbs
  • do an rpm -Uvh pbspro-server-18.2.3*.rpm on the 14.2 Server
  • /etc/init.d/pbs start
  • qstat -Bf | grep -i version ( make sure it is 18.2.3. xxxx )
  • /etc/init.d/pbs stop
  • Take a backup of $PBS_HOME, preserving permissions (a backup example follows this list)
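A permission-preserving backup could look like this (the backup directory and the default install paths are assumptions):

    tar -cpzf /backup/pbs_home_14.2.tar.gz -C / var/spool/pbs
    tar -cpzf /backup/pbs_misc_14.2.tar.gz -C / opt/pbs etc/pbs.conf etc/init.d/pbs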

On the 18.2.3 server:

  • /etc/init.d/pbs stop
  • move $PBS_HOME to $PBS_HOME.old
  • copy the $PBS_HOME from the upgraded 14.2 server to the 18.2.3 server, at the same path (see the copy example after this list)
  • make sure /etc/pbs.conf matches in all respects with the 14.2 server
  • make sure /etc/hosts file matches in all respects with the 14.2 server
  • /etc/init.d/pbs start
  • your new PBS server will have all the job history and accounting logs of the old server
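One way to do the move and copy while keeping ownership and permissions intact (the hostname "pbs14" and the default $PBS_HOME path are placeholders):

    mv /var/spool/pbs /var/spool/pbs.old                                   # on the 18.2.3 server
    rsync -aH --numeric-ids root@pbs14:/var/spool/pbs/ /var/spool/pbs/     # pull $PBS_HOME from the upgraded 14.2 host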

Essentially, we’d want to transfer all of the job histories and PBS configurations from the 14.2 server to the 18.2.3 server. Also, we’d want to stop jobs on the 14.2 server at job number XYZ and startup new jobs on the 18.2.3 server at job number XYZ + 1.
[Answer]: Yes, the above procedure would start the job ids from XYZ + 1.

The key here is keeping a pristine backup of the old version, taken with the PBS services stopped on the server host, so that we can revert if needed.

Good luck

Hello Adarsh,

Your suggestions worked as expected…

However, we are seeing an issue with nodes being removed from the PBS server, such that we have no nodes at all:

[~] # pbsnodes -l
pbsnodes: Server has no node list

I tried re-adding the nodes but they still were getting deleted. Any thoughts on this issue?

-Siji

Thank you Siji

  • Do you have the same /etc/hosts as that of the old server?
  • Do you have a copy of qmgr -c 'p n @default' / qmgr -c 'p s' from the old server?
    - if not, you can create the nodes in a loop (see the example below):
    for i in {1..10}; do qmgr -c "c n node$i"; done

The nodes should be listed in /etc/hosts, or they should be DNS-resolvable by hostname (forward and reverse).
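If you do have the old server's output saved, replaying it is simpler than the loop (the file name is only an example):

    qmgr -c 'p n @default' > nodes_14.2.txt   # run on the old server
    qmgr < nodes_14.2.txt                     # replay on the new server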

Thank you

It looks like the server is deleting the nodes upon restart, and I'm still not sure why. Here's what I see in the server log:

09/19/2019 09:06:10;0004;Server@bright01;Node;nsd01;attributes set: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;nsd01;attributes set: resources_available.ncpus = 1
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;nsd01;attributes unset: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;nsd01;attributes set: queue =
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;license;attributes set: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;license;attributes set: resources_available.ncpus = 1
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;license;attributes unset: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;license;attributes set: queue =
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;globus0001;attributes set: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;globus0001;attributes set: resources_available.ncpus = 1
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;globus0001;attributes unset: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;globus0001;attributes set: queue =
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;node0116;deleted at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0080;Server@bright01;Req;req_reject;Reject reply code=15062, aux=0, type=9, from root@bright01.thunder.ccast
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 0 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 49 request received from root@bright01.thunder.ccast, sock=19

qmgr -c 'p s' | grep lic
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000

Any thoughts as to the source of the problem here?

Thank you for posting these logs.

Are you using any cluster manager tools?

Yes, so we use Bright to manage our cluster but we rely on qmgr directly to configure the PBS server.

If I am incorrect, please pardon me, but I think some scripts on the cluster manager side might be doing this. You can run audit software to find out who the culprit is (see the sketch below); PBS Pro OSS would not be doing this by itself at any point in time.
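For example, a minimal auditd sketch (assuming a default /opt/pbs install path, and assuming the changes come through the qmgr/pbsnodes CLIs rather than the API) to see what is issuing the node deletions:

    auditctl -w /opt/pbs/bin/qmgr -p x -k pbs_node_changes
    auditctl -w /opt/pbs/bin/pbsnodes -p x -k pbs_node_changes
    ausearch -k pbs_node_changes -i    # shows which user/process executed them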

You were correct, Adarsh - and we’ve since rectified the issues.

Thanks again for all your suggestions!
