Migrating PBS Pro from 14.2 to 18.2.3

Hello,

We have recently completed a new installation of PBS Pro 18.2.3. However, our production installation is PBS Pro 14.2 on a separate host, and it currently runs independently.

What we’d like to do is migrate all of the job history and configurations to this new install at 18.2.3.

I have saved off the mom_config, the sched_configs, and qmgr -c ‘p s’ output.

What needs to be done to import the job history so that a qstat of a job number that ran on 14.2 also works on the PBS Pro 18.2.3 instance?

Thanks,
Siji

  • Job history is stored in the PBS datastore (a PostgreSQL database)
  • For the 18.2.3 server to come up with the job history and the node/queue configuration:
    • you have to make sure /etc/hosts, /etc/pbs.conf, and the hostnames are exactly the same on both servers (see the quick check below).
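For example (a rough check only; "pbs18" is a placeholder for your 18.2.3 host), you can confirm the two hosts' files match before starting:

    # run from the 14.2 host; process substitution requires bash
    diff /etc/pbs.conf <(ssh pbs18 cat /etc/pbs.conf)
    diff /etc/hosts <(ssh pbs18 cat /etc/hosts)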

Solution 1:
1. Stop the PBS Services of 14.2
2. Take a backup of $PBS_HOME and $PBS_EXEC
3. Upgrade this production server to 18.2.3

From here (both cases are sketched below):

Case 1:

                   - Take a pg_dump of the datastore on the upgraded server
                   - import this pg_dump onto your 18.2.3 server
                   - (make sure you always back up $PBS_HOME with the PBS services stopped, so that you can revert)

OR

        Case 2:

                 - copy $PBS_HOME, with permissions intact, to the 18.2.3 server (make sure you back up its $PBS_HOME after stopping the services, so that you can revert)
                 - after copying $PBS_HOME, make sure the permissions are intact
                 - start the services
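A rough sketch of both cases, assuming the default paths ($PBS_HOME = /var/spool/pbs, $PBS_EXEC = /opt/pbs) and a placeholder hostname "pbs18" for the 18.2.3 server. The datastore port, database name, and data service user shown are also assumptions, so adjust them for your site:

    # Case 1: dump the PBS datastore (PBS ships its own PostgreSQL under $PBS_HOME/datastore)
    /etc/init.d/pbs stop
    /opt/pbs/sbin/pbs_dataservice start      # bring up only the data service
    pg_dump -p 15007 -U postgres pbs_datastore > /tmp/pbs_datastore.sql   # use a pg_dump that matches the datastore's PostgreSQL version
    /opt/pbs/sbin/pbs_dataservice stop

    # Case 2: copy $PBS_HOME wholesale, preserving ownership and permissions
    /etc/init.d/pbs stop
    tar -C / -cpf - var/spool/pbs | ssh pbs18 'tar -C / -xpf -'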

Please note that you have to do the above only if you want the job history migrated.

  1. You can get the job history from the accounting logs, but not in the formatted way that qstat -fx / qstat -f displays it. Also, it is not recommended to keep job history indefinitely; job history is usually limited to a couple of weeks, or (rarely) a couple of months.

  2. Alternatively, you can run qstat -fx on the 14.2 server and save the output to a web page, so that users can check old jobs directly for a time period, and use the new system with only the configuration migrated (see the example after these steps):
    a. qmgr < output_of_qmgr_p_s_of_14.2.txt
    b. copy the sched_config file to $PBS_HOME/sched_priv/sched_config
    c. update the $PBS_HOME/mom_priv/config
    You are all set.
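For example, on the 14.2 server (the job id, file name, and web root below are only illustrations; substitute the real value of $PBS_HOME from /etc/pbs.conf):

    grep ';E;1234\.' $PBS_HOME/server_priv/accounting/*    # raw end-of-job accounting records for job 1234
    qstat -fx > /var/www/html/pbs_jobs_14.2.txt             # full finished-job output, published for users to browse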

Hope this helps.

Adarsh,

Thanks for your response!

First, I want to clarify that 18.2.3 is a new install on a separate server cluster (totally isolated from the existing 14.2 installation).

That said, I have a few follow-up questions before I test the procedure:

  1. Do we need to stop all compute jobs and the MoMs on the compute nodes when you say "Stop the PBS Services of 14.2"?

  2. Assuming we performed the operations from Case 2 on the new 18.2.3 PBS Pro server for testing, and then we did it again on the same 18.2.3 PBS Pro server for the final cutover, would there be any problems? i.e. repeated job numbers or other unforeseen issues?

Thanks,
Siji

Thank you! Pleasure!

18.2.3 is new server
14.2 is existing server

  1. Do you want to upgrade the existing 14.2 server, or keep it as it is?
    1a. Upgrade the existing 14.2 server to 18.2.3, or keep it at 14.2?
    1b. The reasons for upgrading your existing server from 14.2 to 18.2.3:
      • to make sure the job history and configuration are all saved and intact
      • so we can then just copy the $PBS_HOME to the new 18.2.3 server

Only the services of the PBS Pro Server / Sched / Comm need to be stopped (on the server host).
There is no need to stop the PBS MoM services (on the compute nodes).

Case 2 would not carry over the job history and job count from the 14.2 server. It only carries over the queue, node, sched, and server configuration to 18.2.3. When you submit a job, the job id will start from 0 (or, if you have already tested some sample jobs, it will continue from that job id onwards).

With Case 2, you would 100% not have repeated job ids, as you are using a fresh PBS datastore.

Thank you

Adarsh,

Thanks for the additional direction. To answer your question: we will be shutting down the 14.2 server, and 18.2.3 will be our sole PBS server once the migration is complete.

Since we’d prefer to keep job history, can you tell us how we would do this cleanly?

Essentially, we’d want to transfer all of the job histories and PBS configurations from the 14.2 server to the 18.2.3 server. Also, we’d want to stop jobs on the 14.2 server at job number XYZ and startup new jobs on the 18.2.3 server at job number XYZ + 1.

Since you will be shutting down this 14.2 server,

  • /etc/init.d/pbs stop
  • take a backup, preserving permissions, of $PBS_HOME, $PBS_EXEC, /etc/pbs.conf, and /etc/init.d/pbs
  • do an rpm -Uvh pbspro-server-18.2.3*.rpm on the 14.2 Server
  • /etc/init.d/pbs start
  • qstat -Bf | grep -i version ( make sure it is 18.2.3. xxxx )
  • /etc/init.d/pbs stop
  • Take a backup of $PBS_HOME, preserving permissions (a backup example follows this list)
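A permission-preserving backup could look like this (the backup directory and the default install paths are assumptions):

    tar -cpzf /backup/pbs_home_14.2.tar.gz -C / var/spool/pbs
    tar -cpzf /backup/pbs_misc_14.2.tar.gz -C / opt/pbs etc/pbs.conf etc/init.d/pbs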

On the 18.2.3 server:

  • /etc/init.d/pbs stop
  • move $PBS_HOME to $PBS_HOME.old
  • copy the $PBS_HOME from the upgraded 14.2 server to the 18.2.3 server, at the same path (see the copy example after this list)
  • make sure /etc/pbs.conf matches in all respects with the 14.2 server
  • make sure /etc/hosts file matches in all respects with the 14.2 server
  • /etc/init.d/pbs start
  • your new PBS server will have all the job history and accounting logs of the old server
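One way to do the move and copy while keeping ownership and permissions intact (the hostname "pbs14" and the default $PBS_HOME path are placeholders):

    mv /var/spool/pbs /var/spool/pbs.old                                   # on the 18.2.3 server
    rsync -aH --numeric-ids root@pbs14:/var/spool/pbs/ /var/spool/pbs/     # pull $PBS_HOME from the upgraded 14.2 host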

Essentially, we’d want to transfer all of the job histories and PBS configurations from the 14.2 server to the 18.2.3 server. Also, we’d want to stop jobs on the 14.2 server at job number XYZ and startup new jobs on the 18.2.3 server at job number XYZ + 1.
[Answer]: Yes, the above procedure would start the job ids from XYZ + 1.

The key here is keeping a pristine backup of the old version, taken with the PBS services stopped on the server host, so that we can revert if needed.

Good luck

Hello Adarsh,

Your suggestions worked as expected…

However, we are seeing an issue with nodes being removed from the PBS server, such that we have no nodes at all:

[~] # pbsnodes -l
pbsnodes: Server has no node list

I tried re-adding the nodes but they still were getting deleted. Any thoughts on this issue?

-Siji

Thank you Siji

  • Do you have the same /etc/hosts as that of the old server?
  • Do you have a copy of qmgr -c 'p n @default' / qmgr -c 'p s' from the old server?
    - if not, you can create the nodes in a loop (see the example below):
    for i in {1..10}; do qmgr -c "c n node$i"; done

The nodes should be listed in /etc/hosts, or they should be DNS-resolvable by hostname (forward and reverse).
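If you do have the old server's output saved, replaying it is simpler than the loop (the file name is only an example):

    qmgr -c 'p n @default' > nodes_14.2.txt   # run on the old server
    qmgr < nodes_14.2.txt                     # replay on the new server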

Thank you

It looks like the server is deleting the nodes upon restart, and I'm still not sure why. Here's what I see in the server log:

09/19/2019 09:06:10;0004;Server@bright01;Node;nsd01;attributes set: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;nsd01;attributes set: resources_available.ncpus = 1
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;nsd01;attributes unset: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;nsd01;attributes set: queue =
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;license;attributes set: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;license;attributes set: resources_available.ncpus = 1
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;license;attributes unset: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;license;attributes set: queue =
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;globus0001;attributes set: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;globus0001;attributes set: resources_available.ncpus = 1
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;globus0001;attributes unset: at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0004;Server@bright01;Node;globus0001;attributes set: queue =
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0004;Server@bright01;Node;node0116;deleted at request of root@bright01.thunder.ccast
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 9 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0080;Server@bright01;Req;req_reject;Reject reply code=15062, aux=0, type=9, from root@bright01.thunder.ccast
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 0 request received from root@bright01.thunder.ccast, sock=15
09/19/2019 09:06:10;0100;Server@bright01;Req;;Type 49 request received from root@bright01.thunder.ccast, sock=19

qmgr -c 'p s' | grep lic
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000

Any thoughts as to the source of the problem here?

Thank you for posting these logs.

Are you using any cluster manager tools?

Yes, so we use Bright to manage our cluster but we rely on qmgr directly to configure the PBS server.

If I am incorrect, please pardon me, but I think some scripts on the cluster manager side might be doing this. You can run audit software to find out who the culprit is (see the sketch below); PBS Pro OSS would not be doing this by itself at any point in time.
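For example, a minimal auditd sketch (assuming a default /opt/pbs install path, and assuming the changes come through the qmgr/pbsnodes CLIs rather than the API) to see what is issuing the node deletions:

    auditctl -w /opt/pbs/bin/qmgr -p x -k pbs_node_changes
    auditctl -w /opt/pbs/bin/pbsnodes -p x -k pbs_node_changes
    ausearch -k pbs_node_changes -i    # shows which user/process executed them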

You were correct, Adarsh - and we’ve since rectified the issues.

Thanks again for all your suggestions!
