Can't start PBS server (15011 decode_attr_db job_sort_formula)

I have a working OpenPBS installation on a test box. After a reboot, every daemon starts except the server. The error from the server logs is below.

[root@service0 server_logs]# ps -ef | grep pbs
root 56326 1 0 09:28 ? 00:00:00 /opt/pbs/sbin/pbs_comm
root 56341 1 0 09:28 ? 00:00:00 /opt/pbs/sbin/pbs_sched
root 56454 1 0 09:28 ? 00:00:13 /opt/pbs/sbin/pbs_ds_monitor monitor
postgres 56512 1 0 09:28 ? 00:00:00 /usr/bin/postgres -D /p/home/pbs/server/datastore -p 15007

[root@service0 server_logs]# more 20220326
03/26/2022 09:28:55;0002;Server@service0;Svr;Log;Log opened
03/26/2022 09:28:55;0002;Server@service0;Svr;Server@service0;pbs_version=20.0.0
03/26/2022 09:28:55;0002;Server@service0;Svr;Server@service0;pbs_build=mach=N/A:security=N/A:configure_args=N/A
03/26/2022 09:28:55;0002;Server@service0;Svr;Server@service0;hostname=service0.ib0.ice-tds.erdc.hpc.mil;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
03/26/2022 09:28:55;0002;Server@service0;Svr;Server@service0;ipv4 interface lo: localhost4.localdomain4
03/26/2022 09:28:55;0002;Server@service0;Svr;Server@service0;ipv4 interface ib0: service0.ib0.ice-tds.erdc.hpc.mil
03/26/2022 09:28:55;0002;Server@service0;Svr;Server@service0;ipv4 interface bond0: service0.head.ice-tds.erdc.hpc.mil
03/26/2022 09:28:55;0002;Server@service0;Svr;Server@service0;ipv6 interface lo: localhost6.localdomain6
03/26/2022 09:28:55;0002;Server@service0;Svr;Server@service0;ipv6 interface ib0: service0
03/26/2022 09:28:55;0002;Server@service0;Svr;Server@service0;ipv6 interface bond0: service0
03/26/2022 09:28:55;0006;Server@service0;Fil;Server@service0;Version 20.0.0, started, initialization type = 1
03/26/2022 09:28:56;0002;Server@service0;Svr;Server@service0;pbs_status_db exit code 1
03/26/2022 09:28:56;0002;Server@service0;Svr;Server@service0;Starting PBS dataservice
03/26/2022 09:29:00;0002;Server@service0;Svr;Server@service0;connected to PBS dataservice@service0.ib0.ice-tds.erdc.hpc.mil
03/26/2022 09:29:00;0086;Server@service0;Svr;pbs_python_ext_quick_start_interpreter;--> Python Interpreter quick started, compiled with version:'3.6.8 (default, Nov 16 2020, 16:55:22)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]' <--
03/26/2022 09:29:00;0086;Server@service0;Svr;pbs_python_ext_quick_start_interpreter;--> Inserted Altair PBS Python modules dir '/opt/pbs/lib/python/altair' '/opt/pbs/lib/python/altair/pbs/v1'<--
03/26/2022 09:29:00;0086;Server@service0;Svr;pbs_python_ext_quick_shutdown_interpreter;--> Stopping Python interpreter <--
03/26/2022 09:29:00;0d80;Server@service0;TPP;Server@service0(Main Thread);TPP authentication method = resvport
03/26/2022 09:29:00;0c06;Server@service0;TPP;Server@service0(Main Thread);TPP leaf node names = 10.148.0.41:15001,127.0.0.1:15001,10.148.0.41:15001,172.23.0.22:15001
03/26/2022 09:29:00;0d80;Server@service0;TPP;Server@service0(Main Thread);Initializing TPP transport Layer
03/26/2022 09:29:00;0d80;Server@service0;TPP;Server@service0(Main Thread);Max files allowed = 16384
03/26/2022 09:29:00;0d80;Server@service0;TPP;Server@service0(Main Thread);TPP initialization done
03/26/2022 09:29:00;0d80;Server@service0;TPP;Server@service0(Main Thread);Connecting to pbs_comm service0:17001
03/26/2022 09:29:00;0c06;Server@service0;TPP;Server@service0(Thread 0);Thread ready
03/26/2022 09:29:00;0c06;Server@service0;TPP;Server@service0(Thread 0);Registering address 10.148.0.41:15001 to pbs_comm service0:17001
03/26/2022 09:29:00;0c06;Server@service0;TPP;Server@service0(Thread 0);Registering address 172.23.0.22:15001 to pbs_comm service0:17001
03/26/2022 09:29:00;0c06;Server@service0;TPP;Server@service0(Thread 0);Connected to pbs_comm service0:17001
03/26/2022 09:29:00;0002;Server@service0;n/a;setup_env;read environment from /p/home/pbs/server/pbs_environment
03/26/2022 09:29:00;0000;Server@service0;Svr;Server@service0;Supported authentication method: resvport
03/26/2022 09:29:00;0004;Server@service0;Svr;Server@service0;node_fail_requeue value changed to 0
03/26/2022 09:29:00;0001;Server@service0;Svr;Server@service0;PBS server internal error (15011) in decode_attr_db, Action function failed for job_sort_formula attr, errn 15011

Thanks, Chris
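
For a long server log like the one above, a small script can pick out the failing line. This is only a sketch: it assumes each record has six semicolon-separated fields, as the lines in this log do, and the field names (`timestamp`, `event_code`, and so on) are labels chosen here for illustration, not official PBS names.

```python
from typing import Iterable, List, NamedTuple, Optional


class PbsLogRecord(NamedTuple):
    # Field names are illustrative labels, not official PBS terminology.
    timestamp: str
    event_code: str
    daemon: str
    object_type: str
    object_name: str
    message: str


def parse_pbs_log_line(line: str) -> Optional[PbsLogRecord]:
    """Split one log line into six fields; return None for continuation lines."""
    # maxsplit=5 keeps any semicolons inside the message text intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return PbsLogRecord(*parts)


def find_internal_errors(lines: Iterable[str]) -> List[PbsLogRecord]:
    """Return the records whose message reports a PBS server internal error."""
    records = (parse_pbs_log_line(line) for line in lines)
    return [
        rec for rec in records
        if rec is not None and "PBS server internal error" in rec.message
    ]


# Two lines copied from the log in this thread.
sample = [
    "03/26/2022 09:29:00;0004;Server@service0;Svr;Server@service0;"
    "node_fail_requeue value changed to 0",
    "03/26/2022 09:29:00;0001;Server@service0;Svr;Server@service0;"
    "PBS server internal error (15011) in decode_attr_db, Action function "
    "failed for job_sort_formula attr, errn 15011",
]

errors = find_internal_errors(sample)
for rec in errors:
    print(rec.timestamp, rec.event_code, rec.message)
```

Lines that do not split into six fields, such as the wrapped `[GCC ...]` line, are skipped rather than treated as errors.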

@christopher Could you please share the output of qstat -Bf and qmgr -c "p s"?

The server won't start, so I can't run qstat or qmgr. There would be no jobs in the queue, as this system is very lightly used. Unfortunately, I don't have a current qmgr backup, but the configuration is very similar to our other PBSPro systems. This system was changed to OpenPBS in January and has been working since then.

The formula would look something like:

set server job_sort_formula = queue_priority + (0.005 * ncpus) + (0.000277778 * eligible_time) + paws + bias

create resource paws
set resource paws type = float
set resource paws flag = r
set server resources_default.paws = 0
create resource bias
set resource bias type = float
set resource bias flag = r
set server resources_default.bias = 0

Hi,

You can remove the job_sort_formula from the database directly. Keep a copy of PBS home, then run the following steps:

  1. run pbs_ds_password to set a known password for the database
  2. Start postgres by running pbs_dataservice start
  3. Login to the database using psql -p 15007 -d pbs_datastore -U postgres
  4. select * from pbs.server;
  5. update pbs.server set attributes = delete(attributes, 'job_sort_formula');
  6. Stop database using pbs_dataservice stop
  7. Restart pbs services
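
As a rough sketch, steps 3 and 5 could be combined into one non-interactive psql call by passing the statement with psql's `-c` option. The helper names below (`build_delete_attr_sql`, `build_psql_argv`) are hypothetical, and the script only prints the command rather than running it, so it is safe to try without a database. The port, database name, and user are the ones from the steps above.

```python
import shlex
from typing import List


def build_delete_attr_sql(attr_name: str) -> str:
    """Build the UPDATE from step 5, using plain ASCII single quotes."""
    # Accept only simple attribute names so nothing unexpected lands in the SQL.
    if not attr_name.replace("_", "").isalnum():
        raise ValueError(f"unexpected attribute name: {attr_name!r}")
    return (
        "update pbs.server set attributes = "
        f"delete(attributes, '{attr_name}');"
    )


def build_psql_argv(sql: str, port: int = 15007,
                    database: str = "pbs_datastore",
                    user: str = "postgres") -> List[str]:
    """Build the psql argument list from step 3, plus -c to run one statement."""
    return ["psql", "-p", str(port), "-d", database, "-U", user, "-c", sql]


sql = build_delete_attr_sql("job_sort_formula")
argv = build_psql_argv(sql)

# Print a shell-quoted command line for review; nothing is executed here.
print(shlex.join(argv))
```

Keeping the copy of PBS home mentioned above remains the safety net. Review the printed command before running it against the datastore.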

For the specific job_sort_formula that is causing the server restart failure, you could submit a bug to be fixed by the community.

Regards,
Subhasis

Thank you Subhasis. From “select *” I see my formula as:

"job_sort_formula"=>"11.queue_priority+(0.005*ncpus)+(0.000277778*eligible_time)+paws+bias"

However the delete doesn’t work:

pbs_datastore=# update pbs.server set attributes = delete(attributes, ‘job_sort_formula’);
ERROR: column “‘job_sort_formula’” does not exist
LINE 1: …te pbs.server set attributes = delete(attributes, ‘job_sort_…
^
pbs_datastore=# update pbs.server set attributes = delete(attributes, “job_sort_formula”);
ERROR: column “job_sort_formula” does not exist
LINE 1: …te pbs.server set attributes = delete(attributes, "job_sort_…
^
pbs_datastore=# update pbs.server set attributes = delete(attributes, job_sort_formula);
ERROR: column “job_sort_formula” does not exist
LINE 1: …te pbs.server set attributes = delete(attributes, job_sort_f…
^
Chris

This is probably a copy-and-paste error: special (curly) quotes were inserted around job_sort_formula. If we type the statement by hand, we see this debug message instead:

ERROR: function delete(public.hstore, unknown) does not exist
LINE 1: update pbs.server set attributes = delete(attributes,'job_so…
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
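
To guard against this kind of paste damage, pasted SQL can be run through a small filter before it reaches psql. This is a minimal sketch; `straighten_quotes` is a hypothetical helper name, and it handles only the four common curly quote characters.

```python
# Map typographic (curly) quotes to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(QUOTE_MAP)


# The statement as it appears when copied with curly quotes.
pasted = (
    "update pbs.server set attributes = "
    "delete(attributes, \u2018job_sort_formula\u2019);"
)
fixed = straighten_quotes(pasted)
print(fixed)
```

PostgreSQL reads curly quotes as part of an identifier rather than as string delimiters, which is why the pasted statement was reported as a missing column.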

Ahh, requires single quotes but not Microsoft quotes. LOL. Thanks.

The server is started now.

How do I submit a bug report?

Could this be related? "Crash on server restart if job_sort_formula is set" (Issue #2475, openpbs/openpbs on GitHub)

I submitted a new bug on this, but #2475 is the same issue. It was opened Aug 2021. How can the formula be broken for almost a year?

Thanks, Chris

I played around with this a little. The following hack allows the server to come all the way up, giving you a chance to fix any issues with job_sort_formula. It is not a complete fix in that it does not detect a bad formula when reloading from the database (but, if the database is corrupted, you have other issues). I did not find any other problems in my (limited) testing.

diff --git a/src/server/sched_func.c b/src/server/sched_func.c
index d6b4f285..2c69c372 100644
--- a/src/server/sched_func.c
+++ b/src/server/sched_func.c
@@ -162,8 +162,13 @@ validate_job_formula(attribute *pattr, void *pobject, int actmode)
                        return PBSE_SVR_SCHED_JSF_INCOMPAT;
        }
 
-       if (!Py_IsInitialized())
+       if (!Py_IsInitialized()) {
+               if (get_sattr_long(SVR_ATR_State) == SV_STATE_INIT) {
+                       /* Cannot validate during initialization */
+                       return PBSE_NONE;
+               }
                return PBSE_INTERNAL;
+       }
 
        globals1 = malloc(globals_size1);
        if (globals1 == NULL) {

Altair just merged a fix for this: "Fix server not starting with job_sort_formula set" by bayucan (Pull Request #2533, openpbs/openpbs on GitHub)