PP-578: server memory leak for hooks load test

This message is to inform developers that there is a new design document available for review:

https://pbspro.atlassian.net/wiki/display/PD/PP-578%3A+server+memory+leak+for+hooks+load+test

Please provide your comments, suggestions, and feedback to the community.

Thanks,

Mike

Given the problems experienced with embedding Python, I like the idea of providing these controls.

However, PBS Pro has many (too many?) places for configuration parameters, including:

    • Command line arguments
    • Environment variables
    • Startup scripts (although these usually leverage the above methods)
    • pbs.conf
    • Other config files (for MOMs and Schedulers)
    • qmgr

The design suggests these new config parameters be set via #2, #3, and #4 (only). I’m actually on the fence about this. On the one hand, as experimental parameters, putting them in the least intrusive place possible makes sense. On the other hand, there has been an ongoing desire to make qmgr the single point for configuration. (Qmgr has the advantage of being able to do error checking, perform logging of all changes, and cause new configuration values to actually get used right away. The only thing that can’t be in qmgr is how to get to qmgr data – the networking).

I’d love to see us getting rid of non-qmgr settings rather than adding more.

Hi Bill,

I completely agree, and have been advocating for the migration of (nearly) all configuration parameters to the DB via qmgr (as you suggested) for some time. The only thing the services really need to know is how to “phone home” to get their configuration. Once obtained, it can be cached locally as a fallback if the server goes down. I was hoping to use some undocumented environment variables to address the bug at hand, but the environment has been lost by the time we check whether Python needs a restart. Our standard configuration variables were the only fallback. While I agree we need a better mechanism, this is the status quo until such a mechanism exists.

Thanks,

Mike

Hey Mike,

/proc/self/statm is a linux file. What about windows? This this only a linux message?

Bhroam

Unless we need them for the python interpreter at the mom also, one option could be to add them as server attributes, something like qmgr -c “set server PBS_PYTHON_RESTART_MAX_HOOKS=xxx” - since they really alter the server’s behavior?

Hi Bhroam,

Correct, on systems where /proc/self/statm is not present the actual data will be replaced with “unknown”.

Mike

Hi Subhasis,

Yes, we could add them as server attributes rather than configuration variables. Longer term, all of this should go away once the PBS/Python integration gets fixed such that Python can perform garbage collection properly. I was taking the path of least resistance by adding them as config vars. I can change them if you (or others) feel strongly about this.

While unit testing I was updating the pbs_server script to set environment variables prior to exec’ing pbs_server.bin, in addition to altering the pbs.conf file.

Thanks,

Mike

Could you add that if the memory usage is not available, “Unknown” will be printed to Interface 8?

Bhroam

Hi Mike,

I think that it would be better to add them as server attributes. The qmgr command prvovides a clean, consistent, and easy to manange/test interface for whatever amount of time we are supporting this interface. The functionality may go away in the short to near term, but we cannot count on that. And, even if it does go away in the near term, an implementation in qmgr still takes us in the direction that we need to go in terms of unifying the interface.

  -Ian

Hi Ian,

The implementation is complete and signed off at this point. I’ll defer to Scott C. as to whether I should spend more time on this.

Mike

Hi Bhroam,

Done.

Mike

Hi PBS Pro Maintainer…

I’m worried about the precedent that is being set here: design published on Mon at 11a followed by a declaration on Tue at 3p (the next day) that “it’s final”.

I would strongly encourage the maintainers to require a longer design review period and provide the community with at least a few business days for comments and discussion before declaring a design final.

Thanks!

Regarding the design itself – I’m OK with environment variables… as long as we are OK with new contributions also using environment variables for similar settings.

@mkaro I would suggest that we change the implementation to use qmgr. That way we set the direction for future. I am not particularly averse to the current implementation, however, I would hate to make this implementation a norm for every other implementation till the day we fix the Hooks/PBS interface (which could be a bit far away).

@billnitzberg I do share your concern. We need to work out a way via which we can say that a design is actually seen and agreed upon before we finalize the implementation.

The design document has been updated to reflect the comments and suggestions. The implementation and corresponding pull request has also been updated. Please review and provide your feedback in as timely a manner as possible.

1 Like

Sounds like an excellent task for a small architectural review committee. :grin:

2 Likes

After some discussion in the office about what is public and what is private, I’m going to make a comment. Mike, do you want your log messages Public? I feel a lot of log messages should be Private/Unstable unless there is a good reason to do otherwise. If you make these Public, they are more difficult to change in the future. Of course they are experimental, so that means they are more easily changed than if they’re stable. If you feel like they should be public, then keep them there.

One reason to make a log message public and stable or experimental would be if it were going to be used in the PTL tests being written to cover the bug fix.

It’s always been my impression that anything visible to an admin should be made Public. That said, I’ll be glad to make them private if there are no objections.

Hey Mike, few comments.

  1. Define the type of those variables if they take integer or long values. Is 0 an acceptable value?
  2. Is there a maximum range for these values?
  3. What happens if admin set one or all of them to 0 (I understand it is unlikely) but do we have a control of how often python interpreter would get re-started. Also would there be a warning message for such scenarios?