Kerberos support

There seems to be Kerberos integration code in the code base (guarded behind PBS_CRED_DCE_KRB5), but I wasn’t able to find any configure switches to actually enable the code.

Could you direct me to the correct configure switch to enable this code?

Kerberos was a supported feature many moons ago. It’s quite likely that it could be made to work again, but it would take some effort. It would certainly be possible to add a new macro under the m4 directory to define the PBS_CRED_DCE_KRB5 flag, but that’s just where the fun begins. DCE used an extended version of Kerberos as a ticket-based authentication mechanism on which services like DCE/DFS were built. I don’t think there is anything in the code that is DCE/DFS specific, but if there is, we should remove it. Of course, unit testing this type of setup would also take a significant amount of effort. The gauntlet has been thrown down. Would you care to answer the challenge?

Well, we have full Kerberos support for Torque. It’s used in a sizeable production environment.

We will have to either port it to PBSPro or adjust the current code in PBSPro. I was mainly trying to figure out the status of the code I’m seeing.

The current Kerberos code could certainly use some work. We last tested the Kerberos functionality several years back.

See include/libsec.h:
#define STD    0 /* standard PBS security (pbs_iff program) */
#define KAUTH  1 /* kerberized PBS with authentication only */
#define KCRYPT 2 /* kerberized with authentication, encryption */

The encryption-related code is probably somewhat incomplete, but the authentication piece should work between the client commands and the server with minor modifications.

However, we recently replaced the UDP-based RPP protocol between the server and the moms with fully TCP-based communication (called TPP), and for TPP we implemented authentication using munge.

So the bigger piece of work would be implementing Kerberos authentication/encryption over the TPP protocol.

I made some notes on the Kerberos code several years back when I last worked on it; I can probably dig them up if you are interested.

While going through the code, I also discovered PBS_CRED_GRIDPROXY, which again does not seem to be hooked up to any configure option.

Is this used by anybody, or should I just merge all the Kerberos code into a cleaned-up, comprehensive implementation (which probably won’t support all the use cases)?

@HappyCerberus, AFAIK the Kerberos option is not used by anybody in recent versions of PBS. I think the community would love to have the Kerberos code merged and cleaned up into a comprehensive implementation. It would be fine to start with some minimal use cases and expand over time.

Regards,
Subhasis

Hello,

I have almost finished full Kerberos support in pbspro. We have the GSS layer over TCP (the client-server batch protocol; this part is already well tested and in production) and over TPP (IS and IM; some work is still in progress). It could potentially be used for certificates too.

The users’ Kerberos credentials are centralized. This means that the PBS server obtains the user’s credentials and passes them to the mother superior, which in turn passes them to the sister moms. Of course, the user’s credentials are refreshed after a configurable time.

The PBS server obtains the credentials from a configurable tool. This tool is NOT part of pbspro (our own tool is open source, of course), and you can use your own. The tool is only expected to write the credentials, base64-encoded, to stdout.

Are you interested in merging this upstream? Should I start working on the design document? My biggest concern is that the tool for obtaining credentials is external. Is that OK with you?

FYI: The implementation ignores the current gss/krb/cert implementation in pbspro.

Vasek

Hello @vchlum,

We’ll need to inspect the changes you are making. Is there a way for us to do so prior to your opening a pull request? It might speed things up.

Just curious… does your implementation handle refreshing/renewing credentials so they don’t expire? That was something I had to implement for DCE support in NQS many years ago.

Thanks,

Mike

Hi @vchlum

We are certainly very interested in taking this forward. As @mkaro suggested, could you please share your fork/branch so that we can inspect it and try it out? It would also help if you could describe the flow in a simple way; plain text is fine (with references to the code). Since it is probably a large change, any such explanation would make the code easier to review.

Subhasis

@mkaro, @subhasisb the ongoing work can be seen here: https://github.com/CESNET/pbspro/tree/kerberos_support_2

Yes, credential renewal is included, but the credentials themselves are provided by an external tool, which is not part of pbspro or of this work.

I’ll try to provide some comments, split into three parts because there are three commits on the branch. I am sorry the code is still poorly commented; I did not expect to share it today :)

FYI: We use Heimdal Kerberos, but it should work with MIT Kerberos as well; I just haven’t tested it against MIT yet.


Commit ‘Kerberos support’


The first part, ‘Kerberos support’, comes from my former colleague. We used it in Torque before we switched to pbspro, so this whole commit is a port from Torque, and we use it in production now. This commit contains the GSS layer over TCP, which means that the connections between the server and qsub, qstat, pbsnodes, … are encrypted. It also contains credential renewal by the mom itself (this renewal mechanism will be replaced).

GSS over TCP: In __pbs_connect_extend(), a new batch request (PBS_BATCH_GSSAuthenUser) is sent. This batch request is followed by direct client<->server communication in pbsgss.c:pbsgss_client_authenticate() on the client side and pbsgss.c:req_gssauthenuser() on the server side, where the GSS handshake is done. After that, the GSS context is bound to the socket by tcp_dis.c:DIS_tcp_set_gss(). tcp_dis.c has new implementations of DIS_tcp_wflush() (if a context is present, gss_wrap() is applied) and of tcp_read() (if a context is present, gss_unwrap() is applied).
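To make the wrap-on-flush / unwrap-on-read idea concrete, here is a minimal sketch. It is not the code from tcp_dis.c: the `sec_ctx` type, the callbacks and the 4-byte length prefix are assumptions made for illustration, and the toy XOR transform merely stands in for gss_wrap()/gss_unwrap() so the example runs without a KDC.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a GSS context bound to a socket. In real
 * code wrap/unwrap would call gss_wrap()/gss_unwrap() on a gss_ctx_id_t;
 * callbacks are used here so the sketch runs without Kerberos. */
typedef int (*xform_fn)(const unsigned char *in, size_t len,
                        unsigned char **out, size_t *out_len);
typedef struct {
    xform_fn wrap;
    xform_fn unwrap;
} sec_ctx;

/* Flush path: without a context the payload is copied as-is; with one,
 * it is wrapped and prefixed with a 4-byte big-endian length (assumed
 * framing). *frame is malloc'ed; returns 0 on success. */
int flush_frame(const sec_ctx *ctx, const unsigned char *data, size_t len,
                unsigned char **frame, size_t *frame_len)
{
    unsigned char *body;
    size_t body_len;

    if (ctx == NULL) {
        *frame = malloc(len ? len : 1);
        if (*frame == NULL) return -1;
        memcpy(*frame, data, len);
        *frame_len = len;
        return 0;
    }
    if (ctx->wrap(data, len, &body, &body_len) != 0) return -1;
    *frame = malloc(body_len + 4);
    if (*frame == NULL) { free(body); return -1; }
    (*frame)[0] = (unsigned char)(body_len >> 24);
    (*frame)[1] = (unsigned char)(body_len >> 16);
    (*frame)[2] = (unsigned char)(body_len >> 8);
    (*frame)[3] = (unsigned char)body_len;
    memcpy(*frame + 4, body, body_len);
    *frame_len = body_len + 4;
    free(body);
    return 0;
}

/* Read path: without a context the bytes are the payload; with one,
 * check the length prefix and unwrap the body. */
int read_frame(const sec_ctx *ctx, const unsigned char *frame,
               size_t frame_len, unsigned char **data, size_t *data_len)
{
    uint32_t n;

    if (ctx == NULL) {
        *data = malloc(frame_len ? frame_len : 1);
        if (*data == NULL) return -1;
        memcpy(*data, frame, frame_len);
        *data_len = frame_len;
        return 0;
    }
    if (frame_len < 4) return -1;
    n = ((uint32_t)frame[0] << 24) | ((uint32_t)frame[1] << 16) |
        ((uint32_t)frame[2] << 8) | (uint32_t)frame[3];
    if ((size_t)n != frame_len - 4) return -1;
    return ctx->unwrap(frame + 4, n, data, data_len);
}

/* Toy transform for the demo only (XOR with 0x5A): NOT encryption. */
int toy_xor(const unsigned char *in, size_t len,
            unsigned char **out, size_t *out_len)
{
    *out = malloc(len ? len : 1);
    if (*out == NULL) return -1;
    for (size_t i = 0; i < len; i++) (*out)[i] = in[i] ^ 0x5A;
    *out_len = len;
    return 0;
}
```

The point of this shape is that callers use the same flush/read entry points whether or not a context has been bound to the socket.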

Credential renewal by the mom: In this commit, credential renewal depends on our own library [1]. (This dependency will be removed before the PR is raised.) All the procedures related to renewal are in resmom/renew.c. Renewal is done by a forked process (one process per job), which is forked by renew.c:start_renewal() from start_exec.c:start_process() and start_exec.c:finish_exec(). This process renews the credentials before they expire by calling the krb525 library [1]. AFS logging is included.
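The per-job process has to decide when to wake up. A minimal sketch of that timing decision, assuming renewal happens a fixed safety margin before the expiry time (the function name and the margin parameter are hypothetical, not taken from renew.c):

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper for a per-job renewal process: seconds to sleep
 * before renewing credentials that expire at valid_until, renewing
 * margin seconds early. Returns 0 when renewal is already due
 * (including when the credentials have already expired). */
long seconds_until_renewal(time_t now, time_t valid_until, long margin)
{
    time_t renew_at = valid_until - (time_t)margin;

    if (renew_at <= now)
        return 0;
    return (long)(renew_at - now);
}
```

A loop in the forked process would sleep for this many seconds, renew (via krb525 in this commit), and recompute the delay from the new expiry time.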


Commit ‘Centralized kerberos support’


This commit removes the dependency on the krb525 library, though the krb525 fallback is still present.

The server has new attributes configurable via qmgr:
cred_renew_enable - enables the server to send credentials to the moms
cred_renew_tool - sets the renewal tool; this external tool provides the credentials in base64
cred_renew_period - after this time, credentials are sent from the PBS server to the moms
cred_renew_cache - after this time, credentials are renewed using cred_renew_tool (credentials are cached in the server’s memory, per principal)
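cred_renew_cache implies an in-memory cache keyed by principal. A rough sketch of such a lookup, assuming a fixed-size table and a `fetch` callback that stands in for running cred_renew_tool (all names here are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical cache entry: one per principal. */
struct cred_cache_entry {
    char principal[256];   /* empty string marks a free slot */
    char *cred_b64;        /* base64 credentials from the tool */
    time_t fetched_at;     /* when the tool last ran for this principal */
};

/* Stands in for running cred_renew_tool: returns malloc'ed base64 or NULL. */
typedef char *(*fetch_fn)(const char *principal);

/* Return cached credentials for principal, re-running fetch() when the
 * entry is missing or at least cache_secs (cred_renew_cache) old.
 * Returns NULL for a bad principal, a full table, or a failed fetch. */
const char *get_cached_cred(struct cred_cache_entry *tab, size_t n,
                            const char *principal, time_t now,
                            long cache_secs, fetch_fn fetch)
{
    struct cred_cache_entry *e = NULL, *free_slot = NULL;
    char *fresh;

    if (principal[0] == '\0' || strlen(principal) >= sizeof tab[0].principal)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].principal[0] == '\0') {
            if (free_slot == NULL) free_slot = &tab[i];
        } else if (strcmp(tab[i].principal, principal) == 0) {
            e = &tab[i];
            break;
        }
    }
    if (e != NULL && e->cred_b64 != NULL && now - e->fetched_at < cache_secs)
        return e->cred_b64;                 /* fresh enough: no tool run */
    if (e == NULL) {
        if (free_slot == NULL) return NULL; /* table full */
        e = free_slot;
        strcpy(e->principal, principal);
        e->cred_b64 = NULL;
    }
    fresh = fetch(principal);
    if (fresh == NULL) return NULL;
    free(e->cred_b64);
    e->cred_b64 = fresh;
    e->fetched_at = now;
    return e->cred_b64;
}

/* Toy fetch for the demo: counts calls and returns a fixed string. */
int demo_fetch_calls = 0;
char *demo_fetch(const char *principal)
{
    char *s = malloc(9);
    (void)principal;
    demo_fetch_calls++;
    if (s != NULL) memcpy(s, "dGVzdA==", 9);
    return s;
}
```

Caching per principal means that many jobs of the same user share one tool invocation per cache interval.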

The credentials are provided on stdout by an external tool (cred_renew_tool). We use our krb525 tool [1] for this, but you can use your own. The tool’s stdout is expected to look like this:

torque1:~# ./krb525_renew vchlum@META
Type: Kerberos
Valid until: 1536960989
doICPDCCAjigAwI<... just base64 here ...>GqAjAA
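On the server side, reading that output comes down to a small parser. A sketch that accepts exactly the three-line layout shown above (the struct and function names are hypothetical, and the real server code may be stricter or more lenient):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Parsed cred_renew_tool output (hypothetical struct). */
struct tool_cred {
    char type[32];       /* e.g. "Kerberos" */
    time_t valid_until;  /* Unix time from "Valid until:" */
    char *b64;           /* malloc'ed base64 payload */
};

/* Parse "Type: ...\nValid until: ...\n<base64>" from a NUL-terminated
 * buffer holding the tool's stdout. Returns 0 on success, -1 otherwise. */
int parse_tool_output(const char *buf, struct tool_cred *out)
{
    const char *p = buf, *nl;
    char *end;
    long long t;
    size_t n;

    memset(out, 0, sizeof *out);

    /* Line 1: "Type: <name>" */
    if (strncmp(p, "Type: ", 6) != 0)
        return -1;
    p += 6;
    nl = strchr(p, '\n');
    if (nl == NULL || nl == p || (size_t)(nl - p) >= sizeof out->type)
        return -1;
    memcpy(out->type, p, (size_t)(nl - p));
    p = nl + 1;

    /* Line 2: "Valid until: <seconds since the epoch>" */
    if (strncmp(p, "Valid until: ", 13) != 0)
        return -1;
    p += 13;
    t = strtoll(p, &end, 10);
    if (end == p || *end != '\n')
        return -1;
    out->valid_until = (time_t)t;
    p = end + 1;

    /* Line 3: the base64 credentials, up to the end of the line. */
    n = strcspn(p, "\r\n");
    if (n == 0)
        return -1;
    out->b64 = malloc(n + 1);
    if (out->b64 == NULL)
        return -1;
    memcpy(out->b64, p, n);
    out->b64[n] = '\0';
    return 0;
}
```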

Periodic credential renewal is done on the server by the work task svr_credfunc.c:svr_renew_creds(). This function iterates over all jobs and checks the new job attribute JOB_ATR_cred_validity; once the time (cred_renew_period) is up, fresh credentials are sent to the job’s mom using the new batch request PBS_BATCH_Cred (see req_cred.c:send_cred()). The mom processes this batch request in request.c:req_cred() and forwards the credentials to the sisters via the new IM request IM_CRED in renew.c:send_cred_sisters().
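The per-job check in a work task like svr_renew_creds() can be pictured as a single pass over the jobs. In this sketch, `struct job_cred_view` is a hypothetical flattened view, and it assumes that JOB_ATR_cred_validity holds the expiry time of the credentials the mom currently has and that fresh ones are due once fewer than cred_renew_period seconds remain; the thread does not spell out the exact rule.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical flattened view of a job for this sketch. */
struct job_cred_view {
    const char *jobid;
    time_t cred_validity;  /* assumed: expiry of the creds at the mom */
    int needs_send;        /* output of mark_jobs_for_renewal() */
};

/* One pass of a periodic work task: flag jobs whose credentials expire
 * within renew_period seconds (assumed reading of cred_renew_period).
 * Returns how many were flagged; the caller would then send
 * PBS_BATCH_Cred to each flagged job's mom. */
size_t mark_jobs_for_renewal(struct job_cred_view *jobs, size_t n,
                             time_t now, long renew_period)
{
    size_t due = 0;

    for (size_t i = 0; i < n; i++) {
        jobs[i].needs_send = (jobs[i].cred_validity - now <= renew_period);
        if (jobs[i].needs_send)
            due++;
    }
    return due;
}
```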

The krb525 forked process from the first commit, ‘Kerberos support’, is removed; the work task krb525_fallback_renewal() replaces it. We need the fallback for a smooth switch to centralized Kerberos.

We still need a separate process for each job because of AFS logging and the PAG (process authentication group): the PAG must be set in that same forked process for it to work. See renew.c:start_afslog(). After the credentials are renewed, the AFS log process runs krb5_afslog() when signaled by the mom.
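The signal-driven AFS log process can be sketched with plain POSIX calls. Assumptions: SIGUSR1 is the trigger and `refresh` is a callback that, in real code, would call krb5_afslog() inside the PAG set up at process start; neither detail is taken from renew.c. Blocking the signal before fork() avoids a race where the mom signals the child before it is ready.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t refresh_requested = 0;

static void on_refresh(int sig)
{
    (void)sig;
    refresh_requested = 1;
}

/* Mom side, BEFORE fork(): install the handler and block SIGUSR1 so a
 * signal sent right after fork() stays pending in the child instead
 * of killing it with the default action. Saves the previous mask. */
int afslog_prepare(sigset_t *old_mask)
{
    struct sigaction sa;
    sigset_t block;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_refresh;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    return sigprocmask(SIG_BLOCK, &block, old_mask);
}

/* Child side: wait for SIGUSR1 and call refresh() once per signal
 * (in real code: krb5_afslog() within the PAG set at start-up).
 * Returns after max_refreshes so the sketch terminates. Signals sent
 * while one is still pending are merged, as usual for POSIX signals. */
int afslog_loop(const sigset_t *old_mask, void (*refresh)(void),
                int max_refreshes)
{
    int done = 0;

    while (done < max_refreshes) {
        sigsuspend(old_mask);  /* atomically unblock and wait */
        if (refresh_requested) {
            refresh_requested = 0;
            if (refresh != NULL)
                refresh();
            done++;
        }
    }
    return done;
}
```

Usage: the mom calls afslog_prepare(), forks, the child sets its PAG and enters afslog_loop(), and after each renewal the mom sends kill(child_pid, SIGUSR1).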


Commit ‘gss tpp’


Overall, GSS over TPP is similar to GSS over TCP. After rpp_open() on the client side, a new IS (IS_GSS_ESTABLISH_CONTEXT) or IM (IM_GSS_ESTABLISH_CONTEXT) message is sent and tpp_gss.c:tppgss_client_authenticate() is run. On the server side (the mom is the server here), tpp_gss.c:tppgss_server_authenticate() is run and the context is established. I still have a problem to solve here, because reading from TPP is non-blocking, which also causes an issue with unwrapping encrypted messages. This is ongoing work.
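One common way to cope with non-blocking reads is to buffer bytes until a whole token has arrived and only then hand it to GSS. A sketch, assuming a 4-byte big-endian length prefix on each token (that framing and the `token_acc` names are assumptions; TPP’s own framing may differ):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical accumulator for one stream. */
struct token_acc {
    unsigned char buf[65536];
    size_t used;
};

/* Append bytes returned by a non-blocking read; -1 on overflow. */
int acc_feed(struct token_acc *a, const unsigned char *data, size_t len)
{
    if (len > sizeof a->buf - a->used)
        return -1;
    memcpy(a->buf + a->used, data, len);
    a->used += len;
    return 0;
}

/* If a complete token is buffered, copy it into out (capacity cap),
 * drop it from the buffer and return its length. Return 0 when more
 * bytes are needed, -1 for a malformed or oversized token. Only a
 * complete token would be passed to gss_accept_sec_context() or
 * gss_unwrap(). */
long acc_take(struct token_acc *a, unsigned char *out, size_t cap)
{
    uint32_t n;

    if (a->used < 4)
        return 0;
    n = ((uint32_t)a->buf[0] << 24) | ((uint32_t)a->buf[1] << 16) |
        ((uint32_t)a->buf[2] << 8) | (uint32_t)a->buf[3];
    if (n == 0 || n > sizeof a->buf - 4 || n > cap)
        return -1;
    if (a->used < 4 + (size_t)n)
        return 0;
    memcpy(out, a->buf + 4, n);
    memmove(a->buf, a->buf + 4 + n, a->used - 4 - n);
    a->used -= 4 + (size_t)n;
    return (long)n;
}
```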

On the client side, the ccache (needed by gss_acquire_cred()) is created, or renewed if needed, from the keytab by init_pbs_ccache_from_keytab(), so you need a keytab on all nodes.

I am available for questions. Every comment is much appreciated. Thank you,

Vasek

[1] https://github.com/CESNET/krb525

One more thing to think about concerning GSS over TPP: since direct communication is not possible on TPP, I rewrote the handshake ported from TCP. Inspired by setup_gss() and IS_GSS_HANDSHAKE, I implemented the handshake as an exchange of tokens using a new type of request (GSS_HANDSHAKE, which is actually a new type of ‘protocol’, like IM, IS, …). This new request type is used only for the GSS handshake. My concern is that some messages will probably be sent in cleartext before the handshake is complete.

Thanks again Vasek, this is an impressive body of work. WRT the handshake, we’ll need to figure out which messages are sent in cleartext and whether they pose any security risk. If they do, we’ll need to employ SSL or something similar. IIRC, all communication with the KDC is encrypted, as is the TGT that the KDC issues, so I don’t think it will pose a problem. I haven’t had time to review all the code yet. Roughly what time frame do you expect for submitting your PR for review? We’ll need to reserve sufficient developer/tester resources for review and testing, which also means we’ll need a Kerberos setup (likely MIT-based) available. We’ll need some lead time for that as well.

With the latest commit, I would consider GSS over TPP done. I have added a new attribute, ‘wrapped’, at the beginning of every message sent via TPP. It indicates whether the message is encrypted, and it is needed mainly to distinguish multicast messages, which are not (and cannot be) encrypted.
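A sketch of the leading ‘wrapped’ flag, assuming it is a single byte (the actual encoding in the branch may differ, and all names below are hypothetical). A GSS context is established between two peers, so a message multicast to many moms presumably cannot be wrapped with one context; that is why the sender forces the plain flag for multicast.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical encoding of the leading 'wrapped' byte. */
enum { MSG_PLAIN = 0, MSG_WRAPPED = 1 };

/* Sender: wrap only if a GSS context exists for this peer and the
 * message is not multicast (multicast is never encrypted). */
int choose_wrapped_flag(int have_ctx, int multicast)
{
    return (have_ctx && !multicast) ? MSG_WRAPPED : MSG_PLAIN;
}

/* Prefix the payload (already wrapped or plain, matching flag) with
 * the flag byte. Returns a malloc'ed buffer of len + 1 bytes, or NULL. */
unsigned char *add_flag(const unsigned char *payload, size_t len, int flag)
{
    unsigned char *msg = malloc(len + 1);

    if (msg == NULL)
        return NULL;
    msg[0] = (unsigned char)flag;
    memcpy(msg + 1, payload, len);
    return msg;
}

/* Receiver: 1 = unwrap before decoding, 0 = plain,
 * -1 = empty message or unknown flag value. */
int is_wrapped(const unsigned char *msg, size_t len)
{
    if (len < 1)
        return -1;
    if (msg[0] == MSG_WRAPPED)
        return 1;
    if (msg[0] == MSG_PLAIN)
        return 0;
    return -1;
}
```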

@mkaro I believe I can prepare the PR within several weeks; my rough guess is 3-5 weeks. Does that work for you?

Concerning cleartext before the handshake is finished: maybe it is possible to hold the messages in a new buffer until the connection is secured. I will soon test which messages are actually sent in cleartext, and I need to deny/drop/reject unsecured messages carrying credentials anyway. It is at the top of my to-do list.
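The two ideas above (holding outgoing messages until the stream is secured, and refusing credentials that arrive in cleartext) could look roughly like this. Everything here is hypothetical: the message classes, the fixed-size hold queue and the `send_fn` callback are illustrations, not TPP APIs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_HELD 16

enum { MSG_OTHER = 0, MSG_CRED = 1 };  /* hypothetical message classes */

struct held_msg {
    int type;
    unsigned char *data;
    size_t len;
};

struct stream_state {
    int secured;                      /* set when the handshake completes */
    struct held_msg held[MAX_HELD];   /* messages waiting for security */
    size_t nheld;
};

typedef void (*send_fn)(int type, const unsigned char *data, size_t len);

/* Outgoing: send at once if secured, else keep a copy. Returns 0 if
 * sent or held, -1 if the queue is full or memory runs out. */
int stream_send(struct stream_state *s, int type,
                const unsigned char *data, size_t len, send_fn do_send)
{
    unsigned char *copy;

    if (s->secured) {
        do_send(type, data, len);
        return 0;
    }
    if (s->nheld == MAX_HELD)
        return -1;
    copy = malloc(len ? len : 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, data, len);
    s->held[s->nheld].type = type;
    s->held[s->nheld].data = copy;
    s->held[s->nheld].len = len;
    s->nheld++;
    return 0;
}

/* Handshake finished: mark secured and flush held messages in order. */
void stream_secured(struct stream_state *s, send_fn do_send)
{
    s->secured = 1;
    for (size_t i = 0; i < s->nheld; i++) {
        do_send(s->held[i].type, s->held[i].data, s->held[i].len);
        free(s->held[i].data);
    }
    s->nheld = 0;
}

/* Incoming: never accept credentials on an unsecured stream.
 * Returns 1 to accept, 0 to drop. */
int accept_incoming(const struct stream_state *s, int type)
{
    return s->secured || type != MSG_CRED;
}

/* Toy sender for the demo: only counts messages. */
int demo_sent = 0;
void demo_send(int type, const unsigned char *data, size_t len)
{
    (void)type; (void)data; (void)len;
    demo_sent++;
}
```

A bounded queue keeps a stalled handshake from growing memory without limit; when it is full, the caller has to decide whether to drop the message or tear down the stream.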

Hi @vchlum - many thanks!

I have started going through the changes in your repo in an attempt to understand how they work, and I have a very preliminary question.

When we added munge authentication to PBS, I added a function called engage_external_authentication (called from engage_authentication), which sends a message of type “PBS_BATCH_AuthExternal” (this message has subfields that tell the server which type it is, etc.). Your current code seems to add “PBS_BATCH_GSSAuthenUser” as a new type, handled separately from this (attempted) generic mechanism.

Did you have a specific need not to use (or extend) this “generic” structure? If not, would you like to extend it and add Kerberos the same way as munge? (Of course, I envision that you would then need to make changes to that message/exchange itself, but we would still have a common way to do authentication.)

I have not yet gone over your TPP-related changes, but one bug we found recently was that our munge authentication overlooked the fact that TPP introduces concurrency into the code (multiple threads might be doing authentication). You have probably already taken care of concurrency, but I’m mentioning it just in case.

I will keep looking.

@subhasisb The first commit is a port from Torque. There was no authenticate_external() in Torque, we use this commit in production, and to be honest I would like to provide backward compatibility, which I understand can be problematic. That is the real reason the generic structure is not used.

Anyway, I will need to move the krb525-related code to a separate branch before raising the PR, and I can do the same with PBS_BATCH_GSSAuthenUser (and provide backward compatibility in our production for a limited time). I agree with you that using the generic structure is better, and AFAICT there are no obstacles to doing it. I will add something like AUTH_KRB to authenticate_external().
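Folding Kerberos into the generic path mostly means one more entry in whatever dispatch authenticate_external() performs. A table-driven sketch, where the enum values, handler names and signatures are all hypothetical placeholders rather than the PBS code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical method ids carried in the generic external-auth request. */
enum auth_method { AUTH_MUNGE = 1, AUTH_KRB = 2 };

/* Handlers return 0 on success; bodies are placeholders. */
typedef int (*auth_handler)(int sock);
int munge_handler(int sock) { (void)sock; return 0; } /* munge exchange */
int krb_handler(int sock)   { (void)sock; return 0; } /* GSS handshake */

struct auth_entry {
    enum auth_method method;
    const char *name;
    auth_handler handler;
};

/* Adding Kerberos is one extra row next to munge. */
static const struct auth_entry auth_table[] = {
    { AUTH_MUNGE, "munge", munge_handler },
    { AUTH_KRB,   "krb5",  krb_handler },
};

/* Look up the handler for the requested method; -1 if unsupported. */
int dispatch_external_auth(int method, int sock)
{
    for (size_t i = 0; i < sizeof auth_table / sizeof auth_table[0]; i++)
        if ((int)auth_table[i].method == method)
            return auth_table[i].handler(sock);
    return -1;
}
```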

If I understand you correctly, I do not have the multiple-threads problem, because the TPP encryption is done not at the transport protocol level but at the RPP stream layer.

Vasek

@vchlum, that sounds great. Further changes along those lines would be very welcome; as I’m sure you’d agree, we want to keep this open source effort as generic as possible.

Thanks a lot, again, for the great work you guys are doing.

Regards,
Subhasis

You should be able to compile the branch with MIT Kerberos now. AFS support is autodetected. I run configure like this:

CFLAGS="-g -ggdb -Wall -Werror" ./configure --prefix=/usr --with-krbauth PATH_KRB5_CONFIG=/usr/bin/krb5-config.heimdal
CFLAGS="-g -ggdb -Wall -Werror" ./configure --prefix=/usr --with-krbauth PATH_KRB5_CONFIG=/usr/bin/krb5-config.mit

V.

Hello @vchlum, we have a question for you… are you relying on any of the stale DCE/Kerberos code that currently exists in the PBS Pro source? If not, we should attempt to clean that up prior to integrating your changes. Otherwise, it could lead to a great deal of confusion having two independent implementations in place both claiming to support Kerberos.

Thanks,

Mike

Hello @mkaro,

the implementation is independent of the stale code, so it should be safe to remove it.

Of course, I am also willing to do the cleanup.

Vasek

Hi Vasek,

It would seem pretty foolish on our part to refuse your generous offer of removing the dead code. It is very much appreciated. There are other deprecated areas we are slowly removing from the code base. A good housecleaning, so to speak.

Thanks,

Mike