As part of our data processing pipeline, we use Hadoop to execute different tasks, with HDFS as the primary data storage system. Let’s say we have a cluster on which multiple users do their work. If all users have the same privileges, everyone can access everyone else’s data, modify it, and perform tasks on it.

Every user having access to everyone else’s data on the cluster is a problem. To solve it, we need to set up boundaries for each user depending on their level of access. In this blog, we will discuss how to use impersonation and Kerberos to authenticate, and how a proxy user can be used to design a service for multi-user WebHDFS access.

WebHDFS

WebHDFS defines a public HTTP REST API that permits clients and web services to access HDFS over the web. An application running inside a cluster can use the native HDFS protocol or the Java API to access data stored in HDFS. But what if an application needs to access HDFS from outside the cluster? We can still use HDFS, but there are scenarios where the infrastructure allows only HTTP-based communication.

In this scenario, the WebHDFS API comes into play. It can execute FileSystem/FileContext operations through standard REST calls. WebHDFS is enabled on a cluster by adding the following property to the hdfs-site.xml file:

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>

More about how WebHDFS works can be found in the Apache WebHDFS docs.
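To make this concrete, below is a minimal sketch of calling WebHDFS over plain HTTP from Java. The NameNode host, port (9870 is the default NameNode HTTP port in Hadoop 3), path, and user name are assumptions for illustration; LISTSTATUS is one of the standard WebHDFS operations:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsListStatus {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode host, port, path, and user; adjust for your cluster
        final URL url = new URL(
                "http://namenode.example.com:9870/webhdfs/v1/tmp?op=LISTSTATUS&user.name=sophia");
        final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // The response is a JSON FileStatuses object describing the directory
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}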

What is Kerberos?

Kerberos is a sophisticated and widely used network authentication protocol, typically used by client-server applications so that the client can prove its identity to the server. In the Hadoop ecosystem, the Hadoop UserGroupInformation (UGI) API provides a comprehensive framework for using Kerberos in applications.

A service for managing HDFS cluster

When designing an external service that will manage a remote HDFS cluster, we need to keep in mind that the service will be used by multiple users, not just one. In this scenario, the problem mentioned above arises: how to implement a service that securely manages the cluster for each user while staying within that user’s boundaries.

Usually, most applications that work with an HDFS cluster use a single Kerberos principal (user) for the lifetime of the application. The problem occurs when we need multiple users accessing the cluster, e.g. through a service.

The UserGroupInformation

UserGroupInformation (UGI) gives us a standard, straightforward way to handle the complicated Kerberos processes, and provides one place for almost all Kerberos/user authentication to live. To get a user’s UGI for authentication, it needs to be initialized with the configuration of the service.

// org.apache.hadoop.conf.Configuration
final Configuration config = new Configuration();
// Set authentication to kerberos
config.set("hadoop.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(config);
UserGroupInformation.setLoginUser(
     UserGroupInformation.createRemoteUser("sophia")
);
UserGroupInformation.getCurrentUser();

The above code relies on the stateful nature of UserGroupInformation.

The login method above sets a static state within the JVM, and the authenticated principal will be used in every subsequent operation that requires authentication. In scenarios where we need to have multiple UGIs (use multiple principals) but do not have access to a proxy user (explained later), we use another approach.

Using multiple UGIs in the same JVM

For a service to authenticate with Kerberos, a keytab file is necessary. Services use keytabs just like a person would use a username and password. The UGI API has a convenient method that returns an authenticated principal as a UGI object using a keytab:

UserGroupInformation.loginUserFromKeytabAndReturnUGI()

This method solves the problem of multiple users connecting to the same Hadoop environment, that is, sharing the same UserGroupInformation.setConfiguration(config).

Below is a simple example of how this works:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

import java.security.PrivilegedExceptionAction;

public class UgiTest {

    // Shared Hadoop configuration, also visible inside the doAs blocks below
    private final Configuration conf = new Configuration();

    private UserGroupInformation getPrincipalUgi(final String principal,
                                                 final String keytabPath) throws Exception {
        return UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytabPath);
    }

    private Boolean mkdirs(final UserGroupInformation ugi, final Path path) throws Exception {
        return ugi.doAs(new PrivilegedExceptionAction<Boolean>() {
            public Boolean run() throws Exception {
                // Access hdfs, make directory
                final FileSystem fs = FileSystem.get(conf);
                return fs.mkdirs(path);
            }
        });
    }

    private Boolean exists(final UserGroupInformation ugi, final Path path) throws Exception {
        return ugi.doAs(new PrivilegedExceptionAction<Boolean>() {
            public Boolean run() throws Exception {
                // Assert path exists
                final FileSystem fs = FileSystem.get(conf);
                return fs.exists(path);
            }
        });
    }

    public void test() throws Exception {
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        
        final UserGroupInformation liamUgi = getPrincipalUgi("liam", "path/liam.keytab");
        final UserGroupInformation noahUgi = getPrincipalUgi("noah", "path/noah.keytab");
        
        mkdirs(liamUgi, new Path("liam/foo/bal"));
        mkdirs(noahUgi, new Path("noah/foo/bal"));

        exists(liamUgi, new Path("liam/foo/bal"));
        exists(noahUgi, new Path("noah/foo/bal"));
    }
}

The above code authenticates the users "liam" and "noah" through the getPrincipalUgi method, which returns a UserGroupInformation instance for each user. These UGI objects can then be used for subsequent operations, as shown above.

Proxy users – what is impersonation?

Another approach to designing a service that manages the cluster for multiple users is to use impersonation. Hadoop allows us to configure proxy users that submit jobs or access HDFS on behalf of other users. This is called impersonation, and it is a crucial part of designing such a service.

Impersonation can also be called proxying, and the user that performs the impersonation is called the proxy. With impersonation, any job submitted through a proxy runs with the impersonated user’s privilege level. For this approach to work, we need a proxy user configured on the cluster.

Setting up proxies with impersonation privileges

Proxy users are managed in core-site.xml, which gives a centralized place to manage who can impersonate whom and from where. The following properties can be configured:

  • hadoop.proxyuser.{proxy_user_name}.hosts
  • hadoop.proxyuser.{proxy_user_name}.groups
  • hadoop.proxyuser.{proxy_user_name}.users

For example, we can define the following restrictions:

<property>
   <name>hadoop.proxyuser.sophia.hosts</name>
   <value>host_a,host_b</value>
</property>
<property>
   <name>hadoop.proxyuser.sophia.groups</name>
   <value>group_a,group_b</value>
</property>

User sophia can impersonate users from group_a and group_b, from hosts host_a and host_b. Without these properties, impersonation is not allowed and the connection will be declined.

<property>
   <name>hadoop.proxyuser.isabella.users</name>
   <value>logan, mia</value>
</property>

We can also define specifically which users the user isabella can impersonate. The values are comma-separated lists of hosts/groups/users (* is used as a wildcard). The hosts list can also be specified as a list of addresses with ranges:

<property>
   <name>hadoop.proxyuser.isabella.hosts</name>
   <value>10.222.0.0/16,10.113.221.221</value>
</property>

For Kerberos secured clusters, the proxy user must have Kerberos credentials.

Configuring the JAAS file

Kerberos authentication in Java is provided by the Java Authentication and Authorization Service (JAAS). In this case, the authentication mechanism used is the GSS-API for Kerberos.

To configure JAAS, we need the proper credentials, which we can create using Kerberos. Considering that we will only use the proxy user, the JAAS file has one principal:

com.sun.security.jgss.krb5.initiate {
    com.sun.security.auth.module.Krb5LoginModule required
    doNotPrompt=true
    principal="[email protected]"
    useKeyTab=true
    keyTab="/etc/secrets/userproxy.keytab"
    storeKey=true;
};

One way to add the JAAS file is:

System.setProperty("java.security.auth.login.config", pathToJaas);

We should also add the krb5.conf file:

System.setProperty("java.security.krb5.conf", pathToKrb5);

.createProxyUser() and .doAs()

For proper impersonation, we need to use the proxy credentials for login and a proxy user UGI object created for the user that we want to impersonate. Let’s say we want to impersonate the user “william” and “william” is a valid user with credentials.

// "isabella" is the logged-in user
// Create a UGI object for the impersonated "william" using "isabella"'s credentials
final UserGroupInformation williamUgi = UserGroupInformation.createProxyUser(
      "william",
      UserGroupInformation.getLoginUser()
);

Finally, to perform an operation as the impersonated user, we use the .doAs(...) method.

williamUgi.doAs(new PrivilegedExceptionAction<Void>() {
      public Void run() throws Exception {
        // Access HDFS and make a directory as "william"
        final FileSystem fs = FileSystem.get(config);
        fs.mkdirs(new Path("path/to/folder"));
        // Do some more stuff
        return null;
      }
});

In the example above, the client code authenticates to the NameNode using the Kerberos credentials of isabella, then acts as if william made the call. Since william "made the call", all permission checks will be made against william’s permissions.
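Putting the pieces together, here is a minimal sketch of what a multi-user service could look like, assuming a hypothetical proxy principal and keytab path:

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class MultiUserHdfsService {

    private final Configuration conf = new Configuration();

    public MultiUserHdfsService() throws Exception {
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Log in the proxy principal once for the lifetime of the service
        // (hypothetical principal and keytab path)
        UserGroupInformation.loginUserFromKeytab(
                "userproxy@EXAMPLE.COM", "/etc/secrets/userproxy.keytab");
    }

    // Create a directory on behalf of an arbitrary end user
    public boolean mkdirs(final String endUser, final String dir) throws Exception {
        final UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
                endUser, UserGroupInformation.getLoginUser());
        return proxyUgi.doAs(new PrivilegedExceptionAction<Boolean>() {
            public Boolean run() throws Exception {
                final FileSystem fs = FileSystem.get(conf);
                // Permission checks run against endUser, not the proxy
                return fs.mkdirs(new Path(dir));
            }
        });
    }
}

The service logs in with its proxy principal once, and each call creates a fresh proxy UGI for the end user, so permission checks always run against the impersonated user.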

The above-mentioned approaches enable us to start designing a service for multi-user WebHDFS access. To finish off, we highly recommend the book Kerberos and Hadoop: The Madness Beyond the Gate by Steve Loughran, which nicely explains the Kerberos and Hadoop concepts.
