Designing a Highly Available, Fault Tolerant, Hadoop Cluster with Data Isolation


Hadoop is no longer just a buzzword – it has become a business necessity. There has always been an influx of data, but only recently have we begun to unlock the potential of this exponentially growing data. Modern techniques in big data analysis offer new methods to identify and rectify failures, help with data mining and provide feedback for optimization – the avenues are endless. The modern Hadoop ecosystem not only provides a reliable distributed aggregation system that seamlessly delivers data parallelism, but also allows for analytics that can provide great data insights.

In this article we will investigate the design of a highly available, fault-tolerant Hadoop cluster. First, let's dive into the core components of Apache Hadoop; we will then walk through a few modifications that cater to a minimal set of design requirements, taking advantage of the underlying Apache Hadoop infrastructure while adding security and data-level isolation. Let's start with the core components (shown in figure 1).

Figure 1: Apache Hadoop Core Components

The HDFS Cluster

The HDFS cluster comprises a NameNode and multiple DataNodes in a master-slave architecture, as shown in figure 2. The NameNode is the metadata manager responsible for HDFS files and blocks as well as the file system namespace. This information is stored persistently on its local drives as the namespace image and the edit log. The NameNode also keeps non-persistent information, such as the locations of all the blocks for a given file. HDFS files are split into blocks that are replicated and stored on DataNodes. Each DataNode periodically syncs up with the NameNode, reporting the blocks it holds. The HDFS architecture takes care of the data storage, fault tolerance and loss prevention aspects of Apache Hadoop.

Figure 2: HDFS Architecture – NameNode (Master) – DataNodes (Slave) Configuration
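
As an illustration, a minimal hdfs-site.xml along these lines pins down where the NameNode keeps its namespace image and edit log, where each DataNode stores blocks, and the replication factor; the paths and values here are placeholders, not recommendations:

<!-- hdfs-site.xml (illustrative values only) -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hdfs/namenode</value>   <!-- namespace image and edit log -->
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hdfs/datanode</value>   <!-- block storage on each DataNode -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>                     <!-- each block is replicated to three DataNodes -->
  </property>
</configuration>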

YARN (Yet Another Resource Negotiator) Architecture

MapReduce 2.0, or YARN, divvies up the responsibilities of the JobTracker, the job dispatcher from previous Hadoop versions: it separates resource management from application management, and it also separates job scheduling and monitoring from resource management.

The ResourceManager is a global master and an arbitrator of resources. The ApplicationMaster manages the lifecycle of a user application, and there is one ApplicationMaster per application; so, for example, you will have one for a MapReduce application, another for an interactive application, and so on. A JobHistoryServer daemon tracks these applications and records their completion. Finally, there is also a per-node slave called the NodeManager (akin to the previous version's TaskTracker) that tracks the tasks running on its particular node. The YARN architecture takes care of the data computation and management aspects of Apache Hadoop.

Figure 3: YARN Architecture – ResourceManager (Master) – ApplicationMaster + NodeManager (Master+Slave) Configuration
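
As a rough sketch of how these pieces are wired together (the hostname below is a placeholder), clients are told to run MapReduce on YARN, and NodeManagers are pointed at the ResourceManager and the shuffle service:

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>                    <!-- run MapReduce 2.0 jobs on YARN -->
</property>

<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm-host.example.com</value>     <!-- the global ResourceManager (master) -->
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>       <!-- auxiliary shuffle service used by MapReduce -->
</property>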

Now that we have a background on the Apache Hadoop core components, let's jot down our minimal set of design requirements. I will list them here first and then walk through the details of how to meet each requirement with Apache Hadoop 2.

Requirements:

  • High Availability – the cluster should have no single point of failure.
  • Security – the cluster should cover all the layers of defense.
  • Scale Out – maximize (network I/O) performance and minimize physical footprint.
  • Hadoop on Virtual Machines – elasticity, improved multi-tenancy and increased system utilization.

High Availability

The Hadoop 2 release established a configuration with an active and a standby NameNode, thus avoiding a single point of failure. Whenever the conservative Failover Controller detects a failure, it lets the standby take over and brings down the previously active NameNode (either via “fencing” or by “shooting the other node in the head”). The standby NameNode can take over very quickly because the active and standby NameNodes share the edit log, and the DataNodes report their blocks to both.
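
A sketch of the relevant hdfs-site.xml settings is shown below; the nameservice ID, NameNode IDs, hosts and JournalNode quorum are placeholders. It names the active/standby pair, shares the edit log through quorum JournalNodes, and turns on automatic failover with a fencing method:

<!-- hdfs-site.xml (illustrative NameNode HA settings) -->
<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1-host:8020</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2-host:8020</value></property>
<!-- shared edit log: written by the active NameNode, replayed by the standby -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<!-- automatic failover plus fencing of the old active NameNode -->
<property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
<property><name>dfs.ha.fencing.methods</name><value>sshfence</value></property>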

The ResourceManager in YARN can also be configured for high availability. Here the Failover Controller is part of the ResourceManager itself and promotes the standby to take over when the active ResourceManager fails.
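
A corresponding yarn-site.xml sketch (again with placeholder IDs, hosts and a ZooKeeper quorum) might look like this:

<!-- yarn-site.xml (illustrative ResourceManager HA settings) -->
<property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.cluster-id</name><value>mycluster</value></property>
<property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
<property><name>yarn.resourcemanager.hostname.rm1</name><value>rm1-host</value></property>
<property><name>yarn.resourcemanager.hostname.rm2</name><value>rm2-host</value></property>
<!-- ZooKeeper ensemble used for leader election -->
<property><name>yarn.resourcemanager.zk-address</name><value>zk1:2181,zk2:2181,zk3:2181</value></property>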

Security

Whenever we talk about security, we talk about the “layers of defense” (also known as the “rings of defense”). The layers are authentication, authorization, audit and data protection:

Security - Authentication:

The most common form of authentication available in native Apache Hadoop is Kerberos. Authentication can be from users to services (e.g. HTTP authentication), or from services to services, either as a user (e.g. a proxy user) or as a service (e.g. client SSL certificates).
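
As an example, enabling Kerberos starts in core-site.xml and is completed with per-daemon principals and keytabs; the realm, principal and keytab path below are placeholders:

<!-- core-site.xml -->
<property><name>hadoop.security.authentication</name><value>kerberos</value></property>
<property><name>hadoop.security.authorization</name><value>true</value></property>

<!-- hdfs-site.xml: example NameNode principal and keytab -->
<property><name>dfs.namenode.kerberos.principal</name><value>nn/_HOST@EXAMPLE.COM</value></property>
<property><name>dfs.namenode.keytab.file</name><value>/etc/hadoop/conf/nn.service.keytab</value></property>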

Security - Authorization:

Apache Hadoop already provides Unix-like file permissions and also has access control lists (ACLs) for MapReduce jobs, YARN, and so on.
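
These checks are switched on in the respective configuration files; a small sketch of the relevant switches:

<!-- hdfs-site.xml -->
<property><name>dfs.permissions.enabled</name><value>true</value></property>    <!-- Unix-like file permissions -->
<property><name>dfs.namenode.acls.enabled</name><value>true</value></property>  <!-- extended HDFS ACLs -->

<!-- yarn-site.xml -->
<property><name>yarn.acl.enable</name><value>true</value></property>

<!-- mapred-site.xml -->
<property><name>mapreduce.cluster.acls.enabled</name><value>true</value></property>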

Security – Accountability/Audit:

Audit logs provide for accountability, and native Apache Hadoop has audit logs for the NameNode that record file creation, opening and so on. There are also history logs for the JobTracker, JobHistoryServer and ResourceManager; the history logs record all jobs that run on a particular cluster.

Security – Data at Rest & Data in Motion Protection:

Encrypting data at rest is easily done using whatever encryption the operating system has to offer, as well as hardware-level encryption. Protection of data in motion, on the other hand, needs to be enabled in the configuration files, as detailed below:

For client interaction over RPC, the SASL (Simple Authentication and Security Layer) protocol can be enabled by setting ‘hadoop.rpc.protection=privacy’ in core-site.xml. Note: Java SASL provides different levels of data protection (also known as QOP - Quality of Protection). Depending on the desired quality, the user can set the protection parameter to “authentication” for authentication only, “integrity” for authentication plus integrity of exchanged data, or “privacy” to add encryption (with symmetric keys) and guard against “man-in-the-middle” attacks. Both integrity checks and encryption come at a performance cost.
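
In core-site.xml this is a one-line change (shown with the ‘privacy’ QOP discussed above):

<!-- core-site.xml -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>   <!-- alternatives: authentication, integrity -->
</property>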

The Data Transfer Protocol (DTP) used for HDFS block data doesn’t utilize the SASL framework for authentication, so as a direct effect there is no QOP. Hence it is necessary to wrap the DTP with SASL handshaking, which can be achieved by setting ‘dfs.encrypt.data.transfer=true’ in hdfs-site.xml.

Finally, for HTTP over SSL, simply setting ‘dfs.https.enable=true’ and then enabling two-way SSL with ‘dfs.client.https.need-auth=true’ in hdfs-site.xml does the trick. For the MapReduce shuffle, SSL can be enabled by setting ‘mapreduce.shuffle.ssl.enabled=true’ in mapred-site.xml.
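
Pulling the last three settings together, the corresponding snippets look roughly like this:

<!-- hdfs-site.xml -->
<property><name>dfs.encrypt.data.transfer</name><value>true</value></property>  <!-- SASL-wrapped data transfer protocol -->
<property><name>dfs.https.enable</name><value>true</value></property>           <!-- HTTP over SSL -->
<property><name>dfs.client.https.need-auth</name><value>true</value></property> <!-- two-way (mutual) SSL -->

<!-- mapred-site.xml -->
<property><name>mapreduce.shuffle.ssl.enabled</name><value>true</value></property>  <!-- SSL for the MapReduce shuffle -->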

Scale Out

Although Hadoop boasts of needing only commodity hardware, data traffic in a Hadoop cluster is always a big deal. Even a medium-size cluster generates a lot of replication traffic as well as data movement between mappers and reducers. Hence, it is very important to choose cluster hardware with a strong networking backbone that at the same time delivers good performance and is economical enough to meet or exceed your scale-out needs.

At Servergy, we designed such a system using Freescale’s QorIQ T4240 64-bit communications processor. The power-efficient Servergy CTS storage appliance (shown in figure 4) has two T4240 processors. Each T4240 has a security coprocessor that is used to accelerate encryption/decryption operations. The T4240 also has four 10GigE ports and a 20Gig SRIO (Serial RapidIO) port; SRIO offers a low-latency, high-bandwidth interconnect. Our current single cluster only uses four of the eight 10GigE ports on each Servergy CTS appliance - two 10GigE ports are connected to an active switch and the other two are connected to a redundant/standby switch to provide high availability at the switch level. This configuration is shown in figure 5. Note: Depending on the deployment, we can bond the 10GigE ports to increase bandwidth or use the SRIO for low-latency transport.

Figure 4: Servergy CTS Storage Appliance Block Diagram

Figure 5: CTS Appliance Solution – 10x12x2 = 240 cores; 480 threads

Hadoop on Virtual Machines

Many systems in a Hadoop cluster not only handle computational needs but also provide data storage. Hence, if you are thinking of Hadoop-as-a-service, there is always a concern about data security. Enter virtualization! Virtualization not only provides the desired isolation, it also provides elasticity. Virtualization adds to the multi-tenancy offered by YARN and improves overall system utilization by maximizing the use of each host's resources. Beyond that, ease of deployment is also a great perk of virtualization.

Keeping with Hadoop’s traditional layout, each virtual machine (VM) could run a NodeManager/TaskTracker and a DataNode, as shown in figure 6. This configuration doesn’t provide all the benefits of virtualization, though. To begin with, it is not really elastic; you have to pre-configure it for your (growing) needs. For example, any growth in data on a tightly configured cluster would require adding new nodes, but then you end up with spare computational resources and also need to rebalance the cluster. And there is no separation of data computation from data storage.

Figure 6: Traditional Hadoop layout on Virtual Machines.

In order to make it more elastic, we can change the per-VM ‘data + compute’ configuration into a more service-oriented architecture, which would be more conducive to Hadoop-as-a-service providers (and even Infrastructure-as-a-service providers). Let’s discuss two new configurations:

Figure 7: Virtualized Hadoop in a Service-Oriented Architecture Configuration

Figure 7 shows a configuration with a single virtualized DataNode serving multiple NodeManagers/TaskTrackers. Depending on the computational needs of the cluster, more NodeManagers can be added by adding more virtual nodes to the cluster. Since each VM runs its own NodeManager, this configuration not only provides compute and data-level isolation but also great multi-tenant isolation. The key is to find a good balance for the total number of virtualized NodeManagers per virtualized DataNode on a given host; this will largely depend on your data, replication factor, application(s) and cluster capacity.
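
One way to express that balance, as a purely hypothetical example, is to cap what each compute VM’s NodeManager advertises to the ResourceManager in yarn-site.xml; the numbers below are placeholders to be derived from your own workload:

<!-- yarn-site.xml on each compute VM (illustrative sizing only) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>   <!-- memory this NodeManager offers to containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>      <!-- virtual cores this NodeManager offers to containers -->
</property>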

Another configuration that is gaining a lot of attention in the Infrastructure-as-a-service realm keeps the data on persistent storage and adds virtualized compute nodes (NodeManagers/TaskTrackers), or a combination of data nodes and compute nodes, to complete the cluster. This configuration is shown in figure 8.

Figure 8: Virtualized Hadoop in an Infrastructure-As-A-Service Configuration

Here the infrastructure takes care of all the networking, load-balancing and persistent data storage needs. The cluster and VM manager controls the Hadoop cluster, which comprises the compute VMs (TaskTrackers/NodeManagers) and the combined ‘DataNode + NodeManager’ VMs. This configuration makes it easy to access HDFS data in the cloud (when needed for computational purposes) and also simplifies data backup to persistent storage. In such a configuration, there is no need for a long-running cluster in the cloud, since all the data (raw, analyzed or mined) is stored persistently.

About the Author

Monica Beckwith is a Java Performance Consultant. Her past experience includes working with Oracle/Sun and AMD, optimizing the JVM for server-class systems. Monica was voted a Rock Star speaker at JavaOne 2013 and was the performance lead for the Garbage First garbage collector (G1 GC). You can follow Monica on Twitter @mon_beck.
