I have just come off another Hadoop training course last week this time centered around Hive and Pig. Keeping up to date on what’s happening in the Hadoop space is time exhausting. Just recently Teradata announced a partnership with the other big Hadoop player Cloudera.
Therefore keeping track of the bugs, releases, what other people are building, how it is being used and where the platform is heading is a never ending course of reading and research. In my previous blogs I’ve covered the value of Hadoop and how important it is to have a metadata strategy for Hadoop.
Many people have a vague understanding of what Hadoop does and the business benefits it provides, but others need to delve into the detail. Over the next few blogs, I’m going to cover some of the basic individual components of Hadoop in detail. Explain what they do, some use cases and why they are important. The best approach I think would be to start from the ground and then move up. Therefore blog #1 will focus on HDFS (Hadoop Distributed File System).
The purpose of HDFS is to distribute a large data set in a cluster of commodity linux machines in order to later use the computing resources on the machines to perform batch data analytics. One of the key significant attractions of Hadoop is it’s ability to be run on cheap hardware and HDFS is the component that provides this capability. HDFS provides a very high throughout access to the data and is the perfect environment for storing large data sets. The throughput rates makes it great for quickly landing data from multiple data sources such as sensor’s, RFID and web log data.
Each HDFS cluster contains the following:
- NameNode: Runs on a “master node” that tracks and directs the storage of the cluster.
- DataNode: Runs on “slave nodes,” which make up the majority of the machines within a cluster. The NameNode instructs data files to be split into blocks, each of which are replicated three times and stored on machines across the cluster. These replicas ensure the entire system won’t go down if one server fails or is taken offline—known as “fault tolerance.”
- Client machine: neither a NameNode or a DataNode, Client machines have Hadoop installed on them. They’re responsible for loading data into the cluster, submitting MapReduce jobs and viewing the results of the job once complete.
WORM– Write Once Read Many. HDFS uses a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. There is a plan to support appending-writes to files in the future.
The following diagram downloaded from the Apache site outlines the basics of the HDFS architecture.
Diagram 1- The HDFS architecture
Data Replication within HDFS
HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. The NameNode makes all decisions regarding replication of blocks.
Accessibility of HDFS
I’ve been asked on several occasions on whether the storage of files in Hadoop is proprietary to only those applications that can access it. In fact the opposite is true. HDFS can be accessed in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance.
Another interesting question that is often asked is what is the volume of data that Hadoop can hold? Well how long is a piece of string? At a minimum, a 3 cluster datanode environment according to the Teradata Hadoop appliance can hold 12.5TB per data node. That’s 37.5TB of storage at a minimum. Then add in an average compression factor of 3x and all of a sudden we are looking at 112TB of data storage for a minimum Hadoop configuration. That’s some serious storage!
Therefore in summary, HDFS is often called the “secret sauce” of Hadoop. It is the layer where the data is stored and managed. Think of it like a standard file storage system with an ability to provide data replication across commodity hardware devices. A minimal install of Hadoop has a Namenode that manages the environment (Metadata file location etc) and then multiple datanodes where the data is stored in chunks across as many datanodes available.
Ben Davis is a Senior Architect for Teradata Australia based in Canberra. With 18 years of experience in consulting, sales and technical data management roles, he has worked with some of the largest Australian organisations in developing comprehensive data management strategies. He holds a Degree in Law, A post graduate Masters in Business and Technology and is currently finishing his PhD in Information Technology with a thesis in executing large scale algorithms within cloud environments.
Latest posts by Ben Davis (see all)
- Mastering colours in your data visualisations - March 8, 2017
- Spotting the pretenders in Data Science - February 15, 2017
- Leveraging all Data in a Government/Client Engagement - November 15, 2016
- Can we defeat DDoS using analytics? - August 15, 2016
- The pitfalls of DIY Hadoop - August 8, 2016