The current Hadoop market is dominated by two players: Cloudera and Hortonworks. Both are built on top of open-source Hadoop and are very similar in their packaging. From a software perspective they differ in a few applications (Impala, Ambari, Ranger, Sentry and so on), and they also differ in their support structures. Standing on the sidelines, I’m reminded of watching a similar game play out over a decade ago in the Linux space, when Red Hat, SUSE and others were all competing in the same space.
Customers who are thinking about going down the Hadoop pathway often have different objectives and come at it from different angles. Sometimes they will set up a lab environment with a small deployment of open-source, no-frills Hadoop and go from there, adding packages and building out their cluster. The risk lies in identifying when the lab is ready for prime time in production, and in deciding whether to stick with the open-source version or convert to an enterprise-grade distro complete with support. Alternatively, they will decide to go all in and begin their journey with an enterprise-grade Hadoop from day one. The question on their mind is: which one to choose?
I’m often asked by customers and peers which distro to go with: Cloudera or Hortonworks. I usually preface my answer with commentary on support options, resources in the market, who else is using which distro and how they are faring on the journey. I’m in the enviable position of being able to offer views and recommendations backed by a deep understanding of multiple factors. Recently, however, I’ve been challenging those who ask the question: why stake everything on a single vendor? After all, if the differences aren’t too great, why not go with a dual-vendor strategy?
A single data lake versus multiple data lakes
If you’ve heard of the data lake concept, you know it’s the approach of landing data of all shapes and sizes in a low-cost, no-schema environment. The data lake is then used to refine data and serve it to multiple analytic environments such as a data warehouse, SAS or Teradata’s Advanced Analytics platform, Aster. The common approach so far has been to deploy a single data lake for the organisation. This design approach is similar to the data warehouse mindset of the 1990s, when we would build a single warehouse that would be all things to all people.
Today, some customers have multiple data warehouses, with the primary driver being the requirement to separate data and workloads. In government especially, we see a need for a data warehouse that stores highly classified datasets and keeps them physically separated from other datasets. Now apply this design to the data lake concept. Whilst a single data lake has the merit of storing all of the data under one roof, handling different workloads and different security rights, the reality is that it can quickly become a data management nightmare. The driver for having multiple data lakes is not technology but the corporate need to isolate different workloads, data security requirements, country boundaries and corporate divisions.
Deploying your data lake 3 ways
When it comes to a data lake deployment strategy you essentially have the choice of three architectures.
- Shared Nothing
- Shared Management
- Shared Everything
1. The Shared Nothing deployment
You may already be familiar with the shared nothing architecture if you’ve been knocking around Massively Parallel Processing (MPP) architectures for a while. The concept is that each Hadoop cluster has its own dedicated storage, processing and management. An example of this is depicted in the following diagram:
2. The Shared Management deployment
Using this deployment model, you maintain the separation of clusters but centralise their management under a single management layer. This approach still keeps the data physically separate and meets the numerous compliance and security requirements, while reducing the administrative overhead of managing multiple clusters.
3. The Shared Everything deployment
This is how many organisations have deployed their data lakes so far: a single cluster servicing multiple data types, multiple users and multiple workloads.
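To make the distinction between the three models concrete, here is a minimal Python sketch. It is an illustration only, not a Hadoop, Cloudera or Hortonworks API: the `DataLake` record, the `classify` function and the cluster and management identifiers are all hypothetical names invented for this example. It assumes each lake can be described by the cluster it runs on (storage plus processing) and the management layer that administers it.

```python
from dataclasses import dataclass
from enum import Enum


class Deployment(Enum):
    SHARED_NOTHING = "shared nothing"
    SHARED_MANAGEMENT = "shared management"
    SHARED_EVERYTHING = "shared everything"


@dataclass(frozen=True)
class DataLake:
    """Hypothetical description of one data lake (names are illustrative)."""
    name: str
    cluster: str      # identifier of the cluster providing storage and processing
    management: str   # identifier of the management layer administering it


def classify(lakes):
    """Return which of the three deployment models a set of lakes follows."""
    if not lakes:
        raise ValueError("at least one data lake is required")
    clusters = {lake.cluster for lake in lakes}
    managers = {lake.management for lake in lakes}
    if len(clusters) == 1:
        # One cluster serves every data type, user and workload.
        return Deployment.SHARED_EVERYTHING
    if len(managers) == 1:
        # Separate clusters, one management layer over all of them.
        return Deployment.SHARED_MANAGEMENT
    if len(managers) == len(clusters):
        # Every cluster has its own storage, processing and management.
        return Deployment.SHARED_NOTHING
    raise ValueError("mixed layout: some clusters share management, others do not")


if __name__ == "__main__":
    lakes = [
        DataLake("classified", cluster="cluster-a", management="mgmt-a"),
        DataLake("general", cluster="cluster-b", management="mgmt-b"),
    ]
    print(classify(lakes).value)
```

In this sketch, a government agency that keeps classified and general datasets on separate clusters with separate administration would classify as shared nothing. Pointing both lakes at the same management identifier would turn it into shared management.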
How you choose to deploy Hadoop depends entirely on your data security, workload and geographical requirements. What you have here is flexibility. Don’t think that your data lake has to be a single lake with a single management layer. If you need to build multiple lakes, don’t be afraid to.
Ben Davis is a Senior Architect for Teradata Australia based in Canberra. With 18 years of experience in consulting, sales and technical data management roles, he has worked with some of the largest Australian organisations in developing comprehensive data management strategies. He holds a degree in Law and a postgraduate Master’s in Business and Technology, and is currently finishing his PhD in Information Technology with a thesis on executing large-scale algorithms within cloud environments.