In case you haven’t realised it yet, it’s 2015. The year ahead for technology promises yet more advances. We will see slimmer, more vibrant TVs, data sensors built into everything that we use (cars, toothbrushes, homes etc.) and, I dare say, a plethora of new smartphones that will be just that little bit more intelligent, faster and packed with extras that half of us will never use.
But what about enterprise technology and the evolving nature of data analytics? Over the past few years we have certainly been on the upward curve of the Gartner hype cycle for analytics. In particular, we have seen the Hadoop market become a hive of activity as organisations look to get more insights and results out of their data.
Figure 1: Gartner Hype cycle. Source: Wikipedia
But whilst we continue to see organisations entrench Hadoop ever deeper into their landscape, it is worth remembering that this is not new technology.
Hadoop grew out of ideas developed at Google in the early 2000s, as Google was one of the first organisations to experience the data explosion that most other organisations are only experiencing today. The rest is history and has been written about many times over: Google went on to develop the Google File System (GFS) and MapReduce. These two technologies were then used to crawl, analyse and rank billions of web pages into the result set that we all see at the front end of the Google interface every time we search.
Then Apache got on board, and what was produced was Apache Hadoop, which had at its core HDFS (based on GFS) and MapReduce, amongst an array of other capabilities.
Over the years we have seen Hadoop evolve into an ecosystem of open source technologies with a wide range of amusing names such as Oozie, Pig, Hive, Spark and ZooKeeper. We have also seen it become mainstream and adopted by many organisations, and we have seen Teradata build the technology into its data ecosystem by developing the Unified Data Architecture and forming partnerships with Hortonworks and, more recently, MapR.
But what’s interesting for us in the Hadoop field is that MapReduce, a core component of the original Hadoop technology, is not as important as it used to be.
So for this blog I decided to look at two technologies that are more recent and promise to evolve the Hadoop journey and overcome the barriers we have encountered in the past. Bear in mind that even though we are just starting to see these technologies appear in the Enterprise, their concepts and designs are now a few years old. This goes to show that in the open source community it takes a while to go from concept to Enterprise.
First up is Percolator. Great, yet another catchy name! Hadoop loves large data sets. In fact, the more you give it, the more it will revel in its duty. But it does suffer from full table scans each time you add more data. What that basically means is that as your data grows, your analysis time gets longer and longer, because you are constantly re-scanning large data sets.
In fact, many organisations I have spoken to assumed they could use Hadoop for fast processing of large and growing data sets, without realising that performance can suffer as their data grows. So the joke is often that once you kick off a MapReduce job, you may as well go off and make a coffee, do your shopping and watch a movie, then come back to see if the job has completed.
Therefore the smart cookies over at Google came up with Percolator. In essence, Percolator is an incremental processing engine. It replaces the batch-based processing approach with incremental processing and secondary indexes. The result is that as you add more data, your processing times do not blow out, because full table scans are avoided.
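To make the batch versus incremental distinction concrete, here is a minimal Python sketch. It is not Percolator's API; the names `batch_index` and `IncrementalIndex` are hypothetical, and a tiny in-memory word index stands in for a real web-scale index. The only point is to show how much data each approach has to touch when one document arrives.

```python
from collections import defaultdict


def batch_index(docs):
    """Batch style: rebuild the whole term -> doc ids index from scratch.

    Returns the index and the number of documents scanned, which is
    always every document, no matter how little has changed.
    """
    index = defaultdict(set)
    scanned = 0
    for doc_id, text in docs.items():
        scanned += 1
        for term in text.split():
            index[term].add(doc_id)
    return index, scanned


class IncrementalIndex:
    """Incremental style: only the new or changed document is processed."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> set of doc ids
        self.docs = {}                 # doc id -> last text seen
        self.scanned = 0               # documents processed so far

    def upsert(self, doc_id, text):
        # Remove the postings left behind by the previous version, if any.
        for term in set(self.docs.get(doc_id, "").split()):
            postings = self.index[term]
            postings.discard(doc_id)
            if not postings:
                del self.index[term]
        # Index just this one document.
        self.docs[doc_id] = text
        self.scanned += 1
        for term in text.split():
            self.index[term].add(doc_id)
```

With a million documents already indexed, adding one more costs `batch_index` a million and one document scans, while `IncrementalIndex.upsert` touches a single document. That is the essence of the trade, under the simplifying assumption that each update can be applied on its own.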
Percolator is built on top of Bigtable. Bigtable is a sparse, multi-dimensional sorted map, used in conjunction with MapReduce. The following figure shows the multi-layered approach of Percolator:
Figure 2: Percolator architecture. Source: Google Research
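If "sparse, multi-dimensional sorted map" sounds abstract, the following Python sketch may help. It is a toy, single-machine illustration and not Bigtable's real interface: the class `SparseSortedMap` and its methods are hypothetical names, and I am assuming the usual description of the model, where each value is addressed by a row key, a column key and a timestamp.

```python
import bisect


class SparseSortedMap:
    """Toy map from (row, column, timestamp) to value, kept in sorted order."""

    def __init__(self):
        # Keys are (row, column, -timestamp) so that, within one cell,
        # the newest version sorts first.
        self._keys = []
        self._cells = {}

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return the latest value of a cell, or None if it was never set."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._cells[self._keys[i]]
            i += 1
```

"Sparse" here simply means that a cell which was never written takes up no space at all, and "sorted" means that a range of neighbouring rows can be read without looking at the rest of the table.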
Data science is the art of exploring data and attacking it from different angles to get new and different insights. MapReduce, however, is built for organised processing of jobs, and the volume of coding and level of expertise required are intense. This approach does not suit the ad-hoc style of analysis over large data sets that data scientists require. As I highlighted above, MapReduce jobs aren’t the fastest things in the world, so they don’t lend themselves to ad-hoc, iterative exploration of large data sets.
Once again, the team at Google came up with an answer: Dremel. It has been around since 2006 and is used by thousands of users at Google, so it’s not really new technology per se; however, it does represent the future. Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. Using a combination of multi-level execution trees and a columnar data architecture, it is capable of running a query over trillion-row tables in seconds.
The sheer scalability of Dremel is truly impressive, with the ability to scale across thousands of CPUs and petabytes of data. The testing that Google has performed has demonstrated that it is about 100 times faster than MapReduce.
In recent times Dremel has inspired the development of Hadoop expansions such as Apache Drill and Cloudera Impala, and I expect them to become more and more prevalent within Enterprises as deployments of Hadoop become more advanced.
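The two ideas behind that speed, columnar storage and a multi-level execution tree, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the records are flat rather than nested, every record has every field, everything runs in one process, and the function names (`to_columns`, `leaf_partial`, `combine`, `tree_average`) are hypothetical rather than anything from Dremel itself.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def leaf_partial(shard, column):
    """Leaf server: read only the requested column of its shard."""
    values = shard[column]
    return (sum(values), len(values))


def combine(partials):
    """Intermediate server: merge partial (sum, count) results."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def tree_average(shards, column, fanout=2):
    """Average a column by pushing partial results up a tree of servers."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    if not shards:
        return None
    level = [leaf_partial(shard, column) for shard in shards]
    while len(level) > 1:
        # Each parent merges the results of up to `fanout` children.
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    total, count = level[0]
    return total / count if count else None
```

A query for the average of one column never reads the other columns, and no single node ever has to look at all of the data, only at a handful of small partial results from its children.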
So you may well ask: is Hadoop finished? Well, not really, but it is evolving. It is adapting to the needs of modern-day enterprises, with speed of analytics a primary driver of these advancements.
It is no surprise that Google has been a key driver in the incubation of these new techniques, as it is our use of the internet that has given rise to the need for these approaches. We are creating more and more data every day, but at the same time we need to analyse the data at a faster rate. Hadoop was developed essentially in another era, when the volume of data was smaller and the need for speed was lower.
So we will continue to see new and wonderful ways to tackle the data problem, and these will eventually make their way into the products we use today.
Ben Davis is a Senior Architect for Teradata Australia based in Canberra. With 18 years of experience in consulting, sales and technical data management roles, he has worked with some of the largest Australian organisations in developing comprehensive data management strategies. He holds a Degree in Law and a postgraduate Masters in Business and Technology, and is currently finishing his PhD in Information Technology with a thesis on executing large-scale algorithms within cloud environments.