Gartner's "hype cycle" model for the deployment of new technology rests on the assumption that our industry typically over-hypes innovation, so that new technologies typically scream up "the peak of inflated expectations" - before crashing unceremoniously into "the trough of disillusionment". Select technologies eventually proceed – typically at a more leisurely pace – onto "the plateau of productivity". Others are "obsolete before maturity" - and quietly wither and die.
Any model is just that – an idealised representation of reality, not to be confused with reality itself. But I must admit to thinking several times recently that one of the most interesting and promising of the new technologies for information management and analytics is being propelled up the peak of inflated expectations at record speed by commentators and journalists who, frankly, should know much, much better. Most of you reading this will know exactly which technology I am referring to; in the unlikely event that you have just returned from a trip to Mars, the technology is Hadoop – and the problem is the rash of simplistic and uncritical commentaries on its undeniably impressive progress.
There are broadly three issues with many of the articles on Hadoop that are flowering all over the trade press and the web at the moment.
MapReduce <> Hadoop <> Big Data
The first issue is that many of these commentaries insist on equating "MapReduce" with "Hadoop" and "Hadoop" with "Big Data". "Big Data" itself is rather a nebulous and ambiguous concept – which gets these articles off to a poor start – and in any serious discussion about "Big Data", we should distinguish between "lots of relational data" and "multi-structured (e.g.: text, web, audio, image and video) data". Few of these articles concern themselves with that distinction.
MapReduce is, of course, a programming model that enables complex processing logic, expressed in Java and other programming languages, to be parallelised efficiently, permitting its execution on "shared nothing", scale-out hardware architectures. Hadoop is one implementation of the MapReduce programming model; there are others – and there are other approaches to parallel processing that are a better fit for many classes of analytic problem. Again, few of these articles are clear on this point.
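To make the programming model concrete, below is a minimal sketch of the canonical "word count" job, written against Hadoop's Java MapReduce API. The framework parallelises the map function across the cluster's input splits, groups the intermediate (word, 1) pairs by key (the "shuffle"), and then parallelises the reduce function over the grouped results; the input and output paths are assumed to arrive as command-line arguments.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel across input splits; emits a (word, 1)
  // pair for every token it encounters.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: the framework has grouped the pairs by key, so each
  // call sees one word together with all of its counts, and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that nothing in this model says anything about how the data are structured or related – the map and reduce functions see only opaque key-value pairs – which is precisely why MapReduce and a relational DBMS are such different animals.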
Out with the old, in with the new?
The second issue with the current crop of articles is that many of them position Hadoop as an alternative to existing, SQL-based technologies – one that is likely to displace, or even entirely replace, them. In fact, as this article on the Apache website makes clear: "Apache Hadoop stores data in files, and does not index them… if you want to find something, you have to run a MapReduce job going through all the data… this takes time, and means that you cannot directly use Hadoop as a substitute for a database". And besides representing an inefficient processing model for many operations, Hadoop also lacks important capabilities found in a mature and sophisticated data warehouse RDBMS, for example: query re-write and cost-based query optimisation; mixed-workload management; security, availability and recoverability features; support for transactions; etc., etc., etc.
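The Apache quotation is worth making concrete. Even a trivial point lookup – "find the records for one customer" – has to be expressed as a (map-only) job that scans every block of every file, where an RDBMS would satisfy the equivalent SELECT with an index probe touching only the relevant rows. A hypothetical sketch, assuming tab-delimited records with the customer ID in the first column:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical map-only "lookup" job: because HDFS files carry no index,
// finding one customer's records means streaming the entire data set
// through the mapper. An RDBMS would answer the equivalent query -
//   SELECT * FROM weblog WHERE customer_id = 42;
// - by probing an index and touching only the relevant rows.
public class CustomerLookup {

  public static class LookupMapper extends Mapper<Object, Text, Text, Text> {
    private static final String TARGET_ID = "42"; // assumed search key

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumption: tab-delimited records with customer_id in column 0.
      String[] fields = value.toString().split("\t", 2);
      if (TARGET_ID.equals(fields[0])) {
        context.write(new Text(fields[0]), value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "customer lookup");
    job.setJarByClass(CustomerLookup.class);
    job.setMapperClass(LookupMapper.class);
    job.setNumReduceTasks(0); // map-only: filtering needs no aggregation
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```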
There is, of course, a whole ecosystem springing up around Hadoop – including HBase, Hive, Mahout and ZooKeeper, to name just four – and some commentators argue that in time these technologies may extend Hadoop to the point where this ecosystem could provide an alternative to existing Data Warehouse DBMS technology.
What we can say for certain is that we are a long way from this point today – and that optimizing the underlying Hadoop Distributed File System (HDFS) for both the set-based analysis of relational data and the iterative processing of raw, multi-structured data is likely to prove a formidable technical challenge (it is a basic engineering tenet that you cannot simultaneously optimize for everything). It's also debatable whether a disparate set of extensions, designed and built by different groups with differing objectives, can ever become the cohesive whole that these commentators envisage. Perhaps most importantly, since an understanding of how different data relate and can be compared with one another is fundamental to the efficient large-scale sharing of data, it is difficult to see how Hadoop – which lacks a schema concept – could evolve to support this kind of sharing. And the sharing of consistent data is, after all, the raison d'être for any database.
It's Open Source so it's free, right?
The third issue with many of these commentaries builds on the second, by suggesting not only that Hadoop will shortly be functionally equivalent to existing Data Warehouse DBMS technology - but also that it will always be cheaper to deploy Hadoop-based solutions than "equivalent" DBMS-based solutions, because Hadoop is Open Source and is therefore "free". These assertions are typically supported by "unit cost of storage" or "cost-per-CPU-core" metrics.
These metrics are, of course, utterly irrelevant. If I need to deploy ten 50 TB "databases" on technology A to replace one 50 TB database built on technology B (because, for example, I need to compensate for technology A's lack of mixed-workload management and its inability to support multiple, concurrent queries), then even if the unit cost of technology A is one-tenth that of technology B, the acquisition costs are still a wash. And the technology A–based deployment will necessarily incur much greater systems integration, systems administration and management, power and cooling and other costs – so that, from a total cost of ownership (TCO) perspective, it will be substantially more expensive.
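The arithmetic is worth spelling out, with deliberately hypothetical figures:

```java
// Illustrative arithmetic only, using invented numbers: ten systems at
// one-tenth the unit cost still equal one system at full cost.
public class AcquisitionCostSketch {
  public static void main(String[] args) {
    double unitCostB = 1_000_000.0;    // hypothetical cost of one technology-B system
    double unitCostA = unitCostB / 10; // technology A: one-tenth the unit cost...
    int systemsNeededA = 10;           // ...but ten systems needed for the workload

    double acquisitionB = 1 * unitCostB;
    double acquisitionA = systemsNeededA * unitCostA;

    // The acquisition costs are identical ("a wash") - and the ten-system
    // deployment still carries ten systems' worth of integration,
    // administration, power and cooling costs on top.
    System.out.printf("A: %.0f  B: %.0f%n", acquisitionA, acquisitionB);
  }
}
```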
Similarly, understanding how efficiently different technologies use the hardware resources available to them to execute a given workload is critical to understanding the true cost of computation – and it follows that an "acquisition-cost-per-CPU-core" metric tells us less than nothing about actual TCO for real-world workloads. It's not what you've got that matters, but how you use it.
For example, a major Teradata customer recently presented an interesting cost analysis at a European Teradata User Group (TUG) conference. This customer has three major analytical environments: a Hadoop cluster for storing raw weblog data and performing bulk transformations of those data; a system based on Teradata Extreme Data Appliance technology for analysing those weblog data once stored as key-value pairs; and an Active Enterprise Data Warehouse, based on Teradata's EDW-class technology. When the organization calculated normalized cost-per-query metrics for these environments, it found that its Teradata Extreme Data Appliance-based system is less than half as expensive as its Hadoop-based system – and that the cost-per-query figures for the EDW and the Hadoop cluster are approximately equal.
What was most striking about this analysis – to me, anyway – was that the test query reviewed in detail was a simple table-scan-and-sum; you might reasonably expect more complex, less scan-oriented queries to increase the relative cost-per-query of the Hadoop-based system still further. Normalized performance comparisons, moreover, showed that the two Teradata systems were 20x and more than 100x faster, respectively, than the Hadoop system.
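The customer's normalization methodology wasn't published in detail, but the shape of such a calculation is simple enough: amortise each environment's full cost over a period and divide by the query volume it sustained in that period. A sketch, with invented figures purely for illustration:

```java
// Hypothetical normalisation: cost-per-query = (amortised TCO per year)
// / (queries executed per year). All figures below are invented purely
// to show the shape of the calculation, not the customer's numbers.
public class CostPerQuerySketch {

  static double costPerQuery(double annualTco, long queriesPerYear) {
    return annualTco / queriesPerYear;
  }

  public static void main(String[] args) {
    // Invented inputs: annual TCO (hardware amortisation + staff + power
    // and cooling) and annual query volume for each environment.
    double hadoopTco = 2_000_000, applianceTco = 3_000_000;
    long hadoopQueries = 100_000, applianceQueries = 350_000;

    System.out.printf("Hadoop:    %.2f per query%n",
        costPerQuery(hadoopTco, hadoopQueries));
    System.out.printf("Appliance: %.2f per query%n",
        costPerQuery(applianceTco, applianceQueries));
  }
}
```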
Likewise, I recently saw one of the Teradata-Aster product managers conduct an internal demonstration consisting of side-by-side testing of a two-node Hive / Hadoop cluster and a Teradata Aster-based system built on identical commodity hardware. The test demonstrated that the Aster-based system was orders of magnitude faster than the "equivalent" Hive / Hadoop system for a given set of test queries (the metric measured was query latency, i.e. individual query execution time). You could, of course, build out a very large Hive / Hadoop cluster to achieve the same latency for these queries. But the idea that a cluster of hundreds or even thousands of commodity servers is "cheap" – whether to deploy, maintain or operate – is clearly nonsense.
In defence of Hadoop
None of which is to say that Hadoop is not an extremely interesting and promising new technology – because clearly it is, and clearly a place on the plateau of productivity beckons. The major Teradata customer discussed earlier in this essay didn't end up with its very substantial Hadoop cluster by accident or through ignorance – and it isn't about to decommission it, either. Hadoop scales well – and Hadoop-based systems have a unit cost of storage that will increasingly make it possible for organizations to "remember everything", by enabling them to retain even those data whose value for analytics is as yet unproven.
Furthermore, Hadoop will enable new types of analytics whose full potential is only just becoming apparent. It will do so by enabling organizations to efficiently parallelize the complex algorithms required to process the new, multi-structured data that result from increasing digitization and the widespread deployment of sensor technology – and to generate from those data relational meta-data that can more easily be integrated with the organization's existing, structured data assets. In this scenario – what we might call "Big ETL" – Hadoop becomes the processing infrastructure that enables us to process raw, multi-structured data and move it into a "Big Analytic" environment – like Teradata-Aster – that can more efficiently support high-performance, high-concurrency manipulation of the data, whilst also providing improved usability and manageability, so that we can bring these data to a wider audience. The final stage in this "Big Data value chain" will then see us move the insights derived from processing the raw multi-structured data in these "upstream" environments into the Data Warehouse, where they can most easily and most efficiently be combined with other data – and shared with the entire organization, so that we can maximise their business value.
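Purely as a way of summarising that flow, here is an illustrative skeleton of the three stages – every type and method name below is hypothetical, standing in for what would in practice be jobs, loads and queries across three separate platforms:

```java
// A purely illustrative skeleton of the three-stage "Big Data value chain"
// described above. Nothing here is a real API; the point is the direction
// of flow, from raw data to organization-wide insight.
public class BigDataValueChainSketch {

  // Stage 1 - "Big ETL" on Hadoop: parallelise the complex algorithms that
  // turn raw, multi-structured data into relational meta-data.
  static String refineOnHadoop(String rawMultiStructuredData) {
    return "relational meta-data derived from: " + rawMultiStructuredData;
  }

  // Stage 2 - the "Big Analytic" environment: high-performance,
  // high-concurrency manipulation of the refined data.
  static String analyse(String relationalMetaData) {
    return "insights derived from: " + relationalMetaData;
  }

  // Stage 3 - the Data Warehouse: combine the insights with other data
  // and share them with the entire organization.
  static void publishToWarehouse(String insights) {
    System.out.println("loaded into warehouse: " + insights);
  }

  public static void main(String[] args) {
    publishToWarehouse(analyse(refineOnHadoop("raw weblog and sensor data")));
  }
}
```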
All of which, of course, is why Teradata continues to invest in partnerships with leading Hadoop distributors Cloudera and Hortonworks - and to develop and enhance integration technology between these environments and the Teradata and Teradata-Aster platforms.
But amidst all the legitimate excitement about the potential of this innovative new technology, let's also be clear about four things:
- Hadoop is not an RDBMS – and it is unclear whether the ecosystem growing up around it will ever become one;
- Hadoop is neither designed nor optimized to support interactive, "speed of thought" queries or high levels of query concurrency;
- "Open source" does not mean free – and Hadoop may or may not be "cheap", depending on what it is we are trying to achieve;
- The smart money says that Hadoop and related technologies will extend and enhance Data Warehouse solutions built on existing parallel RDBMS technology, not replace them altogether.
Mind you, a casual reading of some of the more excitable industry commentators might also lead you to believe that the very idea of a Data Warehouse is an outdated concept, destined very soon to go the way of the Dodo. But examining the merits of that particular argument will have to wait for another time and another post.
Director of Platform & Solutions Marketing