Hype, Hadoop and The Logical Data Warehouse

Posted on: June 24th, 2013 by Martin Willcox

“Hadoop has fundamentally transformed the economics of data management, making it possible to choose to keep all (of) one’s data, without an exorbitant, on-going investment in a cumbersome technology that can’t keep pace with the growth of data or the evolving needs of a business.”
Mike Olson (Cloudera), quoted from Warehouse = Relic?

Cloudera’s Mike Olson (congratulations on your new job, Mike) has again been making some pretty big claims for Hadoop recently, claims intended to imply that parallel RDBMS technology is “legacy” and that Hadoop – and specifically the Cloudera distro – is the future of information management.  This is ground that we have gone over in this blog before now, so I will be brief.

We can all agree that Hadoop has dramatically changed the economics of storing large volumes of noisy multi-structured data – like text documents, web-logs and machine logs – and of pre-processing them to strip out some of that noise.  And we can all agree that this means that the technology has a whole host of applications and will enable us to extend Enterprise Analytics to include new sources of data and new types of analytics.  But we shouldn’t get too carried away, because whilst the unit cost of storing these data may be falling, the multiplier is also increasing rapidly, as more and more data become available.

Another reason that we need to be careful where some of these claims are concerned is that the cost of storage is a (very) poor proxy for the cost of processing, which in turn is not the same thing as total cost of ownership (TCO).  This matters, because if you go to the time, trouble and expense of storing data, I assume that you plan to process it at least once.  And if you don’t, throw it away and reduce your cost of storage to zero.

Teradata customer eBay presented research at the 2011 XLDB conference that demonstrated clearly that the relative cost and performance of processing data in Hadoop and in a parallel, shared-nothing RDBMS environment vary wildly, depending on the nature of the processing.  These are technologies – and technologies have sweet spots.  The problem that Hadoop was invented to solve – “brute force” word counting and text indexing – is very different from the selection, projection, joining, sorting and aggregation of relational data that characterises much of “traditional” Business Intelligence and Analytics.  Olson knows all of this to be true, which is why Cloudera is pursuing the Impala initiative, which, at least in part, is intended to provide improved performance for simple reporting workloads.  Cloudera, by the way, will try to achieve this by having Impala bypass the MapReduce layer in Hadoop altogether and run directly against the Hadoop Distributed File System (HDFS) – in the process creating what has been described to me as a “proprietary open source software stack”.  Please don’t ask me what that means, because I am a simple man and have always laboured under the impression that software is either one – or the other.
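To make that contrast concrete, here is a minimal, illustrative sketch in plain Python (no Hadoop or RDBMS required; the data, table shapes and column names are invented for the example).  The first half expresses word counting in the map/reduce shape the framework was built for; the second half expresses the kind of join-and-aggregate query that an analytic RDBMS handles declaratively, with a single SQL statement and an optimiser choosing the execution plan.

```python
# Hypothetical sketch contrasting the two workload shapes discussed above.
from collections import defaultdict

# --- Workload 1: "brute force" word counting over noisy text (Hadoop's home turf) ---
docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

def map_phase(doc):
    # Emit (word, 1) pairs, in the style of a MapReduce mapper.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Sum the counts for each key, in the style of a MapReduce reducer.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_phase(p for doc in docs for p in map_phase(doc))
print(word_counts)  # e.g. {'the': 3, 'quick': 2, 'brown': 1, ...}

# --- Workload 2: a BI-style join and aggregation over structured, relational rows ---
# In an RDBMS this is one declarative statement, e.g. (illustrative SQL):
#   SELECT region, SUM(amount) FROM sales JOIN stores USING (store_id) GROUP BY region;
# Expressed imperatively, it is a join followed by a grouped aggregation:
sales = [(1, 100.0), (2, 250.0), (1, 75.0), (3, 60.0)]   # (store_id, amount)
stores = {1: "North", 2: "South", 3: "North"}            # store_id -> region

revenue_by_region = defaultdict(float)
for store_id, amount in sales:                    # join sales to stores on store_id...
    revenue_by_region[stores[store_id]] += amount # ...then group by region and aggregate
print(dict(revenue_by_region))  # {'North': 235.0, 'South': 250.0}
```

The point is not that either pattern is hard to write, but that the second one is the bread and butter of decision-support workloads – selection, joining and aggregation over structured data – which is exactly the territory where a mature, optimiser-driven parallel RDBMS has its sweet spot.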

Successful Information Management is also about much more than minimizing the cost of acquiring and processing data.  Facebook – one of the first and most aggressive adopters of Hadoop technology – has discussed recently at TDWI the merits of using Hadoop to augment and extend relational technology.  For Facebook, at least, “Big Data = Hadoop + Relational” – what Facebook’s Director of Analytics, Ken Rudin, calls “the genius of AND versus the tyranny of OR”.  Those of us who have been in the Enterprise Analytics industry for several decades know the folly of the “build it and they will come” approach.  Ordinary end-users – the ones that can’t program in Java and don’t have a PhD in statistics, i.e.: 99% of them – need systems that provide high levels of performance, scalability, flexibility, security and, above all, usability – and that support cross-functional analytics on integrated data.  The Hadoop ecosystem has a long way to go before it can match the leading Analytic RDBMS products on any of these measures, at least for the aforementioned “traditional” decision-support processing – and certainly if we think of “scalability” in terms of numbers of concurrent users, complexity of physical data models, complexity of queries, etc., rather than just in terms of mere data volumes.  Hadoop is, after all, a distributed file system and a parallel programming model, not a DBMS; and even rudimentary cost-based optimisation, for example, is a roadmap item for Impala.

Hadoop is cool and it is here to stay – as one component of an Enterprise Analytical Architecture.  We are incredibly excited about the potential of the technology to extend the Integrated Data Warehouse – watch this space for new announcements coming very soon that are testament to that excitement.  But because the “big data” problem space is not homogeneous and because the technologies are just that – technologies, with overlapping sweet spots – successful approaches to managing Analytic information during the next decade will require us to deploy multiple platforms.  Analyst group Gartner calls this evolution in Enterprise Analytical Architecture the “logical data warehouse” – and the same logic has driven the development of our own Unified Data Architecture.  Any technology can be made to look “cumbersome” and “exorbitantly expensive” if used inappropriately – witness this benchmark, which demonstrates how Teradata-Aster was able to outperform Hadoop by approximately 10x, on average, for clickstream analytics.  Might an 80-node Hadoop cluster have been able to match the performance of the 8-node Teradata-Aster system used in these tests?  Maybe, for the brute-force parallel scanning queries (which in any real-world implementation are not all of them, incidentally) – but an 80-node Hadoop cluster is not “free”, either to buy, to support or to run.

All of which is why Hadoop will change the world, but won’t displace all of the other “big data” technologies in the process.  Big Data are plural – and managing and exploiting them effectively is about AND, not OR.

Martin Willcox

One Response

  1. Paul Johnson

    June 25, 2013

    “Ordinary end-users – the ones that can’t program in Java and don’t have a PhD in statistics, i.e.: 99% of them”.

    Sorry to disappoint Martin, but *no way* do 1% of end users know Java and have a PhD in stats. In fact, I’ve never met anyone that claims both.

    As Stephen Brobst points out, a ‘data scientist’ is an analyst that lives in Silicon Valley. Tee-hee!

    End users expect to use SQL, plain and simple. SQL is a bolt-on for Hadoop, and not a very good one at that.

    As you say, each technology has a sweet spot and should be viewed and used accordingly.

    However, I’d agree that parallel RDBMS technology *is* legacy. The fact that I’ve been using Teradata for over 20 years supports that assertion. It doesn’t mean it has no value or is a dead end though.

    Linear scalability with enterprise class resilience and performance in support of complex ad hoc join, aggregate, sort and scan operations in a mixed workload environment is very, very hard to do.

    Hadoop is here to stay, no doubt about that, but it has to be used appropriately alongside other technologies.

    For a lot of folks Hadoop is currently a solution looking for a problem, and certainly not a viable replacement for an enterprise RDBMS, parallel or otherwise.

