Your Data Warehouse on Hadoop

Monday May 26th, 2014

Following my blog series on “Big Data is Not Hadoop”, several people asked whether Hadoop will replace the enterprise Data Warehouse.

For those of you who prefer short answers:  my answer is an emphatic NO.  But do read on…

To understand why, let’s first look at the reasons for having a Data Warehouse:

  • It’s too hard to access the source systems one by one
  • With a Data Warehouse you integrate once, use many times
  • The Data Warehouse is system-independent; data survives source system changes
  • The Data Warehouse can keep history
  • A single point of access (and with a bit of luck, therefore, a single version of the truth)
  • Flexibility to support canned and ad-hoc queries

Now, a Hadoop solution is as good as a traditional Data Warehouse for the first four points.  This is because all of these are business reasons, agnostic to the underlying technology.

But what about the last two?

First: Can Hadoop provide a single version of the truth?

Well, it is not designed for that, but it can be made to provide it.  Hadoop does not, out of the box, provide transactionality.  This is not a problem for most traditional uses of Hadoop (when I analyse Twitter responses, losing a few tweets has no real effect on the result).  But if your Data Warehouse produces your company’s balance sheet, you will have to ensure that every single transaction has made it in.
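To make the difference concrete, here is a minimal sketch of the kind of reconciliation check a financial load needs and a “fire and forget” ingest skips.  All names and data here are illustrative, not a real API:

```python
# Hypothetical reconciliation step: every transaction in the source
# must be accounted for in the warehouse before the load is accepted.

def reconcile(source_txns, warehouse_txns):
    """Return transaction IDs present in the source but missing from
    the warehouse, so the load can be retried or flagged."""
    return sorted(set(source_txns) - set(warehouse_txns))

source = ["T001", "T002", "T003", "T004"]
loaded = ["T001", "T003", "T004"]        # T002 was dropped in flight

missing = reconcile(source, loaded)
assert missing == ["T002"]   # for a balance sheet, this must be empty
```

For tweet counting, a non-empty `missing` list is noise; for a balance sheet, it must abort the load.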

Second: can Hadoop provide the required performance in a mixed-load environment?

This is where my answer is emphatically NO.

For two reasons: performance and completeness of solution.

Hadoop is a glorified file system.  RDBMS performance stems from the sophistication of the optimizer, and IBM, Oracle, Microsoft, Teradata and the rest have invested a huge amount of time in creating robust optimizers.  If Hadoop is to compete, someone will need to devise a SQL optimizer for Hadoop that can match them.

At present there are several attempts at doing just that, but it is early days.  It will take years for Hadoop to develop a competitive optimizer that handles mixed-load queries over a highly distributed database.  By then, the existing optimizers will have leapfrogged it and will still provide the better enterprise solution.
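What does an optimizer actually do?  In essence, it estimates the cost of alternative execution plans and picks the cheapest.  The toy sketch below shows the shape of that decision for two join strategies; the cost formulas are invented for illustration, whereas real optimizers weigh thousands of alternatives using detailed table statistics:

```python
# Toy cost-based plan selection. The cost models are deliberately
# crude stand-ins for what a real RDBMS optimizer computes.

def nested_loop_cost(outer_rows, inner_rows):
    return outer_rows * inner_rows        # one inner scan per outer row

def hash_join_cost(build_rows, probe_rows):
    return build_rows * 3 + probe_rows    # build a hash table, then probe it

def choose_join(left_rows, right_rows):
    plans = {
        "nested_loop": nested_loop_cost(left_rows, right_rows),
        "hash_join": hash_join_cost(min(left_rows, right_rows),
                                    max(left_rows, right_rows)),
    }
    return min(plans, key=plans.get)      # cheapest estimated plan wins

assert choose_join(2, 3) == "nested_loop"          # tiny tables
assert choose_join(1000, 1_000_000) == "hash_join" # large tables
```

Getting these estimates right across mixed workloads on distributed data is exactly the hard, statistics-driven work the established vendors have spent decades on.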

The other problem is completeness of solution.  A Data Warehouse is not just a fast database.  It must supply enterprise-level scheduling, management, security, recovery, interoperability and support.  I don’t see this being provided by the open-source community.  And once a product is taken over by a traditional vendor, it loses that appeal; I only need to mention MySQL.

So there you are.  I said it.  Hadoop will NOT replace the Data Warehouse.  Let’s hope my statement will go down in history together with Ken Olsen’s famous “There is no reason for any individual to have a computer in his home”.

However, to quote Albert Einstein:  “No amount of experimentation can ever prove me right; a single experiment can prove me wrong.”


Ben Bor

Senior Solutions Architect at Teradata
Ben Bor is a Senior Solutions Architect at Teradata ANZ, a specialist in maximising the value of enterprise data.  He gained international experience on projects in Europe, America, Asia and Australia.  Ben counts some of the largest international conglomerates amongst his clients, including the UK tax office, Shell, Exxon, Credit Suisse, QBE, Woolworths, Westpac and others.  Ben is an international presenter on Information Management (IM) topics, having presented in Europe, Asia, the USA, Canada, NZ and Australia on subjects ranging from performance through data warehousing and data quality to social-media analysis and Big Data.  Ben has over 30 years’ experience in the IT industry (he wrote his first program in 1969, using punched cards).  Prior to joining Teradata, Ben worked for international consultancies for about 15 years (including CapGemini, HP and Logica), and for international banks before that.

One thought on “Your Data Warehouse on Hadoop”

  1. Glenn McCall

    Interesting blog, and I agree.
    Personally, I think of Hadoop not so much as a replacement for, or competitor to, a “traditional warehouse”, but as complementary to it.
    I agree with you that Hadoop is essentially a file system.  There are lots of tools overlaid on top of the file system that allow certain files to look like database tables (e.g. Hive), but at the end of the day these can’t hope to compete with a fully-fledged RDBMS (your “second” differentiator point).

    However, if we think of Hadoop as a tool that can be applied to a different problem space, we can think of it as complementary to Teradata.  The key feature of Hadoop, IMHO, is MapReduce, and to a lesser extent the distributed file system, HDFS.

    The free-form data that a file system (i.e. HDFS) permits, combined with the massively parallel, loosely coupled architecture that MapReduce enables (embodied in Hadoop and, for that matter, Aster), allows us to examine different forms of data in different ways (in the case of Aster, analysis of structured data in different ways).  Free-form text files (e.g. documents, blogs, Twitter posts and any number of other things), and even images, can be fed into various algorithms, including but not limited to machine-learning algorithms, to glean all sorts of new insights into customer sentiment, desires, patterns, relationships and other, less quantifiable, attributes than might be stored in the structured realm of an RDBMS such as Teradata.  MapReduce allows these files to be processed in parallel (somewhat like AMP Worker Tasks process parts of a table in parallel in Teradata).

    So IMHO, Hadoop, when properly used, enables us to gain new insights from new, less traditional, data sources. When combined with the knowledge held in Teradata the sky is the limit.
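The MapReduce pattern Glenn describes can be sketched in a few lines of Python.  This is a single-process illustration of the three phases only; a real Hadoop job distributes map, shuffle and reduce across a cluster, and the function names here are my own:

```python
# Classic MapReduce word count, collapsed into one process to show
# the shape of the computation: map -> shuffle by key -> reduce.
from collections import defaultdict

def map_phase(doc):
    # Emit a (word, 1) pair for every word in the document.
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    # Group all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is not Hadoop", "Hadoop is a file system"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
assert counts["hadoop"] == 2 and counts["is"] == 2
```

Because each document is mapped independently and each key is reduced independently, both phases parallelise naturally across machines, which is the point of the comparison with Teradata’s AMP Worker Tasks.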

