Following my blog series on “Big Data is Not Hadoop”, several people asked whether Hadoop will replace the enterprise Data Warehouse environment.
For those of you who prefer short answers: my answer is an emphatic NO. But do read on…
To understand why, let’s first look at the reasons for having a Data Warehouse:
- It’s too hard to access the source systems one by one
- With a Data Warehouse you integrate once, use many times
- The Data Warehouse is system-independent; data survives source system changes
- The Data Warehouse can keep history
- A single point of access (and with a bit of luck, therefore, a single version of the truth)
- Flexibility to support canned and ad-hoc queries
Now, a Hadoop solution is as good as a traditional Data Warehouse for the first four points. This is because all of these are business reasons and do not depend on the technology used to implement them.
But what about the last two?
First: Can Hadoop provide a single version of the truth?
Well, it is not designed for that, but it can be made to provide it. Hadoop does not, out of the box, provide transactional guarantees. This is not a problem for most traditional uses of Hadoop (when I analyse Twitter responses, losing a few tweets has no real effect on the result). But if your Data Warehouse provides your company’s balance sheet, you will have to ensure that every single transaction is loaded successfully.
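To make the balance-sheet point concrete, here is a minimal sketch of the kind of reconciliation check you would have to bolt on yourself. It is not a Hadoop API; the function name `reconcile` and the `id` and `amount` fields are my own assumptions for illustration.

```python
from decimal import Decimal


def reconcile(source_rows, loaded_rows):
    """Return True only if row count, ids and control total all match."""
    if len(source_rows) != len(loaded_rows):
        return False
    if {r["id"] for r in source_rows} != {r["id"] for r in loaded_rows}:
        return False
    # Decimal avoids floating-point drift when summing money amounts.
    source_total = sum((r["amount"] for r in source_rows), Decimal("0"))
    loaded_total = sum((r["amount"] for r in loaded_rows), Decimal("0"))
    return source_total == loaded_total


# Hypothetical ledger entries, invented for this example.
ledger = [
    {"id": 1, "amount": Decimal("100.00")},
    {"id": 2, "amount": Decimal("-40.25")},
    {"id": 3, "amount": Decimal("15.75")},
]

complete_load = list(ledger)   # every row arrived
lossy_load = ledger[:-1]       # one row was silently dropped

print(reconcile(ledger, complete_load))
print(reconcile(ledger, lossy_load))
```

For a Twitter analysis you would never bother with a check like this. For a balance sheet, a load that fails it must not be published.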
Second: Can Hadoop provide the required performance in a mixed-load environment?
Not yet, and for two reasons: performance and completeness of solution.
Hadoop is a glorified file system. RDBMS performance stems from the sophistication of its optimizer. IBM, Oracle, Teradata and the rest have invested a huge amount of time in creating a robust optimizer. If Hadoop is to compete, someone will need to devise a SQL optimizer for Hadoop, one that can compete with the likes of IBM, Oracle, Microsoft and Teradata.
At present there are several attempts at doing just that. But it’s early days. It will take years for Hadoop to develop a competitive optimizer that handles mixed-load queries over a highly distributed database. By then, the existing optimizers will have moved further ahead and will still provide a better enterprise solution.
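If you wonder why the optimizer matters so much, here is a toy sketch of one thing it does: picking a join order by estimated cost. The table sizes, selectivities and cost formula below are invented for illustration and are far simpler than anything a commercial optimizer uses.

```python
from itertools import permutations

# Hypothetical row counts per table.
ROWS = {"sales": 1_000_000, "customers": 10_000, "regions": 10}

# Hypothetical join selectivities; pairs not listed have no join
# condition and are treated as a cross join (selectivity 1.0).
SELECTIVITY = {
    frozenset(["sales", "customers"]): 1 / 10_000,
    frozenset(["customers", "regions"]): 1 / 10,
}


def plan_cost(order):
    """Sum the estimated sizes of all intermediate join results."""
    joined = [order[0]]
    rows = ROWS[order[0]]
    cost = 0
    for table in order[1:]:
        selectivity = 1.0
        for earlier in joined:
            selectivity *= SELECTIVITY.get(frozenset([earlier, table]), 1.0)
        rows = rows * ROWS[table] * selectivity
        cost += rows
        joined.append(table)
    return cost


def best_plan(tables):
    """Try every join order and keep the cheapest one."""
    return min(permutations(tables), key=plan_cost)


for order in permutations(ROWS):
    print(order, plan_cost(order))
print("chosen:", best_plan(list(ROWS)))
```

Even with three tables the estimated cost differs by roughly a factor of ten between the best and worst order. Exhaustive search like this stops being feasible after a handful of tables, which is where the years of vendor investment go.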
The other problem is completeness of solution. A Data Warehouse is not just a fast database. It must supply enterprise-level scheduling, management, security, recovery, interoperability and support. I don’t see this being provided by the open-source community. And once an open-source product is taken over by a traditional vendor, it loses its appeal. I only need to mention MySQL.
So there you are. I said it. Hadoop will NOT replace the Data Warehouse. Let’s hope my statement will not go down in history together with Ken Olsen’s famous “There is no reason for any individual to have a computer in his home”.
However, to quote Albert Einstein: “No amount of experimentation can ever prove me right; a single experiment can prove me wrong.”