The recent webinar by Richard Winter and Bob Page hammered home key lessons about the cost of workloads running on Hadoop and data warehouses. Richard runs WinterCorp — a consulting company that has been implementing huge data warehouses for 20+ years. Bob Page is Vice President of Products for Hortonworks, and before that he was at Yahoo! and eBay running big data projects. The webinar explored Richard’s cost model for running various workloads on Hadoop and an enterprise data warehouse (EDW). Richard built the cost model during a consulting engagement with a marketing executive of a large financial services company who was launching a big data initiative. She had people coming to her saying “you should do it in Hadoop” and others saying “you should do it in the data warehouse.” Richard’s cost model helped her settle some debates.
The results of the Total Cost of Data (TCOD) analysis are the basis for the webinar. What separates Richard’s cost framework from most others is that it includes more than just upfront system costs. The TCOD cost model also includes five years of programmer labor, data scientist labor, end-user labor, maintenance upgrades, plus power and cooling. Richard said there are 60 cost metrics in the model. He recommends that companies download the TCOD spreadsheet and insert their actual local costs, since system and labor costs differ by city and country.
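To make the shape of that calculation concrete, here is a minimal sketch of a five-year total built from the cost categories named above. It is not Richard's spreadsheet, which has about 60 metrics. The class name, field names, and the simple "upfront plus five years of recurring cost" formula are my own assumptions for illustration, and any numbers you feed it should be your own local costs.

```python
from dataclasses import dataclass

# Five-year horizon, as described in the webinar.
YEARS = 5


@dataclass
class PlatformCosts:
    """Hypothetical, heavily simplified stand-in for a TCOD-style model."""

    system_upfront: float                 # one-time system purchase
    programmer_labor_per_year: float
    data_scientist_labor_per_year: float
    end_user_labor_per_year: float
    maintenance_upgrades_per_year: float
    power_cooling_per_year: float

    def recurring_per_year(self) -> float:
        # Sum of every cost that repeats each year.
        return (
            self.programmer_labor_per_year
            + self.data_scientist_labor_per_year
            + self.end_user_labor_per_year
            + self.maintenance_upgrades_per_year
            + self.power_cooling_per_year
        )

    def total(self, years: int = YEARS) -> float:
        # Assumption: flat annual costs, no inflation or discounting.
        return self.system_upfront + years * self.recurring_per_year()
```

Build one `PlatformCosts` per platform for the same workload and compare the two `total()` values. The point of the structure is that the upfront system price is only one of six inputs, so a cheap system can still lose once labor is multiplied over five years.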
For the Hadoop data lake workload (a.k.a. data refinery), labor costs were fairly close between Hadoop and the data warehouse, while system costs favored Hadoop. In the case of the data warehouse workload, the data warehouse system cost was high (remember the power and cooling?) while the Hadoop labor cost skyrocketed. Long story short, Hadoop is the lower-cost platform for a data lake, and the data warehouse is the lower-cost platform for complex queries and analytics.
There was general agreement that Hadoop is a cost-effective platform for ETL work – staging raw data and transforming it into refined value. But when asked “should we offload ELT/ETL to Hadoop?” Bob Page said:
“I think it’s going to be data dependent. It also depends on what the skills are in the organization. I experienced it myself when I was running big data platforms. If there is a successful implementation on the EDW today, there may be a couple reasons why it makes sense to keep it there. One reason is there may be years and years of business logic encoded, debugged, and vetted. Moving that to another platform with its inherent differences, you might ask ‘what’s the value of doing that?’ It may take a couple years to get that right and in the end all you have done is migrate to another platform. I would prefer to invest those resources in adding additional value to the organization rather than moving sideways to another platform.”
When the data warehouse workload was costed out, Hadoop’s so-called $1,000 per terabyte turned out to be an insignificant part of the total. Instead, Hadoop’s cost skyrockets because thousands of queries have to be manually coded by high-priced Hadoop programmers and moderately priced Java programmers over five years. The OPEX side of the pie chart was huge when the data warehouse workload was applied to Hadoop. Richard explained:
“The total cost of queries is much lower on the EDW than on Hadoop. SQL is a declarative language – you only have to tell it what you want. In Hadoop you use a procedural language. In Hadoop you have to tell the system how to find the data, how to bring it together, and what are the manipulations needed to deliver the results. With the data warehouse, there is a sophisticated query optimizer that figures all that out automatically for you. The cost of developing the query on the data warehouse is lower because of the automation provided.”
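The declarative versus procedural distinction is easier to see side by side. The sketch below is only an analogy, not Hadoop or EDW code: Python's built-in `sqlite3` stands in for the SQL engine, a hand-written loop stands in for procedural code, and the `PURCHASES` records and both function names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, amount) purchase records.
PURCHASES = [("ann", 40), ("bob", 15), ("ann", 25), ("cy", 5), ("bob", 30)]


def totals_declarative(rows):
    """State WHAT is wanted; the SQL engine decides how to compute it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE purchases (customer TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) FROM purchases "
            "GROUP BY customer HAVING SUM(amount) > 20 ORDER BY customer"
        )
        return cursor.fetchall()
    finally:
        conn.close()


def totals_procedural(rows):
    """Spell out HOW: group the records, sum each group, filter, sort."""
    grouped = defaultdict(int)
    for customer, amount in rows:
        grouped[customer] += amount  # bring matching records together
    kept = [(c, total) for c, total in grouped.items() if total > 20]
    return sorted(kept)  # sort by customer name
```

Both functions return the same per-customer totals for `PURCHASES`. The difference is where the work lives: in the first, one query string carries the whole intent, while in the second every step is code that someone has to write, test, and maintain. Multiply that by thousands of queries and you get the labor cost Richard is describing.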
Given the huge costs for Hadoop carrying a data warehouse workload, I asked Bob if he agreed with Richard’s assessment. “Does it pass the sniff test?” I asked. Bob Page replied:
“We don’t see anybody today trying to build an EDW with Hadoop. This is a capability issue not a cost issue. Hadoop is not a data warehouse. Hadoop is not a database. Comparing these two for an EDW workload is comparing apples to oranges. I don’t know anybody who would try to build an EDW in Hadoop. There are many elements of the EDW on the technical side that are well refined and have been for 25 years. Things like workload management, the way concurrency works, and the way security works: there are many different aspects of a modern EDW that you are not going to see in Hadoop today. I would not see these two as equivalent. So, no, it doesn’t pass the sniff test.”
Bob’s point, in my opinion, is that the Hadoop-as-EDW cost model is invalid because Hadoop is not designed to handle EDW workloads. Richard said he “gave Hadoop the benefit of the doubt,” but I suspect the comparison was baked into his consulting contract with the marketing executive. Ultimately, Richard and Bob reach the same conclusion from different angles.
There are a lot of press articles and zealots on the web who will dispute these results. But Richard and Bob have hands-on credentials far beyond those of most people. They have worked with dozens of big data implementations, from 500TB to tens of petabytes. Please spend the time to listen to their webinar for an unbiased view. The biased party (me) didn’t say all that much during the webinar.
Many CFOs and CMOs are grappling with the question “When should we use Hadoop and when should we use the data warehouse?” Pass them the webinar link, call Richard, or call Bob.