Total Cost of Big Data: a CFO’s Lesson from WinterCorp and Hortonworks

The recent webinar by Richard Winter and Bob Page hammered home key lessons about the cost of workloads running on Hadoop and data warehouses. Richard runs WinterCorp, a consulting company that has been implementing huge data warehouses for more than 20 years. Bob Page is Vice President of Products at Hortonworks; before that he ran big data projects at Yahoo! and eBay. The webinar explored Richard’s cost model for running various workloads on Hadoop and on an enterprise data warehouse (EDW). Richard built the cost model during a consulting engagement with a marketing executive at a large financial services company who was launching a big data initiative. She had some people coming to her saying “you should do it in Hadoop” and others saying “you should do it in the data warehouse.” Richard’s cost model helped her settle some of those debates.

The Total Cost of Data (TCOD) analysis results are the basis for the webinar. What separates Richard’s cost framework from most others is that it includes more than just upfront system costs. The TCOD model also includes five years of programmer labor, data scientist labor, end-user labor, maintenance and upgrades, plus power and cooling. Richard said there are 60 cost metrics in the model. He recommends that companies download the TCOD spreadsheet and insert their actual local costs, since system and labor costs differ by city and country.

For the Hadoop data lake workload (also called the data refinery), labor costs were fairly close between Hadoop and the data warehouse, while system costs favored Hadoop. For the data warehouse workload, the data warehouse system cost was high (remember the power and cooling?) while Hadoop’s labor costs skyrocketed. Long story short: Hadoop as a data lake is lower cost than a data warehouse, and the data warehouse is lower cost for complex queries and analytics.

There was general agreement that Hadoop is a cost-effective platform for ETL work, the staging of raw data and transforming it into refined value. But when asked “should we offload ELT/ETL to Hadoop?” Bob Page said:

“I think it’s going to be data dependent. It also depends on what the skills are in the organization. I experienced it myself when I was running big data platforms. If there is a successful implementation on the EDW today, there may be a couple of reasons why it makes sense to keep it there. One reason is there may be years and years of business logic encoded, debugged, and vetted. Moving that to another platform with its inherent differences, you might ask ‘what’s the value of doing that?’ It may take a couple of years to get that right, and in the end all you have done is migrate to another platform. I would prefer to invest those resources in adding additional value to the organization rather than moving sideways to another platform.”


When the data warehouse workload was costed out, Hadoop’s so-called $1,000 per terabyte turned out to be an insignificant part of the total. However, Hadoop’s cost skyrockets because thousands of queries must be hand-coded by high-priced Hadoop programmers and moderately priced Java programmers over five years. The OPEX side of the pie chart was huge when the data warehouse workload was applied to Hadoop.

Richard explained:

“The total cost of queries is much lower on the EDW than on Hadoop. SQL is a declarative language: you only have to tell it what you want. In Hadoop you use a procedural language. In Hadoop you have to tell the system how to find the data, how to bring it together, and what manipulations are needed to deliver the results. With the data warehouse, there is a sophisticated query optimizer that figures all that out automatically for you. The cost of developing the query on the data warehouse is lower because of the automation provided.”


Given the huge costs for Hadoop carrying a data warehouse workload, I asked Bob if he agreed with Richard’s assessment. “Does it pass the sniff test?” I asked. Bob Page replied:

“We don’t see anybody today trying to build an EDW with Hadoop. This is a capability issue, not a cost issue. Hadoop is not a data warehouse. Hadoop is not a database. Comparing these two for an EDW workload is comparing apples to oranges. I don’t know anybody who would try to build an EDW in Hadoop. There are many elements of the EDW on the technical side that are well refined and have been for 25 years. Things like workload management, the way concurrency works, and the way security works — there are many different aspects of a modern EDW that you are not going to see in Hadoop today. I would not see these two as equivalent. So, no, it doesn’t pass the sniff test.”

Bob’s point, in my opinion, is that the Hadoop-as-EDW cost model is invalid because Hadoop is not designed to handle EDW workloads. Richard said he “gave Hadoop the benefit of the doubt,” but I suspect the comparison was baked into his consulting contract with the marketing executive. Ultimately, Richard and Bob agree, just from different angles.

There are plenty of press articles and zealots on the web who will argue with these results. But Richard and Bob have hands-on credentials far beyond most people’s. They have worked with dozens of big data implementations from 500 TB to tens of petabytes. Please spend the time to listen to their webinar for an unbiased view. The biased view (me) didn’t say all that much during the webinar.

Many CFOs and CMOs are grappling with the question “When do we use Hadoop and when should we use the data warehouse?” Pass them the webinar link, call Richard, or call Bob.


Total Cost of Data Webinar

Big Data—What Does It Really Cost? (white paper)

The Real Cost of Big Data (Spreadsheet)

TCOD presentation slides (PDF)

5 thoughts on “Total Cost of Big Data: a CFO’s Lesson from WinterCorp and Hortonworks”


  David Sabater

    Quite a good and detailed analysis. I will just mention that the assumption that 1 line of SQL in the DW is equivalent to 5 lines of MapReduce/Java code only holds true for legacy Hadoop.
    With frameworks like Cascading/Cascalog/Scalding or Spark you can reduce this ratio to 1:1. If you change that in the spreadsheet you will see that Hadoop is a viable option even for the EDW, considering TCOD as well.
    Happy to be challenged on this 1:1 assumption :)
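[Editor’s note: David’s ratio argument is easy to picture. Higher-level frameworks collapse the hand-written grouping-and-aggregation plumbing into a single expression. A rough plain-Python illustration of the line-count effect (standing in for what Scalding or Spark provide on Hadoop, not actual framework code):]

```python
from collections import Counter

words = ["hadoop", "edw", "hadoop", "spark"]

# Legacy MapReduce style: hand-written grouping and counting.
counts_manual = {}
for w in words:
    counts_manual[w] = counts_manual.get(w, 0) + 1

# Higher-level framework style: one declarative-ish expression.
counts_framework = dict(Counter(words))

assert counts_manual == counts_framework  # same result, far fewer lines
```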


    Dan Graham (post author)

      Hadoop will never be an EDW, for a long list of reasons. Hadoop is not a cheap data warehouse, nor is that a good vision. I prefer Hadoop as a DataHub/DataLake, where its strengths are more obvious. Changing the spreadsheet does not change the fundamentals of Hadoop.
      Recently, Mike Olson, Chairman of Cloudera, endorsed the data warehouse as a peer in a larger unified data architecture (Strata + Hadoop World 2014). Bob Page of Hortonworks said in the webinar that Hadoop is not, nor can it be, a data warehouse. The reasons lie mostly in the design pattern that defines what a data warehouse is and is not. See Bill Inmon and Gartner for good definitions; Wikipedia is kind of a mess on this topic.

      There are times when new languages may be more succinct than SQL. But in today’s world, SQL is not written by programmers; it is generated by point-and-click tools used by business users, a.k.a. self-service. This lets business people learn about their data interactively. Thus it is more costly and time consuming to write SQL by hand than to use a GUI, especially when the generated SQL is 100-300 pages long. Hadoop is 3-5 years from having even a simple form of this in its various incarnations of SQL-on-Hadoop (Gartner, Curt Monash, Bloor, etc.).

      If you want to go deep on the SQL-to-Hadoop comparison, contact Richard Winter, who authored the TCOD study.

      David Sabater

        Hi Dan,
        Thanks for your reply.
        I do agree with you that Hadoop is not replacing the DWH as of today.
        I just wanted to point out that one of the key arguments against Hadoop is the TCO of maintaining and developing new analytics applications on the platform, due to costly development skills. Luckily, new frameworks are coming to the rescue to limit those costs; the same can be said about SQL on Hadoop with Hive, Parquet, Impala, Spark SQL, HAWQ, etc.
        I will contact Richard, as I want to be challenged on that ratio assumption.


