The recent webinar by Richard Winter and Bob Page hammered home key lessons about the cost of workloads running on Hadoop and data warehouses.  Richard runs WinterCorp -- a consulting company that has been implementing huge data warehouses for 20+ years.   Bob Page is Vice President of Products for Hortonworks, and before that he was at Yahoo! and eBay running big data projects.  The webinar explored Richard’s cost model for running various workloads on Hadoop and an enterprise data warehouse (EDW).  Richard built the cost model during a consulting engagement with a marketing executive of a large financial services company who was launching a big data initiative.  She had people coming to her saying “you should do it in Hadoop” and others saying “you should do it in the data warehouse.”  Richard’s cost model helped her settle some debates.

The Total Cost of Data (TCOD) analysis results are the basis for the webinar.  What separates Richard’s cost framework from most others is that it includes more than just upfront system costs.  The TCOD cost model also includes five years of programmer labor, data scientist labor, end user labor, maintenance upgrades, plus power and cooling.  Richard said there are 60 cost metrics in the model.  He recommends that companies download the TCOD spreadsheet and insert their actual local costs, since system and labor costs differ by city and country.
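To make the shape of that calculation concrete, here is a minimal Python sketch of a five-year total: one upfront system cost plus the recurring categories listed above. It is not Richard's spreadsheet (which has about 60 metrics). The class name, field names, and function name are my own assumptions, and any numbers you plug in should be your own local costs.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """Hypothetical inputs for one platform; none of these names come from the TCOD model."""
    system_purchase: float              # upfront system cost
    annual_maintenance: float           # maintenance and upgrades, per year
    annual_power_cooling: float         # power and cooling, per year
    annual_programmer_labor: float      # programmer labor, per year
    annual_data_scientist_labor: float  # data scientist labor, per year
    annual_end_user_labor: float        # end user labor, per year


def total_cost_of_data(costs: CostInputs, years: int = 5) -> float:
    """Upfront cost plus `years` of recurring costs (a simplification: no inflation or discounting)."""
    recurring_per_year = (
        costs.annual_maintenance
        + costs.annual_power_cooling
        + costs.annual_programmer_labor
        + costs.annual_data_scientist_labor
        + costs.annual_end_user_labor
    )
    return costs.system_purchase + years * recurring_per_year
```

Running the same function once per platform and per workload gives totals you can compare side by side, which is the kind of comparison the webinar walks through.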

For the Hadoop data lake workload (a.k.a. data refinery), labor costs were fairly close between Hadoop and the data warehouse, while system costs favored Hadoop.  In the case of the data warehouse workload, the data warehouse system cost was high (remember the power and cooling?) while the Hadoop labor cost skyrocketed.  Long story short, Hadoop is the lower cost platform for a data lake, and the data warehouse is the lower cost platform for complex queries and analytics.

There was general agreement that Hadoop is a cost-effective platform for ETL work – staging raw data and transforming it into refined value.   But when asked “should we offload ELT/ETL to Hadoop?” Bob Page said:

“I think it’s going to be data dependent.  It also depends on what the skills are in the organization.  I experienced it myself when I was running big data platforms.  If there is a successful implementation on the EDW today, there may be a couple reasons why it makes sense to keep it there.  One reason is there may be years and years of business logic encoded, debugged, and vetted.  Moving that to another platform with its inherent differences, you might ask “what’s the value of doing that?” It may take a couple years to get that right and in the end all you have done is migrate to another platform.  I would prefer to invest those resources in adding additional value to the organization rather than moving sideways to another platform.”

 


When the data warehouse workload was costed out, Hadoop’s so-called $1,000 per terabyte turned out to be an insignificant part of the total.  Instead, Hadoop’s cost skyrockets because thousands of queries have to be manually coded by high-priced Hadoop programmers and moderately priced Java programmers over five years.  The OPEX side of the pie chart was huge when the data warehouse workload was applied to Hadoop.

Richard explained:

“The total cost of queries are much lower on the EDW than on Hadoop. SQL is a declarative language – you only have to tell it what you want.  In Hadoop you use a procedural language.  In Hadoop you have to tell the system how to find the data, how to bring it together, and what are the manipulations needed to deliver the results.  With the data warehouse, there is a sophisticated query optimizer that figures all that out automatically for you.  The cost of developing the query on the data warehouse is lower because of the automation provided.”
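Richard's declarative versus procedural distinction is easy to see in a toy example. The Python sketch below answers the same question, revenue by region, two ways: once by handing a SQL statement to a database (SQLite from the standard library stands in for the data warehouse), and once by hand-coding the lookup, join, and aggregation. The tables, sample rows, and function names are hypothetical, and the procedural version is plain Python rather than real Hadoop MapReduce code, so treat it as an illustration of the idea only.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer id, region) and (customer id, order amount).
customers = [(1, "East"), (2, "West"), (3, "East")]
orders = [(1, 120.0), (2, 80.0), (1, 40.0), (3, 60.0)]


def revenue_by_region_sql(customers, orders):
    """Declarative: state what you want; the database decides how to get it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    rows = conn.execute(
        "SELECT c.region, SUM(o.amount) "
        "FROM orders o JOIN customers c ON c.id = o.customer_id "
        "GROUP BY c.region"
    ).fetchall()
    conn.close()
    return dict(rows)


def revenue_by_region_procedural(customers, orders):
    """Procedural: spell out how to find the data, join it, and aggregate it."""
    region_of = {cid: region for cid, region in customers}  # how to find the data
    totals = defaultdict(float)
    for cid, amount in orders:                               # how to scan it
        totals[region_of[cid]] += amount                     # how to join and sum it
    return dict(totals)
```

Both functions return the same totals. The difference is who writes the "how": in the SQL version that work is left to the database, while in the procedural version every step is the programmer's job, which is where the labor cost in the model comes from.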

 

Given the huge costs for Hadoop carrying a data warehouse workload, I asked Bob if he agreed with Richard’s assessment. “Does it pass the sniff test?” I asked. Bob Page replied:

“We don’t see anybody today trying to build an EDW with Hadoop. This is a capability issue not a cost issue. Hadoop is not a data warehouse. Hadoop is not a database. Comparing these two for an EDW workload is comparing apples to oranges. I don’t know anybody who would try to build an EDW in Hadoop. There are many elements of the EDW on the technical side that are well refined and have been for 25 years. Things like workload management, the way concurrency works, and the way security works -- there are many different aspects of a modern EDW that you are not going to see in Hadoop today. I would not see these two as equivalent. So –no– it doesn’t pass the sniff test.”

Bob’s point – in my opinion – is that the Hadoop-as-EDW cost model is invalid since Hadoop is not designed to handle EDW workloads.   Richard said he “gave Hadoop the benefit of the doubt” but I suspect the comparison was baked into his consulting contract with the marketing executive.  Ultimately, Richard and Bob reach the same conclusion from different angles.

There are a lot of press articles and zealots on the web who will dispute these results.  But Richard and Bob have hands-on credentials far beyond those of most people.  They have worked with dozens of big data implementations from 500TB to tens of petabytes.  Please spend the time to listen to their webinar for an unbiased view.  The biased view – me – didn’t say all that much during the webinar.

Many CFOs and CMOs are grappling with the question “When do we use Hadoop and when should we use the data warehouse?”  Pass them the webinar link, call Richard, or call Bob.

 

Total Cost of Data Webinar

Big Data—What Does It Really Cost? (white paper)

The Real Cost of Big Data (Spreadsheet)

TCOD presentation slides (PDF)

Big Insights from Big Analytics Roadshow

Posted on: January 25th, 2013 by Teradata Aster

 

Last month in New York we completed the 4th and final event in the Big Analytics 2012 roadshow. This series of events shared ideas on practical ways to address the big data challenge in organizations and change the conversation from “technology” to “business value”. In New York alone, 500 people attended from across both business and IT, and we closed out the event with two speaker panels. The data science panel was, in my opinion, one of the most engaging and interesting panels I’ve ever seen at an event like this. The topic was whether organizations really need a data scientist (and what’s different about the skill set from other analytic professionals). Mike Gualtieri from Forrester Research did a great job leading and prodding the discussion.

Overall, these events were a great way to learn and network. The events had great speakers from cutting-edge companies, universities, and industry thought-leaders including LinkedIn, DJ Patil, Barnes & Noble, Razorfish, Gilt Groupe, eBay, Mike Gualtieri from Forrester Research, Wayne Eckerson, and Mohan Sawhney from Kellogg School of Management.

As an aside, I’ve long observed a historic disconnect between marketing groups and the IT organizations and data warehouses that support them. I noticed this first when I worked at Business Objects, where very few reporting applications ever included Web clickstream data. The marketing department always used a separate tool or application like WebSideStory (now part of Adobe) to handle this. There is a bridge being built to connect these worlds – both in terms of technology which can handle web clickstream and other customer interaction data, and in terms of new analytic techniques which make it easier for marketing/business analysts to understand their customers more intimately and better serve them a relevant experience.

We ran a survey at the events, and I wanted to share some top takeaways. The events were split into business and technical tracks with themes of “data science” and “digital marketing”. Thus, the survey data lets us compare the responses of attendees who were more interested in the technology content with those of attendees who were more interested in the business content. The survey data includes responses from 507 people in San Francisco, 322 in Boston, 441 in Chicago, and 894 in New York City for a total of 2164 respondents.

You can get the full set of graphs here, but here are a couple of my own observations / conclusions in looking at the data:

1)      “Who is talking about big data analytics in your organization?” - IT and Marketing were by far the largest responses, with nearly 60% of IT organizations and 43% of marketing departments talking about it. New York had a slightly higher share of CIOs and CEOs talking about big data, at 23% and 21%, respectively.

Survey Data: Figure 1

2)      “Where is big data analytics in your company?” - Across all cities, “customer interactions in Web/social/mobile” was the biggest area of big data analytics at 62%. With all the hype around machine/sensor data, it was surprising that it was being discussed in only 20% of organizations. Since web servers and mobile devices are machines, it would have been interesting to see what the “machine generated data” responses would have been if we had taken away the more specific example of customer interactions.

Survey Data: Figure 2

3)      This chart is a more detailed breakdown of the areas where big data analytics is found, broken down by city. NYC has a few more “other.” Some of the “other” answers in NYC included:

  1. Claims
  2. Client Data Cloud
  3. Development, and Data Center Systems
  4. Customer Solutions
  5. Data Protection
  6. Education
  7. Financial Transaction
  8. Healthcare data
  9. Investment Research
  10. Market Data
  11. Predictive Analytics (sales and servicing)
  12. Research
  13. Risk management/analytics
  14. Security

Survey Data: Figure 3

4)      “What are the Greatest Big Analytics Application Opportunities for Businesses Today?” – on average, general “data discovery or data science” was highest at 72%, with “digital marketing optimization” second at just under 60% of respondents. In New York, “fraud detection and prevention” at 39% was slightly higher than in other cities, perhaps tied to the number of financial institutions in attendance.

Survey Data: Figure 4

In summary, there are lots of applications for big data analytics, but it is important to have a discovery platform which supports iterative exploration of ALL types of data and can serve both business/marketing analysts and savvy data scientists. The divide between business groups like marketing and IT is closing. Marketers are more technically savvy and are the most demanding of analytic solutions which can harness the deluge of customer interaction data. They need to partner closely with IT to architect the right solutions which tackle “big analytics” and to provide the right toolsets for self-service access to this information without always requiring developer or IT support.

We are planning to sponsor the Big Analytics roadshow again in 2013 and take it international, as well. If you attended the event and have feedback or requests for topics, please let us know. I hear that there will be a “call for papers” going out soon. You can view the speaker bios & presentations from the Big Analytics 2012 events for ideas.