Total Cost of Data

 

The best Strata session that I attended was the overview Kurt Brown gave of the Netflix data platform, which contained hype-deflating lessons and many chestnuts of tech advice straight from one of the most intense computing environments on the planet.

Brown, who as a director leads the design and implementation of the data platform, had a cheerful demeanor but demonstrated ruthless judgment and keen insight in his assessment of how various technologies serve the goals of Netflix. It was interesting to me how dedicated he was to both MPP SQL technology and to Apache™ Hadoop.

I attended the session with Daniel Graham, Technical Marketing Specialist of Teradata, who spoke with me afterward about the implications of the Netflix architecture and Brown’s point of view.

SQL vs. Hadoop
Brown rejected the notion that a complete data platform could be built on either SQL technology or Hadoop alone. In his presentation, Brown explained how Netflix made great use of Hadoop, used Hive for various purposes, and had an eye on Presto, but also couldn’t live without Teradata and MicroStrategy.

Brown recalled a conversation in which another leader of a data platform explained that he was discarding all his data warehouse technology and putting everything on Hive. Brown’s response: “Why would you ever want to do that?”

While Brown said he enjoyed the pressure that open source puts on commercial vendors to improve, he was dedicated to using whatever technology could answer questions in the most cost-effective manner. He was especially pleased that Teradata was going to be able to support a cloud-based implementation that could run at scale; Netflix, he said, had upwards of 5 petabytes of data in the cloud, all stored on Amazon S3.

After the session, I pointed out to Graham that the pattern in evidence at Netflix, and at most of the companies acknowledged as leaders in big data, mimics the recommendation of the white paper “Optimize the Value of All Your Enterprise Data,” which provides an overview of the Teradata Unified Data Architecture™.

The Unified Data Architecture recommends that the data with the most “business value density” be stored in an enterprise data warehouse powered by MPP SQL; this is the data used most often and by the most users. Hadoop serves as a data refinery, processing flat files or NoSQL data in batch mode.

Netflix is a big data company that arrived at this pattern by adding SQL to a Hadoop infrastructure. Many well-known users of huge MPP SQL installations have arrived at the same pattern from the other direction by adding Hadoop.

“Data doesn’t stay unstructured for long. Once you have distilled it, it usually has a structure that is well represented by flat files,” said Teradata's Graham. “This is the way the canonical model of most enterprise activity is stored. Then the question is: How do you ask questions of that data? There are numerous ways to make this easy for users, but almost all of those ways pump out SQL that is then used to grab the data that is needed.”

Replacing MPP SQL with Hive or Presto is a non-starter: to support hundreds or thousands of users pounding away at a lot of data, you need speedy, optimized queries and a way to manage consumption of shared resources.

“For over 35 years, Teradata has been working on making SQL work at scale for hundreds or thousands of people at a time,” said Graham. “It makes perfect sense to add SQL capability to Hadoop, but it will be a long time, perhaps a decade or more, before you will get the kind of query optimization and performance that Teradata provides. The big data companies use Teradata and other MPP SQL systems because they are the best tool for the job for making huge datasets of high business value density available to an entire company.”

Efforts such as Tez and Impala will clearly move SQL-on-Hadoop capability forward. The question is how far forward and how fast. We will know that victory has been achieved when Netflix, which uses Teradata in a huge cloud implementation, is able to support its analytical workloads with other technology.

Graham predicts that in 5 years, Hadoop will be a good data mart but will still have trouble with complex parallel queries.

“It is common for a product like MicroStrategy to pump out SQL statements that may be 10, 20, or even 50 pages long,” said Graham. “When you have 5 tables, the complexity of the queries could be 5 factorial. With 50 tables, that grows to 50 factorial. Handling such queries is a 10- or 20-year journey. Handling them at scale is a feat that many companies can never pull off.”
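
To put the factorial figures in perspective: they describe how fast the space of possible join orderings grows as tables are added to a query, which is the search problem a query optimizer has to tame. The short Python sketch below is my own illustration, not something from Graham’s remarks, and it assumes a naive planner that would consider every permutation of join order.

```python
import math

# Number of join orderings a naive planner would have to consider if it
# enumerated every permutation of the tables in a query: n factorial.
for n in (5, 10, 20, 50):
    print(f"{n:2d} tables -> {math.factorial(n):.3e} possible join orders")

# Sample output:
#  5 tables -> 1.200e+02 possible join orders
# 50 tables -> 3.041e+64 possible join orders
```

Mature optimizers prune this space with heuristics and cost models rather than enumerating it, which is the hard-won engineering Graham argues takes decades to build.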

Graham acknowledges that most businesses need an MPP SQL data warehouse extended to support data discovery (e.g., the Teradata Aster Discovery Platform), along with extensions for using Hadoop and graph analytics through enhanced SQL.

Teradata is working to demonstrate that the power of this collection of technology can address some of the unrealistic enthusiasm surrounding Hadoop.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

 

Teradata has unveiled new certified solutions that dramatically enhance backup, recovery, and disaster recovery (DR) capabilities for Teradata Database users. The new certifications cover EMC’s latest industry-leading Data Domain deduplication storage systems and Teradata’s latest Data Stream Architecture (DSA), delivering new data protection options through Teradata’s Backup Archive and Recovery (BAR) solution for its database customers.

The Teradata/EMC relationship itself is not entirely new. Teradata already incorporates Data Domain systems as part of its Advocated BAR data protection offering. In fact, the relationship goes deeper: Teradata also sells, optimizes, and provides Level 1 support for Data Domain when it is sold as part of its BAR solution.

Teradata’s certification of the new Data Domain DD7200 is the latest in a series of recent advances by Teradata and EMC geared specifically to deliver better data protection for Teradata Database customers.

A 1-2…3! Punch That Delivers Better Teradata Protection

Things got better for Teradata customers in October 2013 when Teradata announced its Data Stream Architecture (DSA), a new architecture designed to optimize data streaming from Teradata databases and increase the performance of its BAR solution. The DSA introduces features such as a new stream layout and larger data blocks that improve Data Domain deduplication. Through close collaboration with EMC, Teradata also certified Data Domain Operating System 5.4 in October 2013, and now the DD7200, bringing enormous backup performance improvements to customers. The combination of the DSA architecture, the latest DD OS, and the DD7200 protection storage system results in up to a 101% performance improvement over previous Data Domain-based Teradata BAR solution offerings.

As Chris Twogood, Vice President, Product and Services Marketing from Teradata, said, “Teradata and EMC are committed to providing the best backup solutions to our mutual customers.  This is why both companies engaged early on to complete certification of our latest respective technologies in concert with the release of Teradata’s new Data Stream Architecture. The combination of Teradata DSA with Data Domain is an effective and reliable backup and recovery solution for Teradata Databases and offers a fully automated and highly effective DR solution with network efficient replication.”

Today, Teradata supports both the DD890 and the new DD7200.  To date, many petabytes of data from Teradata Databases are being protected by Data Domain systems sold by Teradata as part of BAR.

For more information about Teradata BAR and Data Domain, please:

  • Visit the Teradata BAR webpage
  • Read the Teradata Magazine article “Too Much of the Same” on leveraging data deduplication technologies for better Teradata Database protection, and
  • Read the Teradata Magazine article “Bionic Backup” for more details on the Teradata Data Stream Architecture (DSA)

For additional insights into transformational data protection technologies and techniques, please visit the EMC backup and archive community.

By: Anselmo Barrero, Director of Business Development, EMC Data Protection and Availability Division

Data-Driven Business and the Need for Speed

Posted on: February 14th, 2014 by Guest Blogger

 

What stands in the way of effective use of data? Let’s face it: it’s people as much as technology. Silos of data and expertise cause friction in many companies. To create a winning strategy for data-driven business, you need fast, effective collaboration between teams, something much more like a pit crew than an org chart. To gain speed, you must find a way to break the silos between groups.

In most organizations, multiple teams are involved with data. Data may be used directly, passed on to another stage as an input, or both. Each use case involves end users, user interface designers, data scientists, business analysts, programmers, and operations people.

For BI to work optimally it must break through at least three silos:

  • Silo 1: BI architects
  • Silo 2: Data analysts
  • Silo 3: Business users

First, those focusing on data prep, programming and advanced analysis must not only work together but must be directed by the needs of the business users, ideally with a tight feedback loop. You don’t just want to get something in the hands of employees; you want to get the right thing in front of each of them, and that means communicating and understanding their needs.

Other types of silos exist between lines of business. For an insight to have maximum impact, it must find its way to those who need it most. Everyone must be on the lookout for signals that can be used by other lines of business and pass them along.

Silos Smashed: An End-to-End View
There’s no way to achieve an end-to-end view without breaking down silos. Usage statistics can show what types of analysis are popular, which increases transparency. Cross-functional teams that include all stakeholders (BI architects, data analysts, and business users) also help in breaking down barriers.

With the silos smashed, you can acquire value from data faster. Speed is key and breaking down silos can do for BI what pit crews do for racing. Data-driven business moves fast, and we must find more efficient ways of working together, cutting time off each iteration as we move toward real-time delivery of the right data to the right person at the right time.

For example, we should be able to see how changing a price affects customers and impacts the bottom line. And we should be able to do that on a store manager’s mobile device in real time, not wait until the back office runs the report and comes down from Mount Olympus with the answer. Immediacy of data for decision-making: that’s what drives competitive advantage.

While I'm at Strata, I'll be looking for new ideas about how to break the silos and speed time to value from data. I’m interested to hear your thoughts as well.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

To learn more:

Optimize the Value of All the Data white paper
The Intelligent Enterprise infographic

 

The recent webinar by Richard Winter and Bob Page hammered home key lessons about the cost of workloads running on Hadoop and data warehouses. Richard runs WinterCorp, a consulting company that has been implementing huge data warehouses for more than 20 years. Bob Page is Vice President of Products at Hortonworks; before that he ran big data projects at Yahoo! and eBay. The webinar explored Richard’s cost model for running various workloads on Hadoop and on an enterprise data warehouse (EDW). Richard built the cost model during a consulting engagement with a marketing executive of a large financial services company who was launching a big data initiative. She had people coming to her saying “you should do it in Hadoop” and others saying “you should do it in the data warehouse.” Richard’s cost model helped her settle some of those debates.

The Total Cost of Data (TCOD) analysis is the basis for the webinar. What separates Richard’s cost framework from most others is that it includes more than just upfront system costs. The TCOD model also includes five years of programmer labor, data scientist labor, end-user labor, maintenance and upgrades, plus power and cooling. Richard said there are 60 cost metrics in the model. He recommends that companies download the TCOD spreadsheet and insert actual local costs, since system and labor costs differ by city and country.
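
To make the shape of the model concrete, here is a minimal sketch of a five-year total-cost-of-data calculation. The categories and dollar figures are placeholders of my own invention, not values from Richard’s TCOD spreadsheet, which tracks roughly 60 cost metrics; the point is simply that recurring labor and operating costs are multiplied over five years and added to the upfront system cost.

```python
# Illustrative skeleton of a TCOD-style calculation (placeholder numbers only).
YEARS = 5  # TCOD looks at five years of ownership, not just the initial purchase

upfront_system_cost = 2_000_000        # hardware and software licenses

annual_costs = {
    "maintenance_and_upgrades": 300_000,
    "power_and_cooling": 150_000,
    "programmer_labor": 900_000,       # query and ETL development
    "data_scientist_labor": 400_000,
    "end_user_labor": 250_000,
}

total_cost_of_data = upfront_system_cost + YEARS * sum(annual_costs.values())
print(f"Five-year total cost of data: ${total_cost_of_data:,}")
# Five-year total cost of data: $12,000,000
```

Because the recurring terms dominate, plugging in local labor rates, as Richard suggests, can swing the comparison far more than the sticker price of either platform.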

For the Hadoop data lake workload (a.k.a. the data refinery), labor costs were fairly close between Hadoop and the data warehouse, while system costs favored Hadoop. For the data warehouse workload, the data warehouse system cost was high (remember the power and cooling?) while the Hadoop labor costs skyrocketed. Long story short: Hadoop as a data lake is lower cost than a data warehouse, and the data warehouse is lower cost for complex queries and analytics.

There was general agreement that Hadoop is a cost-effective platform for ETL work, that is, staging raw data and transforming it into refined value. But when asked “Should we offload ELT/ETL to Hadoop?” Bob Page said:

“I think it’s going to be data dependent. It also depends on what the skills are in the organization. I experienced it myself when I was running big data platforms. If there is a successful implementation on the EDW today, there may be a couple of reasons why it makes sense to keep it there. One reason is there may be years and years of business logic encoded, debugged, and vetted. Moving that to another platform with its inherent differences, you might ask ‘what’s the value of doing that?’ It may take a couple of years to get that right, and in the end all you have done is migrate to another platform. I would prefer to invest those resources in adding additional value to the organization rather than moving sideways to another platform.”

 


When the data warehouse workload was costed out, Hadoop’s so-called $1,000 per terabyte turned out to be an insignificant part of the total. However, Hadoop’s cost skyrockets because thousands of queries must be coded by hand, by high-priced Hadoop programmers and moderately priced Java programmers, over five years. The OPEX side of the pie chart was huge when the data warehouse workload was applied to Hadoop.

Richard explained:

“The total cost of queries is much lower on the EDW than on Hadoop. SQL is a declarative language; you only have to tell it what you want. In Hadoop you use a procedural language: you have to tell the system how to find the data, how to bring it together, and what manipulations are needed to deliver the results. With the data warehouse, there is a sophisticated query optimizer that figures all that out automatically for you. The cost of developing the query on the data warehouse is lower because of the automation provided.”
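
As a toy illustration of Richard’s declarative-versus-procedural point (my own example, not one from the webinar), here is the same “revenue per region” question asked both ways: once as a SQL statement the warehouse optimizer can plan for you, and once as a hand-written MapReduce-style routine where the programmer spells out every step. The table and field names are invented for the sketch.

```python
from collections import defaultdict

# Declarative: describe the result you want; the warehouse's query optimizer
# decides how to scan, join, and aggregate.
SQL = """
SELECT region, SUM(amount) AS revenue
FROM   sales
GROUP  BY region
"""

# Procedural (MapReduce-style): the programmer writes out how to find the data,
# bring it together, and produce the result.
def map_phase(records):
    for rec in records:                      # read the raw records
        yield rec["region"], rec["amount"]   # emit key/value pairs

def reduce_phase(pairs):
    totals = defaultdict(float)
    for region, amount in pairs:             # group and aggregate by hand
        totals[region] += amount
    return dict(totals)

sales = [{"region": "EMEA", "amount": 120.0},
         {"region": "EMEA", "amount": 30.0},
         {"region": "APAC", "amount": 75.5}]
print(reduce_phase(map_phase(sales)))        # {'EMEA': 150.0, 'APAC': 75.5}
```

Every such routine has to be written, debugged, and maintained by hand, which is where the labor costs in the TCOD model for the data warehouse workload on Hadoop come from.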

 

Given the huge costs for Hadoop carrying a data warehouse workload, I asked Bob if he agreed with Richard’s assessment. “Does it pass the sniff test?” I asked. Bob Page replied:

“We don’t see anybody today trying to build an EDW with Hadoop. This is a capability issue, not a cost issue. Hadoop is not a data warehouse. Hadoop is not a database. Comparing these two for an EDW workload is comparing apples to oranges. I don’t know anybody who would try to build an EDW in Hadoop. There are many elements of the EDW on the technical side that are well refined and have been for 25 years. Things like workload management, the way concurrency works, and the way security works: there are many different aspects of a modern EDW that you are not going to see in Hadoop today. I would not see these two as equivalent. So, no, it doesn’t pass the sniff test.”

Bob’s point, in my opinion, is that the Hadoop-as-EDW cost model is invalid because Hadoop is not designed to handle EDW workloads. Richard said he “gave Hadoop the benefit of the doubt,” but I suspect the comparison was baked into his consulting engagement with the marketing executive. Ultimately, Richard and Bob agree from different angles.

There are a lot of press articles and zealots on the web who will argue with these results. But Richard and Bob have hands-on credentials far beyond most people: they have worked with dozens of big data implementations ranging from 500 TB to tens of petabytes. Please spend the time to listen to their webinar for an unbiased view. The biased view (me) didn’t say all that much during the webinar.

Many CFOs and CMOs are grappling with the question “When should we use Hadoop, and when should we use the data warehouse?” Pass them the webinar link, call Richard, or call Bob.

 

Total Cost of Data Webinar

Big Data—What Does It Really Cost? (white paper)

The Real Cost of Big Data (spreadsheet)

TCOD presentation slides (PDF)