The best Strata session that I attended was the overview Kurt Brown gave of the Netflix data platform, which contained hype-deflating lessons and many chestnuts of tech advice straight from one of the most intense computing environments on the planet.
Brown, who as a director leads the design and implementation of the data platform, had a cheerful demeanor but demonstrated ruthless judgment and keen insight in his assessment of how various technologies serve the goals of Netflix. It was interesting to me how dedicated he was to both MPP SQL technology and to Apache™ Hadoop.
I attended the session with Daniel Graham, Technical Marketing Specialist of Teradata, who spoke with me afterward about the implications of the Netflix architecture and Brown’s point of view.
SQL Vs Hadoop
Brown rejected the notion that a complete data platform could be built using SQL technology alone or Hadoop alone. In his presentation, Brown explained how Netflix made great use of Hadoop, used Hive for various purposes, and had an eye on Presto, but also couldn’t live without Teradata and MicroStrategy.
Brown recalled a conversation in which another leader of a data platform explained that he was discarding all his data warehouse technology and putting everything on Hive. Brown’s response: “Why would you ever want to do that?”
While Brown said he enjoyed the pressure that open source puts on commercial vendors to improve, he was dedicated to using whatever technology could provide answers to questions in the most cost-effective manner. Brown said he was especially pleased that Teradata was going to be able to support a cloud-based implementation that could run at scale. Brown said that Netflix had upwards of 5 petabytes of data in the cloud, all stored on Amazon S3.
After the session, I pointed out to Graham that the pattern in evidence at Netflix, and at most of the companies acknowledged as leaders in big data, mirrors the recommendation of the white paper “Optimize the Value of All Your Enterprise Data,” which provides an overview of the Teradata Unified Data Architecture™.
The Unified Data Architecture recommends that the data with the most “business value density” be stored in an enterprise data warehouse powered by MPP SQL. This is the data used most often by the most users. Hadoop is used as a data refinery to process flat files or NoSQL data in batch mode.
Netflix is one of the big data companies that arrived at this pattern by adding SQL to a Hadoop infrastructure. Many well-known users of huge MPP SQL installations have arrived at it from the other direction, by adding Hadoop.
“Data doesn’t stay unstructured for long. Once you have distilled it, it usually has a structure that is well-represented by flat files,” said Teradata’s Graham. “This is the way that the canonical model of most enterprise activity is stored. Then the question is: How do you ask questions of that data? There are numerous ways to make this easy for users, but almost all of those ways pump out SQL that then is used to grab the data that is needed.”
Replacing MPP SQL with Hive or Presto is a non-starter because to really support hundreds or thousands of users who are pounding away at a lot of data, you need a way to provide speedy and optimized queries and also to manage the consumption of the shared resources.
“For over 35 years, Teradata has been working on making SQL work at scale for hundreds or thousands of people at a time,” said Graham. “It makes perfect sense to add SQL capability to Hadoop, but it will be a long time, perhaps a decade or more, before you will get the kind of query optimization and performance that Teradata provides. The big data companies use Teradata and other MPP SQL systems because they are the best tool for the job for making huge datasets of high business value density available to an entire company.”
Efforts such as Tez and Impala will clearly move the capabilities of Hive and of SQL on Hadoop in general forward. The question is how far forward and how fast. We will know that victory has been achieved when Netflix, which uses Teradata in a huge cloud implementation, is able to support its analytical workloads with other technology.
Graham predicts that in 5 years, Hadoop will be a good data mart but will still have trouble with complex parallel queries.
“It is common for a product like MicroStrategy to pump out SQL statements that may be 10, 20, or even 50 pages long,” said Graham. “When you have 5 tables, the complexity of the queries could be 5 factorial. With 50 tables, that grows to 50 factorial. Handling such queries is a 10- or 20-year journey. Handling them at scale is a feat that many companies can never pull off.”
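To put rough numbers on Graham's factorial remark, here is a minimal Python sketch. It rests on an assumption that is mine, not Graham's: that the "factorial" figure refers to the number of distinct orders in which a query optimizer could join n tables one after another. The function name join_orderings is hypothetical and chosen only for illustration.

```python
import math


def join_orderings(table_count: int) -> int:
    """Count the sequential orders in which `table_count` tables can be joined.

    Assumption (not stated in the article): the "factorial" complexity in the
    quote is the number of permutations of the tables, i.e. n! join orders.
    """
    if table_count < 0:
        raise ValueError("table_count must be non-negative")
    return math.factorial(table_count)


if __name__ == "__main__":
    # Compare the two table counts mentioned in the quote, plus one in between.
    for n in (5, 10, 50):
        print(f"{n} tables: {join_orderings(n):.3e} possible join orders")
```

Under that assumption, 5 tables allow 120 join orders, while 50 tables allow roughly 3 x 10^64, far too many to enumerate. That gap is why an optimizer has to prune and estimate rather than try every plan.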
Graham argues that most businesses need an MPP SQL data warehouse extended to support data discovery (for example, the Teradata Aster Discovery Platform), along with extensions for using Hadoop and graph analytics through enhanced SQL.
Teradata is working to demonstrate that the combined power of these technologies can temper some of the unrealistic enthusiasm surrounding Hadoop.