Hadoop

 

It happens every few years and it’s happening again. A new technology comes along and a significant segment of the IT and business community wants to toss out everything we’ve learned over the past 60 years and start fresh. We “discover” that we’ve been wasting time applying unnecessary rigor and bureaucracy to our projects. No longer should we have to wait three to six months or longer to deliver technical solutions to the business. We can turn these things around in three to six days or even less.

In the mid-1990s, I was part of a team that developed a “pilot” object-oriented, client-server (remember when these were the hot buzzwords?) application to replenish raw materials for a manufacturing function. We were upending the traditional mainframe world by delivering a solution quickly and iteratively with a small team. When the end users started using the application in real life, it was clear they were going to rely on it to do their jobs every day. Wait, was this a pilot or…? I would come into work in the morning, walk into a special room that housed the application and database servers, check the logs, note any errors, make whatever fixes needed to be made, re-run jobs, and so on.

It wasn’t long before this work began to interfere with my next project, and the end users became frustrated when I wasn’t available to fix problems quickly. It took us a while and several conversations with operations to determine that “production” didn’t just mean “the mainframe”. “Production” meant that people were relying on the solution on a regular basis to do their jobs. So we backtracked and started talking about what kind of availability guarantees we could make, how backup and recovery should work, how we could transition monitoring and maintenance to operations, and so on. In other words, we realized what we needed was a traditional IT project that just happened to leverage newer technologies.

This same scenario is happening today with Hadoop and related tools. When I visit client organizations, a frightening number will have at least one serious person saying something like, “I really don’t think ‘data warehousing’ makes sense any more. It takes too long. We should put all our data in Hadoop and let our end users access whatever they want.” It is indeed a great idea to establish an environment that enables exploration and quick-turnaround analysis against raw data and production data. But to position this approach as a core data and analytics strategy is nothing short of professional malpractice.

The problem is that people are confusing experimentation with IT projects. There is a place for both, and there always has been. Experimentation (or discovery, research, ad-hoc analysis, or whatever term you wish to use) should have lightweight processes and data management practices – it requires prioritization of analysis activity, security and privacy policies and implementation, some understanding of available data, and so on, but it should not be overburdened with the typical rigor required of projects that are building solutions destined for production. Once a prototype is ready to be used on a regular basis for important business functions, that solution should be built through a rigorous IT project leveraging an appropriate – dare I say it – solution development life cycle (SDLC), along with a comprehensive enterprise architecture plan including, yes, a data warehouse that provides integrated, shared, and trusted production data.

An experimental prototype should never be “promoted” to a production environment. That’s what a project is for. Experimentation can be accomplished with Hadoop, relational technology, Microsoft Office, and many other technologies. These same technologies can also be used for production solutions. So, it’s not that “things are done differently and more quickly in Hadoop”. Instead, it’s more appropriate to say that experimentation is different from an IT project, regardless of technology.

Yes, we should do everything we can to reduce unnecessary paperwork and to speed up delivery using proper objective setting, scoping, and agile development techniques. But that is different than abandoning rigor altogether. In fact, using newer technologies in IT projects requires more attention to detail, not less, because we have to take the maturity of the technology into consideration. Can it meet the service level needs of a particular solution? This needs to be asked and examined formally within the project.

Attempting to build production solutions using ad-hoc, experimental data preparation and analysis techniques is like building a modern skyscraper with a grass hut mentality. It just doesn’t make any sense.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries. 

Taking Charge of Your Data Lake Destiny

Posted on: April 30th, 2014 by Cesar Rojas

 

One of the most interesting areas of my job is having the opportunity to take an active role in helping shape the future of big data analytics. A significant part of this is the maturation of the open source offerings available to customers and how they can help address today’s analytic conundrums. Customers are constantly looking for new and effective ways to organize their data and they want to build the systems that will empower them to be successful across their organizations. But with the proliferation of data and the rise of data types and analytical models, solving this challenge is becoming increasingly complex.

One of the solutions that has become popular is the data lake. The idea of a data lake emerged as users created new types of data that needed to be captured and exploited across the enterprise. The concept is tied quite closely to Apache Hadoop and its ecosystem of open source projects, so, since two of my main focus areas (big data analytics and Hadoop) come together here, it is an area to which I pay close attention. Data lakes are designed to tackle some of the emerging big data challenges by offering a new way to organize and build the next generation of systems, and they provide a cost-effective and technologically refined way to approach and solve big data challenges. While data lakes are an important component of the logical data warehouse – they are designed to give users choices for managing and utilizing data within their analytical ecosystem – many users are also finding that the data lake is a natural evolution of their existing Apache Hadoop ecosystem and data architecture.

Where do we begin? Quite simply, several questions need to be answered before you start down this path. For instance, it’s important to understand how the data lake relates to your existing enterprise data warehouse and how the two work together; quite possibly the most important question is, “What best practices should be leveraged to ensure the resulting strategy drives business value?”

A recent white paper written by CITO Research and sponsored by Teradata and Hortonworks takes a close look at the data lake and provides answers to all of the above questions, and then some. Without giving away too much of the detail, I thought I would capture a few of the points in the paper that impressed me most.

The data lake has come a long way since its initial entry onto the big data scene. Its first iteration carried several limitations that made it daunting to general users. The original data lakes were batch-oriented, offered very limited ability for users to interact with the data, and required expertise with MapReduce and other scripting and query tools. Those factors, among others, limited wide adoption. Today, however, the landscape is changing. With the arrival of Hadoop 2, and more specifically the 2.1 release of the Hortonworks platform, data lakes are evolving. New Hadoop projects bring better resource management and application multi-tenancy, allowing multiple workloads to run on the same cluster and enabling users from different business units within an organization to effectively refine, explore, and enrich data. Today, enterprise Hadoop is a full-fledged data lake, with new capabilities being added all the time.

While the capabilities of the data lake have evolved over the last few years, so has the world of big data. Companies everywhere started creating data lakes to complement the capabilities of their data warehouses, but now they must also tackle creating a logical data warehouse in which the data lake and the enterprise data warehouse are each maximized individually and yet support each other as well as possible.

The enterprise data warehouse plays a critical role in solving big data challenges, and together with the data lake, the possibilities can deliver real business value. The enterprise data warehouse is a highly engineered, sophisticated system that provides a single version of the truth that can be used over and over again. Like a data lake, it supports batch workloads. Unlike a data lake, the enterprise data warehouse also supports simultaneous use by thousands of concurrent users performing reporting and analytic tasks.

There are several impressive uses for a data lake, and several beneficial outcomes can result. It is well worth learning more about data lakes and how they can help you store and process data at low cost, create a distributed form of analytics, and work together with the enterprise data warehouse as a hybrid, unified system that empowers users to ask questions that can be answered by more data and more analytics with less effort. To start learning about these initiatives, download our whitepaper here.

By Cesar Rojas - bio link 

Take a Giant Step with Teradata QueryGrid

Posted on: April 29th, 2014 by Dan Graham

 

Teradata 15.0 has gotten tremendous interest from customers and the press because it enables SQL access to native JSON data. This heralds the end of the belief that data warehouses can’t handle unstructured data. But there’s an equally momentous innovation in this release called Teradata QueryGrid.

What is Teradata QueryGrid?
In Teradata’s Unified Data Architecture (UDA), there are three primary platforms: the data warehouse, the discovery platform, and the data platform. In the UDA diagram, huge gray arrows represent data flowing between these systems; a year or two ago, those flows were extract files moved in batch mode.

Teradata QueryGrid is both a vision and a technology. The vision – simply said – is that a business person connected to the Teradata Database or Aster Database can submit a single SQL query that joins data from a second system for analysis. There’s no need to plead with the programmers to extract data and load it into another machine. The business person doesn’t have to care where the data is – they can simply combine relational tables in Teradata with tables or flat files found in Hadoop, on demand. Imagine a data scientist working on an Aster discovery problem and needing data from Hadoop. By simply adjusting the queries she is already using, she can fetch Hadoop data and combine it with tables in the Aster Database. That should be a huge “WOW” all by itself, but let’s look further.

You might be saying “That’s not new. We’ve had data virtualization queries for many years.” Teradata QueryGrid is indeed a form of data virtualization. But Teradata QueryGrid doesn’t suffer from the normal limitations of data virtualization such as slow performance, clogged networks, and security concerns.

Today, the vision is translated into reality as connections between Teradata Database and Hadoop as well as Aster Databases and Hadoop. Teradata QueryGrid also connects the Teradata Data Warehouse to Oracle databases. In the near future, it will extend to all combinations of UDA servers such as Teradata to Aster, Aster to Aster, Teradata to Teradata, and so on.

Seven League Boots for SQL
With QueryGrid, you can add a clause in a SQL statement that says “Call up Hadoop, pass Hive a SQL request, receive the Hive results, and join it to the data warehouse tables.” Running a single SQL statement spanning Hadoop and Teradata is amazing in itself – a giant step forward. Notice too that all the database security, advanced SQL functions, and system management in the Teradata or Aster system is supporting these queries. The only effort required is for the database administrator to set up a “view” that connects the systems. It’s self-service for the business user after that. Score: complexity zero, business users one.
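
To make that concrete, here is a minimal sketch of the kind of setup being described. The database, table, and server names (dw.sales, weblog_clicks, hdp_server, and so on) are illustrative assumptions rather than objects from this post, and the exact foreign-server syntax varies by Teradata release, so treat this as a sketch, not reference documentation:

    -- One-time DBA setup: a view that joins a local warehouse table to a
    -- Hive table reached through a QueryGrid foreign server.
    CREATE VIEW mart.sales_with_weblogs AS
    SELECT s.customer_id,
           s.order_total,
           w.page_path,
           w.session_id
    FROM   dw.sales                 s
    JOIN   weblog_clicks@hdp_server w   -- remote Hadoop data, fetched on demand
    ON     s.customer_id = w.customer_id;

    -- Self-service for the business user after that: ordinary SQL against the view.
    SELECT   customer_id, SUM(order_total) AS revenue, COUNT(*) AS clicks
    FROM     mart.sales_with_weblogs
    GROUP BY customer_id;

Everything after the CREATE VIEW is plain SQL, which is exactly the "complexity zero, business users one" point above.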

Parallel Performance, Performance, Performance
Historically, data virtualization tools lack the ability to move data between systems in parallel. Such tools send a request to a remote database and the data comes back serially through an Ethernet wire. Teradata QueryGrid is built to connect to remote systems in parallel and exchange data through many network connections simultaneously. Wanna move a terabyte per minute? With the right configurations it can be done. Parallel processing by both systems makes this incredibly fast. I know of no data virtualization system that does this today.

Inevitably, the Hadoop cluster will have a different number of servers than the Teradata or Aster MPP systems. The Teradata and Aster systems start the parallel data exchange by matching up units of parallelism between the two systems. That is, all the Teradata parallel workers (called AMPs) connect to a buddy Hadoop worker node for maximum throughput. Any time the configuration changes, the worker match-up changes. This is non-trivial, rocket-science-class technology. Trust me – you don’t want to do this yourself, and the worst situation would be to expose it to the business users. But Teradata QueryGrid does it all for you, completely invisibly to the user.

Put Data in the Data Lake FAST
Imagine that complex predictive analytics using R® or SAS® are run inside the Teradata data warehouse as part of a merger and acquisition project. In this case, we want to pass the results to the Hadoop data lake, where they are combined with temporary data from the company being acquired. With moderately simple SQL stuffed in a database view, the answers calculated by the Teradata Database can be sent to Hadoop to help finish up some reports. Bi-directional data exchange is another breakthrough in Teradata QueryGrid, new in release 15.0. The common thread in all these innovations is that the data moves from the memory of one system to the memory of the other. No extracts, no landing the data on disk until the final processing step – and sometimes not even then.
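
A hedged sketch of the export direction described here; the remote table and server names are hypothetical, and the precise insert-into-foreign-table syntax depends on the QueryGrid connector and release:

    -- Push scored results from the warehouse into a Hadoop table through the
    -- same foreign-server mechanism: memory to memory, no extract files.
    INSERT INTO acquisition_scores@hdp_server      -- hypothetical remote Hive table
    SELECT customer_id,
           churn_score,
           propensity_to_buy
    FROM   dw.merger_model_output
    WHERE  score_date = CURRENT_DATE;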

Push-down Processing
What we don’t want to do is transfer terabytes of data from Hadoop and throw away 90% of it since it’s not relevant. To minimize data movement, Teradata QueryGrid sends the remote system SQL filters that eliminate records and columns that aren’t needed. An example constraint could be “We only want records for single women age 30-40 with an average account balance over $5000. Oh, and only send us the account number, account type, and address.” This way, the Hadoop system discards unnecessary data so it doesn’t flood the network with data that will be thrown away. After all the processing is done in Hadoop, data is joined in the data warehouse, summarized, and delivered to the user’s favorite business intelligence tool.
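
Expressed as SQL, the example constraint above would look roughly like the following; the point is that the filters and the narrow column list are shipped to Hadoop so only the surviving rows cross the network (table and column names are again illustrative assumptions):

    SELECT a.account_number,
           a.account_type,
           a.address
    FROM   accounts_hive@hdp_server a    -- filters are pushed to the remote system
    WHERE  a.marital_status = 'Single'
      AND  a.gender = 'F'
      AND  a.age BETWEEN 30 AND 40
      AND  a.avg_balance > 5000;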

Teradata QueryGrid delivers some important benefits:
• It’s easy to use: any user with any BI tool can do it
• Low DBA labor: it’s mostly setting up views and testing them once
• High performance: reducing hours to minutes means more accuracy and faster turnaround for demanding users
• Cross-system data on demand: don’t get stuck in the programmers’ work queue
• Teradata/Aster strengths: security, workload management, system management
• Minimum data movement improves performance and reduces network use
• Move the processing to the data

Big data is now taking giant steps through your analytic architecture --frictionless, invisible, and in parallel. Nice boots!

LA kicks off the 2014 Teradata User Group Season

Posted on: April 22nd, 2014 by Guest Blogger

 

By Rob Armstrong,  Director, Teradata Labs Customer Briefing Team

After presenting at Teradata User Group meetings for years, I found it refreshing to see some changes in this roadshow.  While I had my usual spot on the agenda to present Teradata’s latest database release (15.0), there were some hot new topics, including cloud and Hadoop; more business-level folks in the audience; more companies researching Teradata’s technology (versus just current users); and a hands-on workshop the following day for the more technically inclined, walking through real-world Unified Data Architecture™ (UDA) use cases from a Teradata customer.  While LA tends to be a smaller venue than most, the room was packed and we had 40 percent more attendees than last year.

I would be remiss if I did not give a big thanks to the partner sponsors of the user group meeting.  In LA we had Hortonworks and Dot Hill as our gold and silver sponsors.  I took a few minutes to chat with them and found out about some interesting upcoming items.  Most notably, Lisa Sensmeier from Hortonworks talked to me about Hadoop Summit, which is coming up in San Jose, June 3-5.  Jim Jonez, from Dot Hill, gave me the latest on their newest “Ultra Large” disk technology, which packs 48 one-terabyte drives into a single 2U enclosure.  It is not in the Teradata lineup yet, but we are certainly intrigued for the proper use case.

Now, I’d like to take a few minutes to toot my own horn about the Teradata Database 15.0 presentation, which had some very exciting elements that will help change the way users get to and analyze all of their data.  You may have seen the recent news releases, but if not, here is a quick recap:

  • Teradata 15.0 continues our Unified Data Architecture™ with the new Teradata QueryGrid.  This is the new environment for defining and accessing data from Teradata on other data servers such as Apache Hadoop (Hortonworks), the Teradata Aster Discovery Platform, Oracle, and others, and it lays the foundation for extending to even more foreign data servers.  15.0 simplifies the whole definition and usage, and adds bi-directional data movement and predicate pushdown.  In a related session, Cesar Rojas provided some good recent examples of customers taking advantage of the entire UDA ecosystem, where data from all of the Teradata offerings was used together to generate new actions.
  • The other big news in 15.0 is the inclusion of the JSON data type.  This allows customers to store JSON documents directly in a column and then apply “schema on read” for much greater flexibility with greatly reduced IT resources.  As the JSON document changes, no table or database changes are necessary to absorb the new content (see the sketch just after this list).
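
As a rough illustration of the JSON data type and the schema-on-read idea (the table, document fields, and column size below are made up for the example, and the exact dot-notation and extraction options vary by release):

    -- A column declared as JSON stores whole documents; when the documents
    -- later gain new fields, no ALTER TABLE is needed to absorb them.
    CREATE TABLE web_events (
        event_id INTEGER,
        payload  JSON(32000)
    );

    -- Schema on read: pull individual fields out of the document at query time.
    SELECT t.event_id,
           t.payload.customer.name AS customer_name,
           t.payload.device.os     AS device_os
    FROM   web_events t;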

Keep your eyes and ears open for the next Teradata User Group event coming your way, or better yet, just go to the webpage: http://www.teradata.com/user-groups/ to see where the bus stops next and to register.  The TUGs are free of charge.  Perhaps we’ll cross paths as I make the circuit? Until then, ‘Keep Calm and Analyze On’ (as the cool kids say).

 Since joining Teradata in 1987, Rob Armstrong has worked in all areas of the data warehousing arena.  He has gone from writing and supporting the database code to implementing and managing systems at some of Teradata’s largest and most innovative customers.  Currently Rob provides sales and marketing support by traveling the globe and evangelizing the Teradata solutions.

 

The best Strata session that I attended was the overview Kurt Brown gave of the Netflix data platform, which contained hype-deflating lessons and many nuggets of tech advice straight from one of the most intense computing environments on the planet.

Brown, who as a director leads the design and implementation of the data platform, had a cheerful demeanor but demonstrated ruthless judgment and keen insight in his assessment of how various technologies serve the goals of Netflix. It was interesting to me how dedicated he was to both MPP SQL technology and to Apache™ Hadoop.

I attended the session with Daniel Graham, Technical Marketing Specialist of Teradata, who spoke with me afterward about the implications of the Netflix architecture and Brown’s point of view.

SQL vs. Hadoop
Brown rejected the notion that it was possible to build a complete data platform using either SQL technology or Hadoop alone. In his presentation, Brown explained how Netflix made great use of Hadoop, used Hive for various purposes, and had an eye on Presto, but couldn’t live without Teradata and MicroStrategy either.

Brown recalled a conversation in which another leader of a data platform explained that he was discarding all his data warehouse technology and putting everything on Hive. Brown’s response: “Why would you ever want to do that?”

While Brown said he enjoyed the pressure that open source puts on commercial vendors to improve, he was dedicated to using whatever technology could provide answers to questions in the most cost-effective manner. Brown said he was especially pleased that Teradata was going to be able to support a cloud-based implementation that could run at scale. Brown said that Netflix had upwards of 5 petabytes of data in the cloud, all stored on Amazon S3.

After the session, I pointed out to Graham that the pattern in evidence at Netflix and most of the companies acknowledged as leaders in big data mimics the recommendation of the white paper “Optimize the Value of All Your Enterprise Data,” which provides an overview of the Teradata Unified Data Architecture™.

The Unified Data Architecture recommends that the data with the most “business value density” be stored in an enterprise data warehouse powered by MPP SQL. This data is used most often and by the most users. Hadoop is used as a data refinery to process flat files or NoSQL data in batch mode.

Netflix is a big data company that arrived at this pattern by adding SQL to a Hadoop infrastructure. Many well-known users of huge MPP SQL installations have added Hadoop.

“Data doesn’t stay unstructured for long. Once you have distilled it, it usually has a structure that is well-represented by flat files,” said Teradata's Graham. “This is the way that the canonical model of most enterprise activity is stored. Then the question is: How do you ask questions of that data? There are numerous ways to make this easy for users, but almost all of those ways pump out SQL that is then used to grab the data that is needed.”

Replacing MPP SQL with Hive or Presto is a non-starter because to really support hundreds or thousands of users who are pounding away at a lot of data, you need a way to provide speedy and optimized queries and also to manage the consumption of the shared resources.

“For over 35 years, Teradata has been working on making SQL work at scale for hundreds or thousands of people at a time,” said Graham. “It makes perfect sense to add SQL capability to Hadoop, but it will be a long time, perhaps a decade or more, before you will get the kind of query optimization and performance that Teradata provides. The big data companies use Teradata and other MPP SQL systems because they are the best tool for the job for making huge datasets of high business value density available to an entire company.”

Efforts such as Tez and Impala will clearly move SQL-on-Hadoop capability forward. The question is how far forward and how fast. We will know that victory has been achieved when Netflix, which uses Teradata in a huge cloud implementation, is able to support its analytical workloads with other technology.

Graham predicts that in 5 years, Hadoop will be a good data mart but will still have trouble with complex parallel queries.

“It is common for a product like Microstrategy to pump out SQL statements that may be 10, 20, or even 50 pages long,” said Graham. “When you have 5 tables, the complexity of the queries could be 5 factorial. With 50 tables, that grows to 50 factorial. Handling such queries is a 10- or 20-year journey. Handling them at scale is a feat that many companies can never pull off.”
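
For a sense of that scale: 5! is 120 possible join orders, while 50! is roughly 3 × 10^64 – a search space far too large to enumerate, which is why mature cost-based query optimization matters.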

Graham acknowledges that most businesses need an MPP SQL data warehouse extended to support data discovery (e.g., the Teradata Aster Discovery Platform), along with extensions for using Hadoop and graph analytics through enhanced SQL.

Teradata is working to demonstrate that the power of this collection of technology can address some of the unrealistic enthusiasm surrounding Hadoop.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

 

In years past, Strata has celebrated the power of raw technology, so it was interesting to note how much the keynotes on Wednesday focused on applications, models, and how to learn and change rather than on speeds and feeds.

After attending the keynotes and some fascinating sessions, it seems clear that the blinders are off. Big data and data science have been proven in practice by many innovators and early adopters. The value of new forms of data and methods of analysis is so well established that there’s no need for exaggerated claims. Hadoop can do so many cool things that it doesn’t have to pretend to do everything, now or in the future. Indeed, the pattern in place at Facebook, Netflix, the Obama Campaign, and many other organizations with muscular data science and engineering departments is that MPP SQL and Hadoop sit side by side, each doing what they do best.

In his excellent session, Kurt Brown, Director, Data Platform at Netflix, recalled someone explaining that his company was discarding its data warehouse and putting everything on Hive. Brown responded, “Why would you want to do that?” What was obvious to Brown, and what he explained at length, is that the most important thing any company can do is assemble technologies and methods that serve its business needs. Brown demonstrated the logic of creating a broad portfolio that serves many different purposes.

Real Value for Real People
The keynotes almost all celebrated applications and models. Vendors didn’t talk about raw power, but about specific use cases and ease-of-use. Farrah Bostic, a marketing and product design consultant, recommended ways to challenge assumptions and create real customer intimacy. This was a key theme: Use the data to understand a person in their terms not yours. Bostic says you will be more successful if you focus on creating value for the real people who are your customers instead of extracting value from some stilted and limited model of a consumer. A skateboarding expert and a sports journalist each explained models and practices for improving performance. This is a long way from the days when a keynote would show a computer chewing through a trillion records.

Geoffrey Moore, the technology and business philosopher, was in true provocative form. He asserted that big data and data science are well on their way to crossing the chasm because so many upstarts pose existential threats to established businesses. This pressure will force big data to cross the chasm and achieve mass adoption. His money quote: "Without big data analytics, companies are blind and deaf, wandering out onto the Web like deer on the freeway.”

An excellent quote to be sure, but it goes too far. Moore would have been more accurate and less sensational if he said, “Without analytics,” not “Without big data analytics.” The reason that MPP SQL and Hadoop have made such a perfect pair is because more than one type of data and method of analysis is needed. Every business needs all the relevant data it can get to understand the people it does business with.

The Differentiator: A Culture of Analytics
The challenge I see companies facing lies in creating a culture of analytics. Tom Davenport has been a leader in promoting analytics as a means to competitive advantage. In his keynote at Strata Rx in September 2013, Davenport stressed the importance of integration.

In his session at Strata this year, Bill Franks, Chief Analytics Officer at Teradata, put it quite simply, "Big data must be an extension of an existing analytics strategy. It is an illusion that big data can make you an analytics company."

When people return from Strata and roll up their sleeves to get to work, I suspect that many will realize that it’s vital to make use of all the data in every way possible. But one person can only do so much. For data to have the biggest impact, people must want to use it. Implementing any type of analytics provides supply. Leadership and culture create demand. Companies like CapitalOne and Netflix don’t do anything without looking at the data.

I wish there were a shortcut to creating a culture of analytics, but there isn’t, and that’s why it’s such a differentiator. Davenport’s writings are probably the best guide, but every company must figure this out based on its unique situation.

Supporting a Culture of Analytics
If you are a CEO, your job is to create a culture of analytics so that you don’t end up like Geoffrey Moore’s deer on the freeway. But if you have Kurt Brown’s job, you must create a way to use all the data you have, to use the sweet spot of each technology to best effect, and to provide data and analytics to everyone who wants them.

At a company like Netflix or Facebook, creating such a data supply chain is a matter of solving many unique problems connected with scale and advanced analytics. But for most companies, common patterns can combine all the modern capabilities into a coherent whole.

I’ve been spending a lot of time with the thought leaders at Teradata lately and closely studying their Unified Data Architecture. Anyone who is seeking to create a comprehensive data and analytics supply chain of the sort in use at leading companies like Netflix should be able to find inspiration in the UDA, as described in a white paper called “Optimizing the Business Value of All Your Enterprise Data.”

The paper does excellent work in creating a framework for data processing and analytics that unifies all the capabilities by describing four use cases: the file system, batch processing, data discovery, and the enterprise data warehouse. Each of these use cases focuses on extracting value from different types of data and serving different types of users. The paper proposes a framework for understanding how each use case creates data with different business value density. The highest volume interaction takes place with data of the highest business value density. For most companies, this is the enterprise data warehouse, which contains a detailed model of all business operations that is used by hundreds or thousands of people. The data discovery platform is used to explore new questions and extend that model. Batch processing and processing of data in a file system extract valuable signals that can be used for discovery and in the model of the business.

While this structure doesn’t map exactly to that of Netflix or Facebook, for most businesses, it supports the most important food groups of data and analytics and shows how they work together.

The refreshing part of Strata this year is that thorny problems of culture and context are starting to take center stage. While Strata will always be chock full of speeds and feeds, it is even more interesting now that new questions are driving the agenda.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

 

In the Star Trek movies, “the Borg” refers to an alien race that conquers all planets, absorbing the people, technology, and resources into the Borg collective. Even Captain Picard becomes a Borg and chants “We are the Borg. You will be assimilated. Resistance is futile.”

It strikes me that the relational database has behaved similarly since its birth. Over the last thirty years, Teradata and other RDBMS vendors have innovated and modernized, constantly revitalizing what it means to be an RDBMS. But some innovations come from start-up companies that are later assimilated into the RDBMS. And some innovations are reactions to competition. Regardless, many innovations eventually end up in the code base of multiple RDBMS vendor products --with proper respect to patents of course. Here are some examples of cool technologies assimilated into Teradata Database:

• MOLAP cubes storm the market in the late 1990s with Essbase setting the pace and Cognos inventing desktop cubes. MicroStrategy and Teradata team up to build push-down ROLAP SQL into the database for parallel speed. Hyperion Essbase and Teradata also did Hybrid OLAP integration together. Essbase gets acquired, MOLAP cubes fall out of fashion, and in-database ROLAP goes on to provide the best of both worlds as CPUs get faster.

• Early in the 2000s, a startup called Sunopsis shows a distinct advantage of running ELT transformations in-database to get parallel performance with Teradata. ELT takes off in the industry like a rocket. Teradata Labs also collaborates with Informatica to push-down PowerCenter transformation logic into SQL for amazing extract, load, and transform speed. Sunopsis gets acquired. More ETL vendors adopt ELT techniques. Happy DBAs and operations managers meet their nightly batch performance goals. More startups disappear.

• XML and XQuery become the rage in the press -- until most every RDBMS adds a data type for XML, plus shred and unshred operators. XML-only database startups are marginalized.

• The uptick of predictive analytics in the market drives collaboration between Teradata and SAS back in 2007. SAS Procs are pushed-down into the database to run massively parallel, opening up tremendous performance benefits for SAS users. This leads to many RDBMS vendors who copy this technique; SAS is in the limelight, and eventually even Hadoop programmers want to run SAS in parallel. Later we see “R,” Fuzzy Logix, and others run in-database too. Sounds like the proverbial win-win to me.

• In-memory technology from QlikView and TIBCO Spotfire excites the market with order-of-magnitude performance gains. Several RDBMS vendors then adopt in-memory concepts. But in-memory has limitations on memory size and cost vis-à-vis terabytes of data. Consequently, Teradata introduces Teradata Intelligent Memory, which caches hot data automatically in memory while managing many terabytes of hot and cold data on disk. The hottest two to three percent of the data – the data most popular with users – is managed by temperature, delivering superfast response time. Cool! Or is it hot?

• After reading the Google research paper on MapReduce, a startup called AsterData invents SQL-MapReduce (SQL-MR) to add flexible processing to a flexible database engine. This cool innovation leads Teradata to acquire AsterData. Within a year, Aster strikes a nerve across the industry – MapReduce is in-database! This month, Aster earns numerous #1 scores in Ovum’s “Decision Matrix: Selecting an Analytic Database 2013-14” (January 2014). The race is on for MapReduce in-database!

• The NoSQL community grabs headlines with their unique designs and reliance on JSON data and key-value pairs. MongoDB is hot, using JSON data, while Couchbase and Cassandra leverage key-value stores. Teradata promptly decides to add JSON data (unstructured data) to the database and goes the extra mile to put JSONPath syntax into SQL. Teradata also adds the name-value-pair SQL operator (NVP) to extract JSON or key-value data from weblogs (a small sketch follows this list). Schema-on-read technology gets assimilated into the Teradata Database. Java programmers are pleased. Customers make plans. More wins.
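
A hedged sketch of the kind of extraction that last bullet describes; the weblog column and the delimiter arguments are illustrative, and the NVP operator’s exact parameter list varies by release:

    -- Pull one value out of a name-value-pair weblog query string.
    SELECT NVP(query_string,    -- e.g. 'src=email&cmp=spring_sale&uid=12345'
               'cmp',           -- name to extract
               '&',             -- pair delimiter
               '=')             -- name/value delimiter
           AS campaign          -- returns 'spring_sale'
    FROM   weblog_raw;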

--------------------------------------------------------------------------------------------------------

“One trend to watch going forward, in addition to the rise of multi-model NoSQL databases, is the integration of NoSQL concepts into relational databases. One of the methods used in the past by relational database vendors to restrict the adoption of new databases to handle new data formats has been to embrace those formats within the relational database. Two prime examples would be support for XML and object-oriented programming.”
- Matt Aslett, The 451 Group, Next-Generation Operational Databases 2012-2016, Sep 17, 2013

--------------------------------------------------------------------------------------------------------

I’ve had conversations with other industry analysts and they’ve confirmed Matt’s opinion: RDBMS vendors will respond to market trends, innovations, and competitive threats by integrating those technologies into their offering. Unlike the Borg, a lot of these assimilations by RDBMS are friendly collaborations (MicroStrategy, Informatica, SAS, Fuzzy Logix, Revolution R, etc.). Others are just the recognition of new data types that need to be in the database (JSON, XML, BLOBs, geospatial, etc.).

Why is it good to have all these innovations inside the major RDBMSs? Everyone is having fun right now with their science projects because hype is very high for this startup or that startup or this shiny new thing. But when it comes time to deploy production analytic applications to hundreds or thousands of users, all the “ities” suddenly become critical – “ities” that the new kids don’t have and the RDBMS does: reliability, recoverability, security, and availability. Companies like Google can bury shiny new 1.oh-my-god quality software in an army of brilliant computer scientists. But Main Street and Wall Street companies cannot.

More important, many people are doing new multi-structured data projects in isolation -- such as weblog analysis, sensor data, graph analysis, or social text analysis. Soon enough they discover the highest value comes from combining that data with all the rest of the data that the organization has collected on customers, inventories, campaigns, financials, etc. Great, I found a new segment of buyer preferences. What does that mean to campaigns, sales, and inventory? Integrating new big data into an RDBMS is a huge win going forward – much better than keeping the different data sets isolated in the basement.

Like this year’s new BMW or Lexus, RDBMSs modernize; indeed, they define modern. But relational database systems don’t grow old; they don’t rust or wear out. RDBMSs evolve to stay current and constantly introduce new technology.

We are the RDBMS! Technology will be assimilated. Resistance is futile.

Evaluating and Planning for the Real Costs of Big Data

Posted on: January 16th, 2014 by Dan Graham

 

In a blog I posted in early December, I talked about the total cost of big data. That post, and today’s follow-up post, stem from a webinar that I moderated between Richard Winter, President of WinterCorp, a consultancy specializing in massive databases, and Bob Page, VP of Products at Hortonworks. During the webinar we discussed how to successfully calibrate and calculate the total cost of data and walked through important lessons about the costs of running workloads on various platforms, including Hadoop. If you haven’t listened to the webinar yet, I recommend you do so.

From the discussion we had during that session and from resulting conversations I have had since, I wanted to address some of the key takeaways we discussed about how to be successful when tackling such a large challenge within your organization. Here are a few key points to consider:

1. Start Small: As Bob Page said, “It’s very easy to dream big and go overboard with these projects, but the key to success is starting small.” Have your first project be a straightforward proof of concept. There are undoubtedly going to be challenges when you are starting your first big data project, but if you can start at a smaller level and build your knowledge and capabilities, your odds of success for the larger projects improve. Don’t make your first venture out of the gate an attempt at a gargantuan project or a huge amount of data. When you have some positive results, you will also have the confidence and sanction to build bigger solutions.

2. Address the Entire Scope of Costs: Rather than focusing only on upfront purchasing costs, a total cost of data evaluation must incorporate all possible costs, reflecting an estimate of owning and using data over time for analytic purposes. The framework that Richard developed allows you to do exactly that: estimate the total cost of a big data initiative. During the webinar, Richard discussed the five components of system costs:

  • the hardware acquisition costs
  • the software acquisition costs
  • what you pay for support
  • what you pay for upgrades
  • and what you pay for environmental/infrastructure costs – power and cooling.

According to Richard, we need to estimate the CAPEX and OPEX over five years.  Based on his extensive experience, he also recommends a moderate annual growth assumption of 26 percent in system capacity. In my experience, most data warehouses double in size every three years, so Richard’s assumption is realistic (see the quick arithmetic check below). Thus the business goal, coupled with the CAPEX and OPEX thresholds year by year, helps keep the team focused.  For many technical people, the TCOD planning seems like a burden, but it’s actually a career saver. If you are able to control the scope at a relatively low level and can leverage a tool - such as Richard’s framework - you have a higher chance of being successful.
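
Quick arithmetic check (mine, not Richard’s): 1.26 × 1.26 × 1.26 ≈ 2.0, so a 26 percent annual growth rate is the same thing as doubling every three years – Richard’s number lines up with that doubling pattern almost exactly.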

3. Comparison Shop: Executives want to know the total cost of carrying out a large project, whether it is on a data warehouse or Hadoop. Having the ability to compare overall costs between the two systems is important to the internal success of the project and to the success of future projects being evaluated as well. Before you can compare anything, it is important to identify a real workload that your business and the executive team can consider funding.  A real workload focuses the comparisons, as opposed to generalizations and guesses.  At some point a big data platform selection will generate two analyses you need to work through: 1) What is this workload costing? and 2) Which platform can technically accomplish the goals more easily? Lastly, in a perfect world, the business users should also be able to showcase the business value of the workload.

4. Align Your Stakeholders: Many believe that 60 percent of the work in a project should be in the planning and 40 percent in the execution. In order to evaluate your big data project appropriately, you must incorporate as many variables as possible.  It’s the surprises, and the stakeholders who weren’t aligned, that cause a lot of the big cost overruns. Knowing your assets and stakeholders is key to succeeding, which is why we recommend using the TCOD framework to get stakeholders to weigh in and achieve alignment on the overall plan. Next, leverage the results as a project plan that you can use toward achieving ROI. By leveraging a framework such as the one Richard discusses during the webinar, with each assumption, each formula, and each of the costs exposed (in Richard’s there are 60 different costs outlined!), you can identify much more easily where the costs differ and – more importantly – why. The TCOD framework can bring stakeholders into the decision-making process, forming a committed team instead of bystanders and skeptics.

5. Focus on Data Management: One of the things that both of our esteemed webinar guests pointed out is the importance of the number of people and applications accessing big data simultaneously. Data is typically the lifeblood of the organization. This includes accessing live information about what is happening now, as well as accurate reporting at the end of the day, month, and quarter. There is a wide spectrum of use cases, each spanning a wide variety of data types. If you haven’t actually built a 100-terabyte database or distributed file system before, be ready for some painful “character building” surprises. Be ready again at 500 TB, at a petabyte, and at 5 petabytes. Big data volumes are like the difference between a short weekend hike and making it past base camp on Mount Everest.  Your data management skills will be tested.

During the webinar, our experts all agreed: there can be a peaceful coexistence between Hadoop and the data warehouse. Each should be applied to the right workloads, and they should share data as often as possible. When a workload is defined, it becomes clear that some data belongs in the data warehouse while other types of data may be more appropriate in Hadoop. Once you have put your data into its enterprise residence, each platform will feed its various applications.

In conclusion, being able to leverage a framework, such as the TCOD one that was discussed during the webinar, really lends itself to having a solid plan when approaching your big data challenges and to ultimately solving them.

Here are some additional resources for further information:

Total Cost of Data Webinar

Big Data—What Does It Really Cost? (white paper)

The Real Cost of Big Data (Spreadsheet)

TCOD presentation slides (PDF)

 

The recent webinar by Richard Winter and Bob Page hammered home key lessons about the cost of workloads running on Hadoop and data warehouses.  Richard runs WinterCorp -- a consulting company that has been implementing huge data warehouses for 20+ years.   Bob Page is Vice President of Products for Hortonworks, and before that he was at Yahoo! and eBay running big data projects.  The webinar explored Richard’s cost model for running various workloads on Hadoop and an enterprise data warehouse (EDW).  Richard built the cost model during a consulting engagement with a marketing executive of a large financial services company who was launching a big data initiative.  She had people coming to her saying “you should do it in Hadoop” and others saying “you should do it in the data warehouse.”  Richard’s cost model helped her settle some debates.

The Total Cost of Data analysis results are the basis for the webinar.  What separates Richard’s cost framework from most others is that it includes more than just upfront system costs.  The TCOD cost model also includes five years of programmer labor, data scientist labor, end user labor, maintenance upgrades, plus power and cooling.  Richard said there are 60 cost metrics in the model.  He recommends companies download the TCOD spreadsheet and insert actual local costs, since system and labor costs differ by city and country.

For the Hadoop data lake workload (aka. data refinery), labor costs were fairly close between Hadoop and the data warehouse while system costs favored Hadoop.  In the case of the data warehouse workload, the data warehouse system cost was high (remember the power and cooling?) while the Hadoop labor cost structure skyrocketed.  Long story short, Hadoop as a data lake is lower cost than a data warehouse; and the data warehouse is lower cost for complex queries and analytics.

There was general agreement that Hadoop is a cost effective platform for ETL work – the staging of raw data and transforming it into refined value.   But when asked “should we offload ELT/ETL to Hadoop?” Bob Page said:

“I think it’s going to be data dependent.  It also depends on what the skills are in the organization.  I experienced it myself when I was running big data platforms.  If there is a successful implementation on the EDW today, there may be a couple reasons why it makes sense to keep it there.  One reason is there may be years and years of business logic encoded, debugged, and vetted.  Moving that to another platform with its inherent differences, you might ask “what’s the value of doing that?” It may take a couple years to get that right and in the end all you have done is migrate to another platform.  I would prefer to invest those resources in adding additional value to the organization rather than moving sideways to another platform.”

 


When the data warehouse workload was costed out, Hadoop’s so-called $1,000 per terabyte turned out to be an insignificant part of the total.  However, Hadoop’s cost skyrockets because thousands of queries must be manually coded by high-priced Hadoop programmers and moderately priced Java programmers over five years.  The OPEX side of the pie chart was huge when the data warehouse workload was applied to Hadoop.

Richard explained:

“The total cost of queries is much lower on the EDW than on Hadoop. SQL is a declarative language – you only have to tell it what you want.  In Hadoop you use a procedural language: you have to tell the system how to find the data, how to bring it together, and what manipulations are needed to deliver the results.  With the data warehouse, there is a sophisticated query optimizer that figures all that out automatically for you.  The cost of developing the query on the data warehouse is lower because of the automation provided.”

 

Given the huge costs for Hadoop carrying a data warehouse workload, I asked Bob if he agreed with Richard’s assessment. “Does it pass the sniff test?” I asked. Bob Page replied:

“We don’t see anybody today trying to build an EDW with Hadoop. This is a capability issue not a cost issue. Hadoop is not a data warehouse. Hadoop is not a database. Comparing these two for an EDW workload is comparing apples to oranges. I don’t know anybody who would try to build an EDW in Hadoop. There are many elements of the EDW on the technical side that are well refined and have been for 25 years. Things like workload management, the way concurrency works, and the way security works -- there are many different aspects of a modern EDW that you are not going to see in Hadoop today. I would not see these two as equivalent. So –no– it doesn’t pass the sniff test.”

Bob’s point – in my opinion – is that the Hadoop-as-EDW cost model is invalid since Hadoop is not designed to handle EDW workloads.   Richard said he “gave Hadoop the benefit of the doubt,” but I suspect the comparison was baked into his consulting contract with the marketing executive mentioned earlier.  Ultimately, Richard and Bob agree from different angles.

There are a lot of press articles and zealots on the web who will argue with these results.  But Richard and Bob have hands-on credentials far beyond most people’s.  They have worked with dozens of big data implementations from 500 TB to tens of petabytes.  Please spend the time to listen to their webinar for an unbiased view.  The biased view – me – didn’t say all that much during the webinar.

Many CFOs and CMOs are grappling with the question “When do we use Hadoop and when should we use the data warehouse?”  Pass them the webinar link, call Richard, or call Bob.

 

Total Cost of Data Webinar

Big Data—What Does It Really Cost? (white paper)

The Real Cost of Big Data (Spreadsheet)

TCOD presentation slides (PDF)

Big Apple Hosts the Final Big Analytics Roadshow of the Year

Posted on: November 26th, 2013 by Teradata Aster

 

Speaking of ending things on a high note, New York City on December 6th will play host to the final event in the Big Analytics 2013 Roadshow series. Big Analytics 2013 New York is taking place at the Sheraton New York Hotel and Towers in the heart of Midtown on bustling 7th Avenue.

Reflecting on the illustrious journey of the Big Analytics 2013 Roadshow – kicking off in San Francisco and traveling through major international destinations including Atlanta, Dallas, Beijing, Tokyo, and London before culminating in the Big Apple – it truly encapsulated today’s appetite for collecting, processing, understanding, and analyzing data.

[Photo: the Big Analytics Roadshow 2013 stops in Atlanta]

Drawing business & technical audiences across the globe, the roadshow afforded the attendees an opportunity to learn more about the convergence of technologies and methods like data science, digital marketing, data warehousing, Hadoop, and discovery platforms. Going beyond the “big data” hype, the event offered learning opportunities on how technologies and ideas combine to drive real business innovation. Our unyielding focus on results from data is truly what made the events so successful.

Continuing the rich lineage of delivering quality big data information, the New York event promises to pack a tremendous amount of big data learning and education. The keynotes for the event include such industry luminaries as Dan Vesset, Program VP of Business Analytics at IDC; Tasso Argyros, Senior VP of Big Data at Teradata; and Peter Lee, Senior VP of TIBCO Software.

[Photo: the Teradata Aster team at the Dallas Big Analytics Roadshow]


The keynotes will be followed by three tracks: Big Data Architecture, Data Science & Discovery, and Data-Driven Marketing. Each of these tracks will feature industry luminaries like Richard Winter of WinterCorp, John O’Brien of Radiant Advisors, and John Lovett of Web Analytics Demystified. They will be joined by vendor presentations from Shaun Connolly of Hortonworks, Todd Talkington of Tableau, and Brian Dirking of Alteryx.

As with every Big Analytics event, it presents an exciting opportunity to hear firsthand from leading organizations like Comcast, Gilt Groupe, and Meredith Corporation on how they are using big data analytics and discovery to deliver tremendous business value.

In summary, the event promises to be nothing less than the Oscars of Big Data and will bring together the who’s who of the Big Data industry. So, mark your calendars, pack your bags and get ready to attend the biggest Big Data event of the year.