
MongoDB and Teradata QueryGrid – Even Better Together

Posted on: June 19th, 2014 by Dan Graham

 

It wasn’t so long ago that NoSQL products were considered competitors to relational databases (RDBMSs). Well, for some workloads they still are. But Teradata is an analytic RDBMS, which is quite different from, and complementary to, MongoDB. Hence, we are teaming up for the benefit of mutual customers.

The collaboration between MongoDB and Teradata represents a virtuous cycle: a symbiotic exchange of value. The cycle starts when data is exported from MongoDB to the Teradata Data Warehouse, where it is analyzed and enriched, then sent back to MongoDB to be exploited further. Let me give an example.

An eCommerce retailer builds a website to sell clothing, toys, etc. They use MongoDB because of its flexibility in managing constantly changing web pages, product offers, and marketing campaigns. This front-office application exports JSON data to the back-office data warehouse throughout the business day. Automated processes analyze and enrich the data, calculating next best offers, buyer propensities, consumer profitability scores, inventory depletions, and dynamic discounts, and flagging potential fraud. Managers and data scientists also sift through sales results looking for trends and opportunities using dashboards, predictive analytics, visualization, and OLAP. Throughout the day, the data warehouse sends analysis results back to MongoDB, where they are used to enhance the visitor experience and improve sales. Then we do it again. It’s a cycle with positive benefits for the front and back office.

Teradata Data Warehouses have been used in this scenario many times with telecommunications companies, banks, retailers, and others. But several things are different when working with MongoDB. First, MongoDB uses JSON data. This is crucial for frequently changing data formats where new fields are added on a daily basis. Historically, RDBMSs did not support semi-structured JSON data. Furthermore, the process of changing a database schema to support frequently changing JSON formats took weeks to get through governance committees.

Nowadays, the Teradata Data Warehouse ingests native JSON and accesses it through simple SQL commands. Furthermore, once a field in a table is defined as JSON, frequently changing JSON structures flow right into the data warehouse without spending weeks in governance committees. Cool! This is a big, necessary step forward for the data warehouse. Teradata Data Warehouses can ingest and analyze JSON data easily using any BI or ETL tool our customers prefer.
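
To make that concrete, here is a minimal sketch of what a JSON column and a query against it can look like. It assumes the Teradata 15-style JSON data type and its JSONExtractValue method, driven from Python through a generic DB-API connection; the table, field names, and `conn` object are illustrative, not from the announcement.

```python
# Sketch: a JSON column declared and queried with plain SQL.
# Assumes a Teradata 15-style JSON data type and its JSONExtractValue
# method; `conn` is any open DB-API connection. Table and field names
# are illustrative.

DDL = """
CREATE TABLE web_events (
    event_id INTEGER,
    event_ts TIMESTAMP(0),
    payload  JSON(32000)  -- semi-structured documents land here as-is
)
"""

# Fields added to the JSON documents tomorrow need no schema change;
# queries simply reach into the document with a JSONPath expression.
QUERY = """
SELECT event_id,
       payload.JSONExtractValue('$.offer.id')        AS offer_id,
       payload.JSONExtractValue('$.visitor.segment') AS segment
FROM   web_events
"""

def latest_offers(conn):
    cur = conn.cursor()
    cur.execute(QUERY)
    return cur.fetchall()
```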

Another difference is that MongoDB is a scale-out system, growing to tens or hundreds of server nodes in a cluster. Hmmm. Teradata systems are also scale-out systems. So how would you exchange data between Teradata Data Warehouse server nodes and MongoDB server nodes? The simple answer is to export JSON to flat files and import them into the other system. Mutual customers are already doing this. Can we do better than import/export? Can we add an interactive, dynamic data exchange? Yes, and this is the near-term goal of our partnership: connecting Teradata QueryGrid to MongoDB clusters.
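
For reference, the flat-file exchange customers use today is easy to script. Here is a minimal sketch using the pymongo driver to dump a collection as newline-delimited JSON for a warehouse load utility to pick up; the connection string, database, collection, and file path are all illustrative.

```python
# Sketch: dump a MongoDB collection to newline-delimited JSON so a
# warehouse bulk-load utility can ingest it. Requires the pymongo
# driver; the URI, names, and path below are illustrative.
from pymongo import MongoClient
from bson.json_util import dumps  # serializes ObjectId, dates, etc.

def export_collection(uri, db_name, coll_name, out_path):
    client = MongoClient(uri)
    collection = client[db_name][coll_name]
    with open(out_path, "w") as out:
        for doc in collection.find():
            out.write(dumps(doc))
            out.write("\n")

export_collection("mongodb://localhost:27017",
                  "storefront", "orders", "/tmp/orders.json")
```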

Teradata QueryGrid and MongoDB

Teradata QueryGrid is a capability in the data warehouse that allows a business user to issue requests via popular business intelligence tools such as SAS®, Tableau®, or MicroStrategy®. The user issues a query, which runs inside the Teradata Data Warehouse. The query reaches across the network to the MongoDB cluster. JSON data is brought back, joined to relational tables, sorted, summarized, analyzed, and displayed to the business user. All of this is done exceptionally fast and is completely invisible to the business user. It’s easy! We like easy.

QueryGrid can also be bi-directional, putting the results of an analysis back into the MongoDB server nodes. The two companies are working on hooking up Teradata QueryGrid right now and we expect to have the solution early in 2015.

The business benefit of connecting Teradata QueryGrid to MongoDB is that data can be exchanged in near real time. That is, a business user can run a query that exchanges data with MongoDB in seconds (or a few minutes if the data volume is huge). This means new promotions and pricing can be deployed from the data warehouse to MongoDB with a few mouse clicks. It means marketing people can analyze consumer behavior on the retail website throughout the day, making adjustments to increase sales minutes later. And of course, applications in mobile, sensor networks, banking, telecommunications, healthcare, and other domains will get value from this partnership too.

So why does the leading NoSQL vendor partner with the best-in-class analytic RDBMS? Because they are highly complementary solutions that together provide a virtuous cycle of value to each other. MongoDB and Teradata are already working well together at some sites. And soon we will do even better.

Come visit our booth at MongoDB World and attend the session “The Top 5 Things to Know About Integrating MongoDB into Your Data Warehouse” (Riverside Suite, 3:10 p.m., June 24). You can read more about the partnership between Teradata and MongoDB in the news release issued earlier today. Also, check out the MongoDB blog.

PS: The MongoDB people have been outstanding to work with on all levels. Kudos to Edouard, Max, Sandeep, Rebecca, and others. Great people!

Taking Charge of Your Data Lake Destiny

Posted on: April 30th, 2014 by Cesar Rojas

 

One of the most interesting areas of my job is having the opportunity to take an active role in helping shape the future of big data analytics. A significant part of this is the maturation of the open source offerings available to customers and how they can help address today’s analytic conundrums. Customers are constantly looking for new and effective ways to organize their data, and they want to build systems that will empower them to be successful across their organizations. But with the proliferation of data and the rise of new data types and analytical models, solving this challenge is becoming increasingly complex.

One of the solutions that has become popular is the data lake. The idea emerged when users were creating new types of data that needed to be captured and exploited across the enterprise. The concept is also tied quite closely to Apache Hadoop and its ecosystem of open source projects, so, since it brings together two of my main focus areas (big data analytics and Hadoop), it is an area to which I pay close attention. Data lakes are designed to tackle some of the emerging big data challenges by offering a new way to organize and build the next generation of systems. They provide a cost-effective and technologically refined way to approach and solve big data challenges. And while data lakes are an important component of the logical data warehouse -- because they are designed to give users choices for better managing and utilizing data within their analytical ecosystem -- many users are also finding that the data lake is a natural evolution of their current Apache Hadoop ecosystem and existing data architecture.

Where do we begin? Quite simply, several questions need to be answered before you start down this path. For instance, it’s important to understand how the data lake relates to your existing enterprise data warehouse and how the two work together. Quite possibly the most important question is: “What best practices should be leveraged to ensure the resulting strategy drives business value?”

A recent white paper written by CITO Research and sponsored by Teradata and Hortonworks takes a close look at the data lake and provides answers to all of the above questions, and then some. Without giving away too much of the detail, I thought I would capture a few of the points that impressed me most.

The data lake has come a long way since its initial entry onto the big data scene. Its first iteration included several limitations, making it daunting to general users. The original data lakes were batch-oriented, offering very limited ability for users to interact with the data, and expertise with MapReduce and other scripting and query tools was absolutely necessary for success. Those factors, among others, limited its adoption. Today, however, the landscape is changing. With the arrival of Hadoop 2, and more specifically release 2.1 of the Hortonworks Data Platform, data lakes are evolving. New Hadoop projects (most notably YARN) brought better resource management and application multi-tenancy, allowing multiple workloads on the same cluster and enabling users from different business units to effectively refine, explore, and enrich data. Today, enterprise Hadoop is a full-fledged data lake, with new capabilities being added all the time.

As the capabilities of the data lake evolved over the last few years, so did the world of big data. Companies everywhere started creating data lakes to complement the capabilities of their data warehouses, but they must now also tackle creating a logical data warehouse in which the data lake and the enterprise data warehouse are each maximized individually -- and yet support each other in the best way possible.

The enterprise data warehouse plays a critical role in solving big data challenges, and together with the data lake it can deliver real business value. The enterprise data warehouse is a highly engineered, sophisticated system that provides a single version of the truth that can be used over and over again. And, like a data lake, it supports batch workloads. Unlike a data lake, the enterprise data warehouse also supports simultaneous use by thousands of concurrent users performing reporting and analytic tasks.

There are several impressive uses for a data lake, and several beneficial outcomes can result. It is well worth learning more about data lakes and how they can help you store and process data at low cost. You can also learn how to create a distributed form of analytics, and how the data lake and the enterprise data warehouse have started to work together as a hybrid, unified system that empowers users to ask questions that can be answered by more data and more analytics with less effort. To start learning about these initiatives, download our whitepaper here.

By Cesar Rojas - bio link 

Change and “Ah-Ha Moments”

Posted on: March 31st, 2014 by Ray Wilson

 

This is the first in a series of articles discussing the inherent nature of change and some useful suggestions for helping operationalize those “ah-ha moments.”

Nobody has ever said that change is easy. It is a journey full of obstacles. But those obstacles are not impenetrable; with the right planning and communication, many of them can be cleared away, leaving a more defined path for change to follow.

So why do we so often see failures that could have been avoided if obvious changes had been made before the problem occurred? The data was analyzed, and yet nobody acted on the insights. Why does the organization fail to do what I call operationalizing the ah-ha moment? Was it a conscious decision?

From the outside looking in, it is easy to criticize organizations for not implementing obvious changes. But from the inside, there are many issues that cripple change efforts, and it usually boils down to time, people, process, technology, or financial challenges.

Companies make significant investments in business intelligence capabilities because they realize that hidden within the vast amounts of information they generate daily are jewels that can provide valuable insights for the entire organization. For example, with today’s analytic platforms, business analysts in the marketing department have access to sophisticated tools that can mine information and uncover the reasons for a high rate of churn in their customer base. They might do this by analyzing all interactions and conversations taking place across the enterprise and the channels where customers engage the company. Using this data, analysts begin to see paths and patterns emerging from these interactions that ultimately lead to customer churn.

These analysts have just discovered the leading causes of churn within their organization and are at the apex of the ah-ha moment. They now have the insights to stop the mass exodus of valuable customers and positively impact the bottom line. It seems obvious these insights would be acted upon and operationalized immediately, but that may not be the case. Perhaps the newly discovered patterns leading to customer churn touch so many internal systems, processes, and organizations that getting organizational buy-in to the necessary changes is mired in an endless series of internal meetings.

So what can be done given these realities? Here’s a quick list of tips that will help you enable change in your organization:

  • Someone needs to own the change and then lead rather than letting change lead him or her.
  • Make sure the reasons for change are well documented including measurable impacts and benefits for the organization.
  • When building a change management plan, identify the obstacles in the organization and make sure to build a mitigation plan for each.
  • Communicate the needed changes through several channels.
  • Be clear when communicating change. Rumors can quickly derail or stall well thought out and planned change efforts.
  • When implementing changes make sure that the change is ready to be implemented and is fully tested.
  • Communicate the impact of the changes that have been deployed.  
  • Have enthusiastic people on the team and train them to be agents of change.
  • Establish credibility by building a proven track record that will give management the confidence that the team has the skills, creativity and discipline to implement these complex changes. 

Once changes are implemented, monitor them closely and anticipate that some will require further refinement. Remember that operationalizing the ah-ha moment is a journey -- a journey that can bring many valuable and rewarding benefits along the way.

So, what’s your experience with operationalizing your "ah-ha moment"?

 

In years past, Strata has celebrated the power of raw technology, so it was interesting to note how much the keynotes on Wednesday focused on applications, models, and how to learn and change rather than on speeds and feeds.

After attending the keynotes and some fascinating sessions, it seems clear that the blinders are off. Big data and data science have been proven in practice by many innovators and early adopters. The value of new forms of data and methods of analysis is so well established that there’s no need for exaggerated claims. Hadoop can do so many cool things that it doesn’t have to pretend to do everything, now or in the future. Indeed, the pattern in place at Facebook, Netflix, the Obama campaign, and many other organizations with muscular data science and engineering departments is that MPP SQL and Hadoop sit side by side, each doing what it does best.

In his excellent session, Kurt Brown, Director, Data Platform at Netflix, recalled someone explaining that his company was discarding its data warehouse and putting everything on Hive. Brown responded, “Why would you want to do that?” What was obvious to Brown, and what he explained at length, is that the most important thing any company can do is assemble technologies and methods that serve its business needs. Brown demonstrated the logic of creating a broad portfolio that serves many different purposes.

Real Value for Real People
The keynotes almost all celebrated applications and models. Vendors didn’t talk about raw power, but about specific use cases and ease of use. Farrah Bostic, a marketing and product design consultant, recommended ways to challenge assumptions and create real customer intimacy. This was a key theme: use the data to understand a person on their terms, not yours. Bostic says you will be more successful if you focus on creating value for the real people who are your customers instead of extracting value from some stilted and limited model of a consumer. A skateboarding expert and a sports journalist each explained models and practices for improving performance. This is a long way from the days when a keynote would show a computer chewing through a trillion records.

Geoffrey Moore, the technology and business philosopher, was in true provocative form. He asserted that big data and data science are well on their way to crossing the chasm because so many upstarts pose existential threats to established businesses. This pressure will force big data to cross the chasm and achieve mass adoption. His money quote: “Without big data analytics, companies are blind and deaf, wandering out onto the Web like deer on the freeway.”

An excellent quote to be sure, but it goes too far. Moore would have been more accurate, and less sensational, if he had said “Without analytics,” not “Without big data analytics.” The reason MPP SQL and Hadoop make such a perfect pair is that more than one type of data and method of analysis is needed. Every business needs all the relevant data it can get to understand the people it does business with.

The Differentiator: A Culture of Analytics
The challenge I see companies facing lies in creating a culture of analytics. Tom Davenport has been a leader in promoting analytics as a means to competitive advantage. In his keynote at Strata Rx in September 2013, Davenport stressed the importance of integration.

In his session at Strata this year, Bill Franks, Chief Analytics Officer at Teradata, put it quite simply, "Big data must be an extension of an existing analytics strategy. It is an illusion that big data can make you an analytics company."

When people return from Strata and roll up their sleeves to get to work, I suspect that many will realize it’s vital to make use of all the data in every way possible. But one person can only do so much. For data to have the biggest impact, people must want to use it. Implementing any type of analytics provides supply. Leadership and culture create demand. Companies like Capital One and Netflix don’t do anything without looking at the data.

I wish there were a shortcut to creating a culture of analytics, but there isn’t, and that’s why it’s such a differentiator. Davenport’s writings are probably the best guide, but every company must figure this out based on its unique situation.

Supporting a Culture of Analytics
If you are a CEO, your job is to create a culture of analytics so that you don’t end up like Geoffrey Moore’s deer on the freeway. But if you have Kurt Brown’s job, you must create a way to use all the data you have, to use the sweet spot of each technology to best effect, and to provide data and analytics to everyone who wants them.

At a company like Netflix or Facebook, creating such a data supply chain is a matter of solving many unique problems connected with scale and advanced analytics. But for most companies, common patterns can combine all the modern capabilities into a coherent whole.

I’ve been spending a lot of time with the thought leaders at Teradata lately and closely studying their Unified Data Architecture. Anyone who is seeking to create a comprehensive data and analytics supply chain of the sort in use at leading companies like Netflix should be able to find inspiration in the UDA, as described in a white paper called “Optimizing the Business Value of All Your Enterprise Data.”

The paper does excellent work in creating a framework for data processing and analytics that unifies all the capabilities by describing four use cases: the file system, batch processing, data discovery, and the enterprise data warehouse. Each of these use cases focuses on extracting value from different types of data and serving different types of users. The paper proposes a framework for understanding how each use case creates data with different business value density. The highest volume interaction takes place with data of the highest business value density. For most companies, this is the enterprise data warehouse, which contains a detailed model of all business operations that is used by hundreds or thousands of people. The data discovery platform is used to explore new questions and extend that model. Batch processing and processing of data in a file system extract valuable signals that can be used for discovery and in the model of the business.

While this structure doesn’t map exactly to that of Netflix or Facebook, for most businesses, it supports the most important food groups of data and analytics and shows how they work together.

The refreshing part of Strata this year is that thorny problems of culture and context are starting to take center stage. While Strata will always be chock-full of speeds and feeds, it is even more interesting now that new questions are driving the agenda.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

Data-Driven Business and the Need for Speed

Posted on: February 14th, 2014 by Guest Blogger

 

What stands in the way of effective use of data? Let’s face it: it’s people as much as technology. Silos of data and expertise cause friction in many companies. To create a winning strategy for data-driven business, you need fast, effective collaboration between teams, something much more like a pit crew than an org chart. To gain speed, you must find a way to break the silos between groups.

In most organizations, multiple teams are involved with data. Data may be used directly, passed on to another stage as an input, or both. Each use case involves end users, user interface designers, data scientists, business analysts, programmers, and operations people.

For BI to work optimally it must break through at least three silos:

  • Silo 1: BI architects
  • Silo 2: Data analysts
  • Silo 3: Business users

First, those focusing on data prep, programming and advanced analysis must not only work together but must be directed by the needs of the business users, ideally with a tight feedback loop. You don’t just want to get something in the hands of employees; you want to get the right thing in front of each of them, and that means communicating and understanding their needs.

Other types of silos exist between lines of business. For an insight to have maximum impact, it must find its way to those who need it most. Everyone must be on the lookout for signals that can be used by other lines of business and pass them along.

Silos Smashed: An End-to-End View
There’s no way to achieve an end-to-end view without breaking down silos. Usage statistics can show what types of analysis are popular, which increases transparency. Cross-functional teams that include all stakeholders (BI architects, data analysts, and business users) also help in breaking down barriers.

With the silos smashed, you can acquire value from data faster. Speed is key and breaking down silos can do for BI what pit crews do for racing. Data-driven business moves fast, and we must find more efficient ways of working together, cutting time off each iteration as we move toward real-time delivery of the right data to the right person at the right time.

For example, we should be able to see how changing a price affects customers and impacts the bottom line. And we should be able to do that on a store manager’s mobile device in real-time, not waiting until the back office runs the report and comes down from Mount Olympus with the answer. Immediacy of data for decision-making--that’s what drives competitive advantage.

While I'm at Strata, I'll be looking for new ideas about how to break the silos and speed time to value from data. I’m interested to hear your thoughts as well.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

To learn more:

Optimize the Value of All the Data white paper
The Intelligent Enterprise infographic

Open Your Mind to all the Data

Posted on: February 14th, 2014 by Guest Blogger

 

In technology discussions, we often forget who the big winners are. When you look at almost any technology market, if a vendor makes a certain amount, the systems integrator may make 4 or 5 times that amount selling services to configure and adapt the solution. Of course, the biggest winner is the company using the technology, with a return of 10X, 20X, or even 100X. This year, I’m attending Strata and trying to understand how to hunt down and capture that 100X return.

I’m not alone in this quest. In the past Strata was mostly a technology conference, focused on vendor or systems integrator return. But this year, it is clear that the organizers realized they must find a way to focus on the 100X return, as evidenced by the Data-driven Business track. If Strata becomes an event where people can identify and celebrate the pathway to the 100X return, attendance will skyrocket.

Most of the time the 100X return comes from using data that applies only to a specific business. If you realize those kinds of returns, are you really going to come to Strata to talk about it? I don’t think so. So we won’t find sessions that provide a complete answer for your business. You will only get parts. The challenge when attending a conference like Strata is how to find all the parts, put them together, and come home with ideas for getting that 100X return.

My answer is to focus on questions, then on data, and then on technology that can find answers to the questions. There are many ways to run this playbook, but here’s one way to make it work.

Questions Achieve Focus
The problem that anyone attending Strata or any large conference faces is the huge array of vendors and sessions. One way to guide exploration is through curiosity. This maintains enthusiasm during the conference but may not give you a 100X idea. Another way is to begin with a predetermined problem in mind. This is a good approach, but it may cut off avenues that lead to a 100X result.

Remember: 100X results are usually found through experimentation. Everything with an obvious huge return is usually already being done. You have to find a way to be open to new ideas that are focused on business value. Here’s one way I’ve thought of, called the Question Game.

The Question Game is a method for aligning technology with the needs of a business. It is also a great way to organize a trip to a conference like Strata, where the ratio of information to marketing spin is high.

I came up with the Question Game while reading Walter Isaacson’s biography of Steve Jobs. Two things struck me. First was the way that Jobs and Tim Cook cared a lot about focus. Jobs held an annual offsite attended by about 100 of Apple’s top people. At the end of the session, Jobs would write up the 10 most important initiatives and then cross out 7 so the team could focus on executing just 3. Both Jobs and Cook were proud of saying no to ideas that detracted from their focus.

Here’s how Tim Cook describes it: “We are the most focused company that I know of, or have read of, or have any knowledge of. We say no to good ideas every day. We say no to great ideas in order to keep the amount of things we focus on very small in number, so that we can put enormous energy behind the ones we do choose, so that we can deliver the best products in the world.”

Because I spend most of my time figuring out how technology leaders can make magic in business, I was eager to find a way to empower people to say no. At the same time, I wanted room for invention and innovation. Is it possible to explore all the technology and data out there in an efficient way, focused on business needs? Yes, using the Question Game. Here’s how it works:

  • The CIO, CTO, or whoever is leading innovation surveys each business leader, asking for questions they would like to have answered and the business value of answering them.
  • Questions are then ranked by value, whether monetary or otherwise.

The Question Game provides a clear way for the business to express its desires. With these questions in mind, Strata becomes a hunting ground far more likely to yield a path to 100X ideas. To prepare for this hunt, list all the questions and look at the agenda to find relevant sessions.

It’s the Data, All the Data
With your questions in hand, keep reminding yourself that all the technology in the world won’t answer the questions on your list. Usually, your most valuable questions will be answered by acquiring or analyzing data. The most important thing you can do is look for sources of data to answer the high value questions.

The first place to look for data is inside your company. In most companies, it is not easy to find all the available data and even harder to find additional data that could be available with a modest effort.

Too often the search for data stops inside the company. As I pointed out in “Do You Suffer from the Data Not Invented Here Syndrome?” many companies have blinders to external data sources, which can be an excellent way to shed light on high value questions. Someone should be looking for external data as well.

Remember, it is vital to keep an open mind. The important thing is to find data that can answer high value questions, not to find any particular type of data. Strata is an excellent place to do this. Some sessions shed light on particular types of data, but I’ve found that working the room, asking people what kinds of data they use, and showing them your questions can lead to great ideas.

Finding Relevant Technology
Once you have an idea about the important questions and the relevant data, you will be empowered to focus. When you tour the vendors, you can say no with confidence when a technology doesn’t have a hope of processing the relevant data to answer a high value question. Of course, it is impossible to tell from a session description or a vendor’s website if they will be relevant to a specific question. But by having the questions in mind and showing them to people you talk to, you will greatly speed the path to finding what you need. When others see your questions, they will have suggestions you didn’t expect.

Will this method guarantee you a 100X solution every time you attend Strata? I wouldn’t make that claim. But by following this plan, you have a much better chance for a victory.

While I'm at Strata, I'll be playing the Question Game so I can quickly and effectively learn a huge amount about the fastest path to a data-driven business.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

To Learn More:

Optimize the Business Value of All Your Enterprise Data (white paper)
Bring Dark Data into the Light (infographic)
Benefit from Any and All Data (article)
To Succeed with Big Data, Begin with the Decision in Mind (article)

 

In the Star Trek movies, “the Borg” refers to an alien race that conquers all planets, absorbing the people, technology, and resources into the Borg collective. Even Captain Picard becomes a Borg and chants “We are the Borg. You will be assimilated. Resistance is futile.”

It strikes me that the relational database has behaved similarly since its birth. Over the last thirty years, Teradata and other RDBMS vendors have innovated and modernized, constantly revitalizing what it means to be an RDBMS. But some innovations come from start-up companies that are later assimilated into the RDBMS. And some innovations are reactions to competition. Regardless, many innovations eventually end up in the code base of multiple RDBMS vendors’ products -- with proper respect to patents, of course. Here are some examples of cool technologies assimilated into the Teradata Database:

• MOLAP cubes storm the market in the late 1990s, with Essbase setting the pace and Cognos inventing desktop cubes. MicroStrategy and Teradata team up to push ROLAP SQL down into the database for parallel speed. Hyperion Essbase and Teradata also do hybrid OLAP integration together. Essbase gets acquired, MOLAP cubes fall out of fashion, and in-database ROLAP goes on to provide the best of both worlds as CPUs get faster.

• Early in the 2000s, a startup called Sunopsis shows the distinct advantage of running ELT transformations in-database to get parallel performance with Teradata. ELT takes off in the industry like a rocket. Teradata Labs also collaborates with Informatica to push PowerCenter transformation logic down into SQL for amazing extract, load, and transform speed. Sunopsis gets acquired. More ETL vendors adopt ELT techniques. Happy DBAs and operations managers meet their nightly batch performance goals. More startups disappear.

• XML and XQuery become the rage in the press -- until nearly every RDBMS adds a data type for XML, plus shred and unshred operators. XML-only database startups are marginalized.

• The uptick of predictive analytics in the market drives collaboration between Teradata and SAS back in 2007. SAS procs are pushed down into the database to run massively parallel, opening up tremendous performance benefits for SAS users. Many RDBMS vendors copy this technique; SAS is in the limelight, and eventually even Hadoop programmers want to run SAS in parallel. Later we see R, Fuzzy Logix, and others run in-database too. Sounds like the proverbial win-win to me.

• In-memory technology from QlikView and TIBCO Spotfire excites the market with order-of-magnitude performance gains. Several RDBMS vendors then adopt in-memory concepts. But in-memory has limitations on memory size and cost vis-à-vis terabytes of data. Consequently, Teradata introduces Teradata Intelligent Memory, which automatically caches hot data in memory while managing many terabytes of hot and cold data on disk. The hottest two to three percent of the data, identified by data temperature (that is, popularity with users), delivers superfast response time. Cool! Or is it hot?

• After reading the Google research paper on MapReduce, a startup called Aster Data invents SQL-MapReduce (SQL-MR) to add flexible processing to a flexible database engine. This cool innovation leads Teradata to acquire Aster Data. Within a year, Aster strikes a nerve across the industry -- MapReduce is in-database! This month, Aster earns numerous #1 scores in Ovum’s “Decision Matrix: Selecting an Analytic Database 2013-14” (January 2014). The race is on for MapReduce in-database!

• The NoSQL community grabs headlines with their unique designs and reliance on JSON data and key-value pairs. MongoDB is hot using JSON data, while Couchbase and Cassandra leverage key-value stores. Teradata promptly adds JSON (semi-structured data) to the database and goes the extra mile to put JSONPath syntax into SQL. Teradata also adds a name-value-pair SQL operator (NVP) to extract JSON or key-value data from weblogs (the sketch below shows the idea). Schema-on-read technology gets assimilated into the Teradata Database. Java programmers are pleased. Customers make plans. More wins.
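
For illustration only, here is the concept behind name-value-pair extraction, sketched in Python. This is not Teradata’s NVP operator or its SQL signature -- just the idea of pulling a named value out of a weblog-style string.

```python
# Concept sketch of name-value-pair extraction in Python -- what an
# NVP-style operator does, not Teradata's SQL syntax or signature.

def nvp(text, name, pair_delim="&", value_delim="="):
    """Return the value for `name` in a string like 'a=1&b=2&c=3'."""
    for pair in text.split(pair_delim):
        key, _, value = pair.partition(value_delim)
        if key == name:
            return value
    return None

weblog = "userid=4711&page=checkout&cart=3"
print(nvp(weblog, "page"))  # -> checkout
```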

--------------------------------------------------------------------------------------------------------

“One trend to watch going forward, in addition to the rise of multi-model NoSQL databases, is the integration of NoSQL concepts into relational databases. One of the methods used in the past by relational database vendors to restrict the adoption of new databases to handle new data formats has been to embrace those formats within the relational database. Two prime examples would be support for XML and object-oriented programming.”
- Matt Aslett, The 451 Group, Next-Generation Operational Databases 2012-2016, Sep 17, 2013

--------------------------------------------------------------------------------------------------------

I’ve had conversations with other industry analysts and they’ve confirmed Matt’s opinion: RDBMS vendors will respond to market trends, innovations, and competitive threats by integrating those technologies into their offering. Unlike the Borg, a lot of these assimilations by RDBMS are friendly collaborations (MicroStrategy, Informatica, SAS, Fuzzy Logix, Revolution R, etc.). Others are just the recognition of new data types that need to be in the database (JSON, XML, BLOBs, geospatial, etc.).

Why is it good to have all these innovations inside the major RDBMSs? Everyone is having fun right now with their science projects because hype is very high for this startup or that startup or this shiny new thing. But when it comes time to deploy production analytic applications to hundreds or thousands of users, all the “ities” suddenly become critical -- “ities” that the new kids don’t have and the RDBMS does: reliability, recoverability, security, and availability. Companies like Google can bury shiny new version 1.oh-my-god software under an army of brilliant computer scientists. But Main Street and Wall Street companies cannot.

More important, many people are doing new multi-structured data projects in isolation -- weblog analysis, sensor data, graph analysis, or social text analysis. Soon enough they discover that the highest value comes from combining that data with all the rest of the data the organization has collected on customers, inventories, campaigns, financials, and so on. Great, I found a new segment of buyer preferences -- what does that mean for campaigns, sales, and inventory? Integrating new big data into an RDBMS is a huge win going forward, much better than keeping the different data sets isolated in the basement.

Like this year’s new BMW or Lexus, RDBMSs modernize; they define modern. But relational database systems don’t grow old; they don’t rust or wear out. RDBMSs evolve to stay current and constantly introduce new technology.

We are the RDBMS! Technology will be assimilated. Resistance is futile.

Evaluating and Planning for the Real Costs of Big Data

Posted on: January 16th, 2014 by Dan Graham

 

In a blog I posted in early December, I talked about the total cost of big data. That post, and today’s follow-up, stem from a webinar I moderated between Richard Winter, president of WinterCorp, a consulting firm specializing in massive databases, and Bob Page, VP of Products at Hortonworks. During the webinar we discussed how to calibrate and calculate the total cost of data and walked through important lessons about the costs of running workloads on various platforms, including Hadoop. If you haven’t listened to the webinar yet, I recommend you do so.

From the discussion we had during that session, and from conversations I have had since, I wanted to address some of the key takeaways about how to be successful when tackling such a large challenge within your organization. Here are a few key points to consider:

1. Start Small: As Bob Page said, “It’s very easy to dream big and go overboard with these projects, but the key to success is starting small.” Have your first project be a straightforward proof of concept. There will undoubtedly be challenges in your first big data project, but if you start small and build your knowledge and capabilities, your odds of success on larger projects improve. Don’t make your first venture out of the gate an attempt at a gargantuan project or a huge amount of data. When you have some positive results, you will also have the confidence and sanction to build bigger solutions.

2. Address the Entire Scope of Costs: Rather than focusing on upfront purchasing costs only, any total cost of data evaluation must incorporate all possible costs, reflecting an estimate of owning and using data over time for analytic purposes. The framework Richard developed lets you do exactly that: estimate the total cost of a big data initiative. During the webinar, Richard discussed the five components of system costs:

  • the hardware acquisition costs
  • the software acquisition costs
  • what you pay for support
  • what you pay for upgrades
  • and what you pay for environmental/infrastructure costs – power and cooling.

According to Richard, we need to estimate the CAPEX and OPEX over five years. Based on his extensive experience, he also recommends a moderate annual growth assumption of 26 percent in system capacity. In my experience, most data warehouses double in size every three years -- which works out to almost exactly 26 percent a year -- so Richard’s assumption is well grounded. The business goal, coupled with the CAPEX and OPEX thresholds year by year, helps keep the team focused. For many technical people, TCOD planning seems like a burden, but it’s actually a career saver. If you can control the scope at a relatively low level and leverage a tool such as Richard’s framework, you have a higher chance of being successful.
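
A quick back-of-the-envelope check of that growth assumption, in plain Python with an illustrative starting capacity: 26 percent compound growth doubles capacity almost exactly every three years, so a five-year plan must budget for roughly 3.2 times the starting point.

```python
# Back-of-the-envelope: 26% annual growth compounded over five years.
# The starting capacity is a placeholder.
start_tb = 100   # year-0 system capacity in terabytes
rate = 0.26      # annual growth assumption from the TCOD model

for year in range(6):
    print(f"year {year}: {start_tb * (1 + rate) ** year:,.0f} TB")

# year 3 prints ~200 TB: capacity doubles about every three years.
# year 5 prints ~318 TB: plan CAPEX/OPEX for ~3.2x the starting point.
```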

3. Comparison Shop: Executives want to know the total cost of carrying out a large project, whether on a data warehouse or on Hadoop. Being able to compare overall costs between the two systems is important to the internal success of the project, and to the success of future projects being evaluated as well. Before you can compare anything, identify a real workload that your business and the executive team would consider funding. A real workload focuses the comparisons, as opposed to generalizations and guesses. At some point a big data platform selection will generate two analyses you need to work through: 1) what is this workload costing? and 2) which platform can technically accomplish the goals more easily? Lastly, in a perfect world, the business users should also be able to showcase the business value of the workload.

4. Align Your Stakeholders: Many believe that 60 percent of the work in a project should be in the planning and 40 percent in the execution. To evaluate your big data project appropriately, you must incorporate as many variables as possible. It’s the surprises, and the stakeholders who weren’t aligned, that cause a lot of the big cost overruns. Knowing your assets and stakeholders is key to succeeding, which is why we recommend using the TCOD framework to get stakeholders to weigh in and achieve alignment on the overall plan. Next, leverage the results as a project plan you can use toward achieving ROI. With each assumption, each formula, and each of the costs exposed within a framework like the one Richard discusses (his has 60 different costs outlined!), you can identify much more easily where the costs differ and, more importantly, why. The TCOD framework can bring stakeholders into the decision-making process, forming a committed team instead of bystanders and skeptics.

5. Focus on Data Management: Both of our esteemed webinar guests pointed out the importance of the number of people and applications accessing big data simultaneously. Data is typically the lifeblood of the organization. This includes accessing live information about what is happening now, as well as accurate reporting at the end of the day, month, and quarter. There is a wide spectrum of use cases, each spanning a wide variety of data types. If you haven’t actually built a 100-terabyte database or distributed file system before, be ready for some painful “character building” surprises. Be ready again at 500 terabytes, at a petabyte, and at 5 petabytes. Big data volumes are like the difference between a short weekend hike and making it past base camp on Mount Everest. Your data management skills will be tested.

During the webinar, our experts agreed: Hadoop and the data warehouse can coexist peacefully. They should be applied to the right workloads and share data as often as possible. When a workload is defined, it becomes clear that some data belongs in the data warehouse while other types of data may be more appropriate for Hadoop. Once you have put your data into its enterprise residence, each platform will feed its various applications.

In conclusion, leveraging a framework such as the TCOD model discussed during the webinar gives you a solid plan for approaching your big data challenges and, ultimately, for solving them.

Here are some additional resources for further information:

Total Cost of Data Webinar

Big Data—What Does It Really Cost? (white paper)

The Real Cost of Big Data (Spreadsheet)

TCOD presentation slides (PDF)

 

The recent webinar by Richard Winter and Bob Page hammered home key lessons about the cost of workloads running on Hadoop and data warehouses. Richard runs WinterCorp, a consulting company that has been implementing huge data warehouses for more than 20 years. Bob Page is Vice President of Products at Hortonworks; before that he ran big data projects at Yahoo! and eBay. The webinar explored Richard’s cost model for running various workloads on Hadoop and on an enterprise data warehouse (EDW). Richard built the cost model during a consulting engagement with a marketing executive of a large financial services company who was launching a big data initiative. She had people coming to her saying “you should do it in Hadoop” and others saying “you should do it in the data warehouse.” Richard’s cost model helped her settle some debates.

The Total Cost of Data analysis results are the basis for the webinar. What separates Richard’s cost framework from most others is that it includes more than just upfront system costs. The TCOD model also includes five years of programmer labor, data scientist labor, end-user labor, and maintenance upgrades, plus power and cooling. Richard said there are 60 cost metrics in the model. He recommends companies download the TCOD spreadsheet and insert actual local costs, since system and labor costs differ by city and country.
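
To show the shape of such a roll-up (and emphatically not Richard’s actual spreadsheet formulas), here is a toy five-year total in Python; every bucket and number is a placeholder to be replaced with local rates from the TCOD spreadsheet.

```python
# Toy TCOD-style roll-up: a five-year total across system and labor
# buckets. Every figure is a placeholder; the real model has ~60 cost
# metrics and should come from the TCOD spreadsheet with local rates.
YEARS = 5

annual_costs = {
    "hardware_amortization": 400_000,
    "software_licenses":     250_000,
    "support_and_upgrades":  150_000,
    "power_and_cooling":      60_000,
    "programmer_labor":      900_000,
    "data_scientist_labor":  500_000,
    "end_user_labor":        700_000,
}

total = sum(annual_costs.values()) * YEARS
print(f"Five-year total cost of data: ${total:,}")
```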

For the Hadoop data lake workload (aka data refinery), labor costs were fairly close between Hadoop and the data warehouse, while system costs favored Hadoop. For the data warehouse workload, the data warehouse system cost was high (remember the power and cooling?) while Hadoop’s labor costs skyrocketed. Long story short: Hadoop as a data lake is lower cost than a data warehouse, and the data warehouse is lower cost for complex queries and analytics.

There was general agreement that Hadoop is a cost-effective platform for ETL work -- staging raw data and transforming it into refined value. But when asked “should we offload ELT/ETL to Hadoop?” Bob Page said:

“I think it’s going to be data dependent. It also depends on what the skills are in the organization. I experienced it myself when I was running big data platforms. If there is a successful implementation on the EDW today, there may be a couple reasons why it makes sense to keep it there. One reason is there may be years and years of business logic encoded, debugged, and vetted. Moving that to another platform with its inherent differences, you might ask “what’s the value of doing that?” It may take a couple years to get that right and in the end all you have done is migrate to another platform. I would prefer to invest those resources in adding additional value to the organization rather than moving sideways to another platform.”

 


When the data warehouse workload was costed out, Hadoop’s so-called $1,000 per terabyte turned out to be an insignificant part of the total. Hadoop’s cost skyrockets because thousands of queries must be hand-coded by high-priced Hadoop programmers and moderately priced Java programmers over five years. The OPEX side of the pie chart was huge when the data warehouse workload was applied to Hadoop.

Richard explained:

“The total cost of queries is much lower on the EDW than on Hadoop. SQL is a declarative language -- you only have to tell it what you want. In Hadoop you use a procedural language. In Hadoop you have to tell the system how to find the data, how to bring it together, and what manipulations are needed to deliver the results. With the data warehouse, there is a sophisticated query optimizer that figures all of that out automatically for you. The cost of developing the query on the data warehouse is lower because of the automation provided.”
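
The contrast Richard draws is easy to see side by side. A declarative SQL statement names the result; a hand-coded, MapReduce-style job spells out the steps. A toy Python illustration, with the SQL equivalent in a comment:

```python
# Declarative vs. procedural, in miniature.
# The SQL version states the result you want:
#   SELECT region, SUM(amount) FROM sales GROUP BY region;
# The hand-coded version spells out how to produce it -- the kind of
# logic a MapReduce job repeats for every new question.
sales = [("east", 100), ("west", 250), ("east", 75), ("north", 40)]

totals = {}
for region, amount in sales:  # scan the data ourselves
    totals[region] = totals.get(region, 0) + amount  # group + aggregate

print(totals)  # {'east': 175, 'west': 250, 'north': 40}
```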

 

Given the huge costs for Hadoop carrying a data warehouse workload, I asked Bob whether he agreed with Richard’s assessment. “Does it pass the sniff test?” I asked. Bob Page replied:

“We don’t see anybody today trying to build an EDW with Hadoop. This is a capability issue not a cost issue. Hadoop is not a data warehouse. Hadoop is not a database. Comparing these two for an EDW workload is comparing apples to oranges. I don’t know anybody who would try to build an EDW in Hadoop. There are many elements of the EDW on the technical side that are well refined and have been for 25 years. Things like workload management, the way concurrency works, and the way security works -- there are many different aspects of a modern EDW that you are not going to see in Hadoop today. I would not see these two as equivalent. So –no– it doesn’t pass the sniff test.”

Bob’s point – in my opinion – is that the Hadoop-as-EDW cost model is invalid since Hadoop is not designed to handle EDW workloads. Richard said he “gave Hadoop the benefit of the doubt,” but I suspect the comparison was baked into his consulting contract with the marketing executive. Ultimately, Richard and Bob agree from different angles.

There are a lot of press articles and zealots on the web who will argue with these results. But Richard and Bob have hands-on credentials far beyond most people’s. They have worked with dozens of big data implementations from 500 terabytes to tens of petabytes. Please spend the time to listen to their webinar for an unbiased view. The biased view – me – didn’t say all that much during the webinar.

Many CFOs and CMOs are grappling with the question “When do we use Hadoop and when should we use the data warehouse?” Pass them the webinar link, call Richard, or call Bob.

 

Total Cost of Data Webinar

Big Data—What Does It Really Cost? (white paper)

The Real Cost of Big Data (Spreadsheet)

TCOD presentation slides (PDF)

Big Apple Hosts the Final Big Analytics Roadshow of the Year

Posted on: November 26th, 2013 by Teradata Aster

 

Speaking of ending things on a high note, New York City on December 6th will play host to the final event in the Big Analytics 2013 Roadshow series. Big Analytics 2013 New York is taking place at the Sheraton New York Hotel and Towers in the heart of Midtown on bustling 7th Avenue.

Reflecting on the illustrious journey of the Big Analytics 2013 Roadshow -- kicking off in San Francisco, then traveling through major international destinations including Atlanta, Dallas, Beijing, Tokyo, and London before culminating in the Big Apple -- it truly encapsulated today’s appetite for collecting, processing, understanding, and analyzing data.

Big Analytics Roadshow 2013 stops in Atlanta

Drawing business and technical audiences across the globe, the roadshow afforded attendees an opportunity to learn more about the convergence of technologies and methods like data science, digital marketing, data warehousing, Hadoop, and discovery platforms. Going beyond the “big data” hype, the event offered learning opportunities on how technologies and ideas combine to drive real business innovation. Our unyielding focus on results from data is truly what made the events so successful.

Continuing the rich lineage of delivering quality big data information, the New York event promises a tremendous amount of big data learning and education. The keynotes feature such industry luminaries as Dan Vesset, Program VP of Business Analytics at IDC; Tasso Argyros, Senior VP of Big Data at Teradata; and Peter Lee, Senior VP of TIBCO Software.

Teradata team at the Dallas Big Analytics Roadshow


The keynotes will be followed by three tracks: Big Data Architecture, Data Science & Discovery, and Data-Driven Marketing. Each track features industry experts like Richard Winter of WinterCorp, John O’Brien of Radiant Advisors, and John Lovett of Web Analytics Demystified. They will be joined by vendor presentations from Shaun Connolly of Hortonworks, Todd Talkington of Tableau, and Brian Dirking of Alteryx.

As with every Big Analytics event, it’s an exciting opportunity to hear firsthand from leading organizations like Comcast, Gilt Groupe, and Meredith Corporation on how they are using big data analytics and discovery to deliver tremendous business value.

In summary, the event promises to be nothing less than the Oscars of Big Data and will bring together the who’s who of the Big Data industry. So, mark your calendars, pack your bags and get ready to attend the biggest Big Data event of the year.