Hadoop

Hadoop Summit June 2015: 4 Takeaways

Posted on: June 18th, 2015 by Cesar Rojas

 

For those in data—the developers, architects, administrators and analysts who capture, distill and integrate complex information for their organizations—the Hadoop Summit is one of the most important events of the year. We get to talk, share and learn from each other about how we can make Hadoop key to the enterprise data architecture.

The 2015 conference, held this month in San Jose, Calif., lived up to its billing. As a sponsor, Teradata had a big presence, including a booth that provided real-time demonstrations of our data solutions, as well as a contribution to the dialogue, with experts leading informative talks.

  • Peyman Mohajerian and Bill Kornfeld from Think Big spoke on the new business value of a data lake strategy.
  • Teradata’s Justin Borgman and Chris Rocca explored the future of Hadoop and SQL.

Over the course of the conference some big themes emerged. Here’s our insider look at the top takeaways from the 2015 Hadoop Summit:

1. Have no fear.

Yes, big data is here to stay.  And the opportunities to be gained are too great to let fear of failure guide your organization’s actions. David T. Lin, leader and evangelist of cloud platform engineering for Symantec, summed it up well: “Kill the fear. Haters to the left. Get it started and go.”

2. Take it step by step.

There’s an abundance of paths you can take to use and derive insights from your data.  Start small and scale. Hemal Gandhi, director of data engineering at One Kings Lane, said a good way to do that is to think like a startup, which often runs on innovation and agility. “There are lots of challenges in building highly scalable big data platforms … we took an approach that allows us to build a scalable data platform rapidly.”

 

3. Use predictive analytics.

Predictive analytics are worth the risk because they help uncover an organization’s next-best action to progress toward a goal. Alexander Gray, CTO of Skytree, discussed the benefits of “bigger” data and how those benefits can be quantified—in dollar terms. Because data size is a basic lever for predictive power, Gray said, “increasing business value is achieved by increasing predictive power.”

4. Personalize customer experiences.

Siloed applications combined in the Lambda architecture allow you to give your customers an experience that is tailored to their needs. Russell Foltz-Smith, vice president of data platform at TrueCar, said his system allows his company to accurately identify, assess value, predict and prescribe “who, what and where”—giving customers the transparency they’re increasingly demanding. “We need to make everything easily accessible,” Foltz-Smith said. “We are moving to a contextually aware, intelligent search engine. You have to open it up and let people forage through your data to find what they need.”

Were you able to attend the Hadoop Summit or follow it online? What lessons did you take away from the event? Share your top Hadoop Summit insights in the comments below.

Regulating Data Lake Temperature

Posted on: June 15th, 2015 by Mark Cusack

 

By Mark Cusack, Chief Architect, Teradata RainStor

One of the entertaining aspects of applying physical analogies to data technology is seeing how far you can push the analogy before it falls over or people get annoyed.  In terms of analogical liberties, I’d suggest that the data lake occupies the number one spot right now.  It’s almost mandatory to talk of raw data being pumped into a data lake, of datamarts drawing on filtered data from a lakeside location, and of data scientists plumbing the data depths for statistical insight.

This got me thinking about what other physical processes affecting real lakes I could misappropriate.  I am a physicist, so I’ll readily misuse physical phenomena and processes to help illustrate logical processes if I think I can get away with it.  There are two important processes in real lakes that are worth bending out of shape to fit our illustrative needs. These are stratification and turnover.

Data Stratification

Let’s look at stratification first.  During the summer months, the water at the surface of a proper lake heats up, providing a layer of insulation to the colder waters below, which results in layers of water with quite distinct densities and temperatures.  Right away we can adopt the notion of hot and cold data as stratified layers within our data lake.  This isn’t a completely terrible analogy, as the idea of data temperature based on access frequency is well established, and Teradata has been incorporating hot and cold running data storage into its Integrated Data Warehouse for a while now.

Storing colder data is something we’re focused on at Teradata RainStor too.  One of RainStor’s use cases involves offloading older, colder data from a variety of RDBMS in order to buy back capacity from those source systems.  RainStor archives the low temperature data in a highly compressed – dense – form in a data lake, while still providing full interactive query access to the offloaded data. In this use case, RainStor is deployed in a secondary role behind one or more primary RDBMS.  Users can query this cold layer of data in RainStor directly via RainStor’s own parallel SQL query engine.  In addition, Teradata Integrated Data Warehouse users are able to efficiently query data stored in RainStor running on Hadoop via the Teradata® QueryGrid™.
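To make the hot/cold split concrete, here is a minimal sketch of the kind of tiered view this offload use case implies. It assumes a QueryGrid foreign server (called rainstor_hadoop here) has already been defined by the DBA; the schema, table, and column names are hypothetical, and the exact syntax varies by release.

```sql
-- Illustrative sketch only: names are hypothetical and QueryGrid syntax varies by release.
-- Recent ("hot") rows stay in the warehouse; older ("cold") rows have been
-- offloaded to RainStor on Hadoop and are reached through a QueryGrid foreign server.
REPLACE VIEW analytics.all_transactions AS
  SELECT txn_id, account_id, txn_date, amount
  FROM   edw.transactions_current                          -- hot data in Teradata
  UNION ALL
  SELECT txn_id, account_id, txn_date, amount
  FROM   archive_db.transactions_history@rainstor_hadoop;  -- cold data in RainStor

-- Users query the single view and see hot and cold history together.
SELECT account_id, SUM(amount)
FROM   analytics.all_transactions
WHERE  txn_date >= DATE '2009-01-01'
GROUP  BY account_id;
```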

Increasingly, however, RainStor is being deployed on a data lake as more than just an archive for cold data.  It’s being deployed as the system-of-record for structured data – as the primary repository for a mix of data of different temperatures and from different sources, all stored with original fidelity.   The common feature of this mixed data is that it doesn’t change, and so it fits in well with RainStor’s immutable data model, which can store and manage data on Hadoop and also on compliance-oriented WORM devices.

Data Turnover

The mixing of the data layers in the system-of-record use case is analogous to the turnover process in real lakes.  In winter months the upper layers of water cool and descend, displacing deeper waters to cause a mixing or turnover of the lake.  The turnover process is important in a watery lake as it mixes oxygen-poor water lower down with oxygen-rich surface water, supporting the ecosystem at all lake depths.

The lack of data stratification in a data lake is also important since one data scientist’s cold data is another one’s hot data.  By providing the same compression, SQL query, security and data life-cycle management capabilities to all data stored in RainStor, a data scientist pays no penalty for accessing the raw data in whatever way they choose to, be it through RainStor’s own SQL engine, Hive, Pig, MapReduce, HCatalog, or via the QueryGrid.

I’ve stretched the data lake metaphor to its limits in this post. The serious point is that data lakes are no longer seen as being supplied from a single operational source, as per the original definition.  They may be fed from a range of sources, with the data itself varying in structure.  Not only is schema flexibility a requirement for many data scientists, so too is the need for equally fast access to all data in the lake, free from the data temperature prejudices that might exist in upstream systems.

 

Mark Cusack joined Teradata in 2014 as part of its RainStor acquisition. As a founding developer and Chief Architect at RainStor, he has worked on many different aspects of the product since 2004. Most recently, Mark led the efforts to integrate RainStor with Hadoop and with Teradata. He holds a master’s degree in computing and a PhD in physics.

 

It happens every few years and it’s happening again. A new technology comes along and a significant segment of the IT and business community want to toss out everything we’ve learned over the past 60 years and start fresh. We “discover” that we’ve been wasting time applying unnecessary rigor and bureaucracy to our projects. No longer should we have to wait three to six months or longer to deliver technical solutions to the business. We can turn these things around in three to six days or even less.

In the mid-1990s, I was part of a team that developed a “pilot” object-oriented, client-server (remember when these were the hot buzzwords?) application to replenish raw materials for a manufacturing function. We were upending the traditional mainframe world by delivering a solution quickly and iteratively with a small team. When the end users started using the application in real life, it was clear they were going to rely on it to do their jobs every day. Wait, was this a pilot or…? I would come into work in the morning, walk into a special room that housed the application and database servers, check the logs, note any errors, make whatever fixes needed to be made, re-run jobs, and so on.

It wasn’t long before this work began to interfere with my next project, and the end users became frustrated when I wasn’t available to fix problems quickly. It took us a while and several conversations with operations to determine that “production” didn’t just mean “the mainframe”. “Production” meant that people were relying on the solution on a regular basis to do their jobs. So we backtracked and started talking about what kind of availability guarantees we could make, how backup and recovery should work, how we could transition monitoring and maintenance to operations, and so on. In other words, we realized what we needed was a traditional IT project that just happened to leverage newer technologies.

This same scenario is happening today with Hadoop and related tools. When I visit client organizations, a frightening number will have at least one serious person saying something like, “I really don’t think ‘data warehousing’ makes sense any more. It takes too long. We should put all our data in Hadoop and let our end users access whatever they want.” It is indeed a great idea to establish an environment that enables exploration and quick-turnaround analysis against raw data and production data. But to position this approach as a core data and analytics strategy is nothing short of professional malpractice.

The problem is that people are confusing experimentation with IT projects. There is a place for both, and there always has been. Experimentation (or discovery, research, ad-hoc analysis, or whatever term you wish to use) should have lightweight processes and data management practices – it requires prioritization of analysis activity, security and privacy policies and implementation, some understanding of available data, and so on, but it should not be overburdened with the typical rigor required of projects that are building solutions destined for production. Once a prototype is ready to be used on a regular basis for important business functions, that solution should be built through a rigorous IT project leveraging an appropriate – dare I say it – solution development life cycle (SDLC), along with a comprehensive enterprise architecture plan including, yes, a data warehouse that provides integrated, shared, and trusted production data.

An experimental prototype should never be “promoted” to a production environment. That’s what a project is for. Experimentation can be accomplished with Hadoop, relational technology, Microsoft Office, and many other technologies. These same technologies can also be used for production solutions. So, it’s not that “things are done differently and more quickly in Hadoop”. Instead, it’s more appropriate to say that experimentation is different than an IT project, regardless of technology.

Yes, we should do everything we can to reduce unnecessary paperwork and to speed up delivery using proper objective setting, scoping, and agile development techniques. But that is different than abandoning rigor altogether. In fact, using newer technologies in IT projects requires more attention to detail, not less, because we have to take the maturity of the technology into consideration. Can it meet the service level needs of a particular solution? This needs to be asked and examined formally within the project.

Attempting to build production solutions using ad-hoc, experimental data preparation and analysis techniques is like building a modern skyscraper with a grass hut mentality. It just doesn’t make any sense.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries. 

Taking Charge of Your Data Lake Destiny

Posted on: April 30th, 2014 by Cesar Rojas

 

One of the most interesting areas of my job is having the opportunity to take an active role in helping shape the future of big data analytics. A significant part of this is the maturation of the open source offerings available to customers and how they can help address today’s analytic conundrums. Customers are constantly looking for new and effective ways to organize their data and they want to build the systems that will empower them to be successful across their organizations. But with the proliferation of data and the rise of data types and analytical models, solving this challenge is becoming increasingly complex.

One of the solutions that has become popular is the concept of a data lake. The idea of a data lake emerged as users created new types of data that needed to be captured and exploited across the enterprise. The concept is also tied quite closely to Apache Hadoop and its ecosystem of open source projects, so, with two of my main focus areas (big data analytics and Hadoop) being brought together, it is an area to which I pay close attention. Data lakes are designed to tackle some of the emerging big data challenges by offering a new way to organize and build the next generation of systems. They provide a cost-effective and technologically refined way to approach and solve big data challenges. And while data lakes are an important component of the logical data warehouse – because they are designed to give users choices for better managing and utilizing data within their analytical ecosystem – many users are finding that the data lake is also a natural evolution of their current Apache Hadoop ecosystem and their existing data architecture.

Where do we begin? Quite simply, several questions need to be answered before you start down this path. For instance, it’s important to understand how the data lake relates to your existing enterprise data warehouse and how the two work together. And quite possibly the most important question is, “What best practices should be leveraged to ensure the resulting strategy drives business value?”

A recent white paper, written by CITO Research and sponsored by Teradata and Hortonworks, takes a close look at the data lake and provides answers to all of the above questions, and then some. Without giving away too much of the detail, I thought I would capture a few of the points that impress me most in this paper.

The data lake has come a long way since its initial entry onto the big data scene. Its first iteration had several limitations that made it daunting to general users. The original data lakes were batch-oriented, offered very limited ability for users to interact with the data, and required expertise with MapReduce and other scripting and query tools. Those factors, among others, limited broad adoption. Today, however, the landscape is changing. With the arrival of Hadoop 2, and more specifically the 2.1 release of the Hortonworks Data Platform, data lakes are evolving. New Hadoop projects brought better resource management and application multi-tenancy, allowing multiple workloads to run on the same cluster and enabling users from different business units to effectively refine, explore, and enrich data. Today, enterprise Hadoop is a full-fledged data lake, with new capabilities being added all the time.

While the capabilities of the data lake have evolved over the last few years, so has the world of big data. Companies everywhere started creating data lakes to complement the capabilities of their data warehouses, but now they must also tackle creating a logical data warehouse in which the data lake and the enterprise data warehouse can each be maximized individually -- and yet support each other in the best way possible.

The enterprise data warehouse plays a critical role in solving big data challenges, and together with the data lake, the possibilities can deliver real business value. The enterprise data warehouse is a highly engineered, sophisticated system that provides a single version of the truth that can be used over and over again. And, like a data lake, it supports batch workloads. Unlike a data lake, the enterprise data warehouse also supports simultaneous use by thousands of concurrent users performing reporting and analytic tasks.

There are several impressive uses for a data lake and several beneficial outcomes can result. It is very worthwhile to learn more about data lakes and how they can help you to store and process data at low cost. You can also learn how to create a distributed form of analytics, or learn how the data lake and the enterprise data warehouse have started to work together as a hybrid, unified system that empowers users to ask questions that can be answered by more data and more analytics with less effort. To start learning about these initiatives, download our whitepaper here.

By Cesar Rojas

Take a Giant Step with Teradata QueryGrid

Posted on: April 29th, 2014 by Dan Graham

 

Teradata 15.0 has gotten tremendous interest from customers and the press because it enables SQL access to native JSON data. This heralds the end of the belief that data warehouses can’t handle unstructured data. But there’s an equally momentous innovation in this release called Teradata QueryGrid.

What is Teradata QueryGrid?
In Teradata’s Unified Data Architecture (UDA), there are three primary platforms: the data warehouse, the discovery platform, and the data platform. In the UDA diagram, huge gray arrows represent data flowing between these systems. A year or two ago, these arrows were extract files moved in batch mode.

Teradata QueryGrid is both a vision and a technology. The vision --simply said-- is that a business person connected to the Teradata Database or Aster Database can submit a single SQL query that joins together data from a second system for analysis. There’s no need to plead with the programmers to extract data and load it into another machine. The business person doesn’t have to care where the data is – they can simply combine relational tables in Teradata with tables or flat files found in Hadoop, on demand. Imagine a data scientist working on an Aster discovery problem and needing data from Hadoop. By simply adjusting the queries she is already using, she can fetch Hadoop data and combine it with tables in the Aster Database. That should be a huge “WOW” all by itself, but let’s look further.

You might be saying “That’s not new. We’ve had data virtualization queries for many years.” Teradata QueryGrid is indeed a form of data virtualization. But Teradata QueryGrid doesn’t suffer from the normal limitations of data virtualization such as slow performance, clogged networks, and security concerns.

Today, the vision is translated into reality as connections between the Teradata Database and Hadoop, as well as between the Aster Database and Hadoop. Teradata QueryGrid also connects the Teradata data warehouse to Oracle databases. In the near future, it will extend to all combinations of UDA servers, such as Teradata to Aster, Aster to Aster, Teradata to Teradata, and so on.

Seven League Boots for SQL
With QueryGrid, you can add a clause in a SQL statement that says “Call up Hadoop, pass Hive a SQL request, receive the Hive results, and join them to the data warehouse tables.”  Running a single SQL statement spanning Hadoop and Teradata is amazing in itself – a giant step forward. Notice too that all the database security, advanced SQL functions, and system management in the Teradata or Aster system is supporting these queries. The only effort required is for the database administrator to set up a “view” that connects the systems. It’s self-service for the business user after that. Score: complexity zero, business users one.
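As a rough illustration of that self-service setup, the sketch below assumes the DBA has already defined a QueryGrid foreign server (called hdp_server here) pointing at Hive. The schema, table, and view names are hypothetical, and the syntax is approximate rather than release-accurate.

```sql
-- One-time DBA step (hypothetical names): wrap a Hive table in a view
-- so business users never see the foreign-server plumbing.
REPLACE VIEW mart.web_events AS
  SELECT *
  FROM   weblogs.page_views@hdp_server;   -- Hive table reached through QueryGrid

-- Self-service step: any user or BI tool joins Hadoop data with warehouse
-- tables in a single SQL statement.
SELECT c.customer_id, c.segment, COUNT(*) AS page_views
FROM   mart.web_events w
JOIN   edw.customer    c ON c.customer_id = w.customer_id
GROUP  BY c.customer_id, c.segment;
```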

Parallel Performance, Performance, Performance
Historically, data virtualization tools lack the ability to move data between systems in parallel. Such tools send a request to a remote database and the data comes back serially through an Ethernet wire. Teradata QueryGrid is built to connect to remote systems in parallel and exchange data through many network connections simultaneously. Wanna move a terabyte per minute? With the right configurations it can be done. Parallel processing by both systems makes this incredibly fast. I know of no data virtualization system that does this today.

Inevitably, the Hadoop cluster will have a different number of servers than the Teradata or Aster MPP systems.  The Teradata and Aster systems start the parallel data exchange by matching up units of parallelism between the two systems.  That is, all the Teradata parallel workers (called AMPs) connect to a buddy Hadoop worker node for maximum throughput.  Anytime the configuration changes, the worker match-up changes.  This is non-trivial, rocket-science-class technology.  Trust me – you don’t want to do this yourself, and the worst situation would be to expose it to the business users.  But Teradata QueryGrid does it all for you, completely invisible to the user.

Put Data in the Data Lake FAST
Imagine that complex predictive analytics using R® or SAS® are run inside the Teradata data warehouse as part of a merger and acquisition project. In this case, we want to pass this data to the Hadoop Data Lake where it is combined with temporary data from the company being acquired. With moderately simple SQL stuffed in a database view, the answers calculated by the Teradata Database can be sent to Hadoop to help finish up some reports. Bi-directional data exchange is another breakthrough in Teradata QueryGrid, new in release 15.0. The common thread in all these innovations is that the data moves from the memory of one system to the memory of the other. No extracts, no landing the data on disk until the final processing step – and sometimes not even then.
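Here is a hedged sketch of what that export might look like, assuming a foreign server defined with export enabled and, as before, hypothetical object names.

```sql
-- Hypothetical sketch: push scores computed in the Teradata warehouse out to a
-- Hive table in the data lake, memory to memory, with no intermediate extract file.
INSERT INTO staging.acquisition_scores@hdp_server
SELECT account_id, churn_score, clv_estimate
FROM   edw.model_scores
WHERE  score_date = CURRENT_DATE;
```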

Push-down Processing
What we don’t want to do is transfer terabytes of data from Hadoop and throw away 90% of it since it’s not relevant. To minimize data movement, Teradata QueryGrid sends the remote system SQL filters that eliminate records and columns that aren’t needed. An example constraint could be “We only want records for single women age 30-40 with an average account balance over $5000. Oh, and only send us the account number, account type, and address.” This way, the Hadoop system discards unnecessary data so it doesn’t flood the network with data that will be thrown away. After all the processing is done in Hadoop, data is joined in the data warehouse, summarized, and delivered to the user’s favorite business intelligence tool.
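Written as SQL, that example constraint might look something like this (hypothetical names again); the key point is that QueryGrid ships the WHERE predicates and the narrow column list to Hive, so Hadoop filters the data before anything crosses the network.

```sql
-- Only the three requested columns and the qualifying rows come back
-- from Hadoop; the filters below are pushed down to Hive.
SELECT a.acct_nbr, a.acct_type, a.address
FROM   accounts_raw@hdp_server a          -- Hive table behind QueryGrid
WHERE  a.marital_status = 'Single'
  AND  a.gender = 'F'
  AND  a.age BETWEEN 30 AND 40
  AND  a.avg_balance > 5000;
```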

Teradata QueryGrid delivers some important benefits:
• It’s easy to use: any user with any BI tool can do it
• Low DBA labor: it’s mostly setting up views and testing them once
• High performance: reducing hours to minutes means more accuracy and faster turnaround for demanding users
• Cross-system data on demand: don’t get stuck in the programmers’ work queue
• Teradata/Aster strengths: security, workload management, system management
• Minimum data movement improves performance and reduces network use
• Move the processing to the data

Big data is now taking giant steps through your analytic architecture --frictionless, invisible, and in parallel. Nice boots!

LA kicks off the 2014 Teradata User Group Season

Posted on: April 22nd, 2014 by Guest Blogger

 

By Rob Armstrong,  Director, Teradata Labs Customer Briefing Team

After presenting for years at the Teradata User Group meetings, it was refreshing to see some changes in this roadshow.  While I had my usual spot on the agenda to present Teradata’s latest database release (15.0), we had some hot new topics, including cloud and Hadoop; more business-level folks were there; more companies were researching Teradata’s technology (vs. just current users); and there was a hands-on workshop the following day for the more technically inclined, walking through real-world Unified Data Architecture™ (UDA) use cases from a Teradata customer.  While LA tends to be a smaller venue than most, the room was packed and we had 40% more attendees than last year.

I would be remiss if I did not give a big thanks to the partner sponsors of the user group meeting.  In LA we had Hortonworks and Dot Hill as our gold and silver sponsors.  I took a few minutes to chat with them and found out some interesting upcoming items.  Most notably, Lisa Sensmeier from Hortonworks talked to me about Hadoop Summit, which is coming up in San Jose, June 3-5.  Jim Jonez, from Dot Hill, gave me the latest on their newest “Ultra Large” disk technology, where they’ll have 48 1 TB drives in a single 2U enclosure.  It is not in the Teradata lineup yet, but we are certainly intrigued for the proper use case.

Now, I’d like to take a few minutes to toot my own horn about the Teradata Database 15.0 presentation that had some very exciting elements to help change the way users get to and analyze all of their data.  You may have seen the recent news releases, but if not, here is a quick recap:

  • Teradata 15.0 continues our Unified Data Architecture™ with the new Teradata QueryGrid.  This is the new environment to define and access data from Teradata to other data servers such as Apache Hadoop (Hortonworks), the Teradata Aster Discovery Platform, Oracle, and others.  This lays the foundation for an extension to even more foreign data servers.  15.0 simplifies the whole definition and usage, and adds bi-directional data movement and predicate pushdown.  In a related session, Cesar Rojas provided some good recent examples of customers taking advantage of the entire UDA ecosystem, where data from all of the Teradata offerings was used together to generate new actions.
  • The other big news in 15.0 is the inclusion of the JSON data type.  This allows customers to store JSON documents directly in a column and then apply “schema on read” for much greater flexibility with greatly reduced IT effort.  As the JSON document changes, no table or database changes are necessary to absorb the new content.
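For illustration, here is a minimal sketch of what storing and reading such a document might look like. The table and the JSONPath expressions are hypothetical, and the method names should be checked against the documentation for your release.

```sql
-- Hypothetical example of the 15.0 JSON type with schema-on-read access.
CREATE TABLE web.events (
  event_id INTEGER,
  payload  JSON(16000)       -- the raw JSON document, stored as received
);

-- New attributes can show up in the documents without any DDL change;
-- they are simply picked out at query time with a JSONPath expression.
SELECT event_id,
       payload.JSONExtractValue('$.device.os')  AS device_os,
       payload.JSONExtractValue('$.cart.total') AS cart_total
FROM   web.events;
```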

Keep your eyes and ears open for the next Teradata User Group event coming your way, or better yet, just go to the webpage: http://www.teradata.com/user-groups/ to see where the bus stops next and to register.  The TUGs are free of charge.  Perhaps we’ll cross paths as I make the circuit? Until then, ‘Keep Calm and Analyze On’ (as the cool kids say).

 Since joining Teradata in 1987, Rob Armstrong has worked in all areas of the data warehousing arena.  He has gone from writing and supporting the database code to implementing and managing systems at some of Teradata’s largest and most innovative customers.  Currently Rob provides sales and marketing support by traveling the globe and evangelizing the Teradata solutions.

 

The best Strata session that I attended was the overview Kurt Brown gave of the Netflix data platform, which contained hype-deflating lessons and many nuggets of tech advice straight from one of the most intense computing environments on the planet.

Brown, who as a director leads the design and implementation of the data platform, had a cheerful demeanor but demonstrated ruthless judgment and keen insight in his assessment of how various technologies serve the goals of Netflix. It was interesting to me how dedicated he was to both MPP SQL technology and to Apache™ Hadoop.

I attended the session with Daniel Graham, Technical Marketing Specialist of Teradata, who spoke with me afterward about the implications of the Netflix architecture and Brown’s point of view.

SQL vs. Hadoop
Brown rejected the notion that it was possible to build a complete data platform using either SQL technology or Hadoop alone. In his presentation, Brown explained how Netflix made great use of Hadoop, used Hive for various purposes, and had an eye on Presto, but also couldn’t live without Teradata and MicroStrategy.

Brown recalled a conversation in which another leader of a data platform explained that he was discarding all his data warehouse technology and going to put everything on Hive. Brown’s response: “Why would you ever want to do that?”

While Brown said he enjoyed the pressure that open source puts on commercial vendors to improve, he was dedicated to using whatever technology could provide answers to questions in the most cost-effective manner. Brown said he was especially pleased that Teradata was going to be able to support a cloud-based implementation that could run at scale. Brown said that Netflix had upwards of 5 petabytes of data in the cloud, all stored on Amazon S3.

After the session, I pointed out to Graham that the pattern in evidence at Netflix, and at most of the companies acknowledged as the leaders in big data, mimics the recommendation of the white paper “Optimize the Value of All Your Enterprise Data,” which provides an overview of the Teradata Unified Data Architecture™.

The Unified Data Architecture recommends that the data with the most “business value density” be stored in an enterprise data warehouse powered by MPP SQL. This data is used most often and by the most users. Hadoop is used as a data refinery to process flat files or NoSQL data in batch mode.

Netflix is a big data company that arrived at this pattern by adding SQL to a Hadoop infrastructure. Many well-known users of huge MPP SQL installations have added Hadoop.

“Data doesn’t stay unstructured for long. Once you have distilled it, it usually has a structure that is well-represented by flat files,” said Teradata's Graham. “This is the way that the canonical model of most enterprise activity is stored. Then the question is: How do you ask questions of that data? There are numerous ways to make this easy for users, but almost all of those ways pump out SQL that then is used to grab the data that is needed.”

Replacing MPP SQL with Hive or Presto is a non-starter because to really support hundreds or thousands of users who are pounding away at a lot of data, you need a way to provide speedy and optimized queries and also to manage the consumption of the shared resources.

“For over 35 years, Teradata has been working on making SQL work at scale for hundreds or thousands of people at a time,” said Graham. “It makes perfect sense to add SQL capability to Hadoop, but it will be a long time, perhaps a decade or more, before you will get the kind of query optimization and performance that Teradata provides. The big data companies use Teradata and other MPP SQL systems because they are the best tool for the job for making huge datasets of high business value density available to an entire company.”

Efforts such as Tez and Impala will clearly move Hive’s capability forward. The question is how far forward and how fast. We will know that victory has been achieved when Netflix, which uses Teradata in a huge cloud implementation, is able to support its analytical workloads with other technology.

Graham predicts that in 5 years, Hadoop will be a good data mart but will still have trouble with complex parallel queries.

“It is common for a product like Microstrategy to pump out SQL statements that may be 10, 20, or even 50 pages long,” said Graham. “When you have 5 tables, the complexity of the queries could be 5 factorial. With 50 tables, that grows to 50 factorial. Handling such queries is a 10- or 20-year journey. Handling them at scale is a feat that many companies can never pull off.”

Graham acknowledges that an MPP SQL data warehouse extended to support data discovery (e.g., the Teradata Aster Discovery Platform), along with extensions for using Hadoop and graph analytics through enhanced SQL, is needed by most businesses.

Teradata is working to demonstrate that the power of this collection of technology can address some of the unrealistic enthusiasm surrounding Hadoop.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

 

In years past, Strata has celebrated the power of raw technology, so it was interesting to note how much the keynotes on Wednesday focused on applications, models, and how to learn and change rather than on speeds and feeds.

After attending the keynotes and some fascinating sessions, it seems clear that the blinders are off. Big data and data science have been proven in practice by many innovators and early adopters. The value of new forms of data and methods of analysis are so well established that there’s no need for exaggerated claims. Hadoop can do so many cool things that it doesn’t have to pretend to do everything, now or in the future. Indeed, the pattern in place at Facebook, Netflix, the Obama Campaign, and many other organizations with muscular data science and engineering departments is that MPP SQL and Hadoop sit side by side, each doing what they do best.

In his excellent session, Kurt Brown, Director, Data Platform at Netflix, recalled someone explaining that his company was discarding its data warehouse and putting everything on Hive. Brown responded, “Why would you want to do that?” What was obvious to Brown, and what he explained at length, is that the most important thing any company can do is assemble technologies and methods that serve its business needs. Brown demonstrated the logic of creating a broad portfolio that serves many different purposes.

Real Value for Real People
The keynotes almost all celebrated applications and models. Vendors didn’t talk about raw power, but about specific use cases and ease-of-use. Farrah Bostic, a marketing and product design consultant, recommended ways to challenge assumptions and create real customer intimacy. This was a key theme: Use the data to understand a person in their terms not yours. Bostic says you will be more successful if you focus on creating value for the real people who are your customers instead of extracting value from some stilted and limited model of a consumer. A skateboarding expert and a sports journalist each explained models and practices for improving performance. This is a long way from the days when a keynote would show a computer chewing through a trillion records.

Geoffrey Moore, the technology and business philosopher, was in true provocative form. He asserted that big data and data science are well on their way to crossing the chasm because so many upstarts pose existential threats to established businesses. This pressure will force big data to cross the chasm and achieve mass adoption. His money quote: "Without big data analytics, companies are blind and deaf, wandering out onto the Web like deer on the freeway.”

An excellent quote to be sure, but it goes too far. Moore would have been more accurate and less sensational if he said, “Without analytics,” not “Without big data analytics.” The reason that MPP SQL and Hadoop have made such a perfect pair is because more than one type of data and method of analysis is needed. Every business needs all the relevant data it can get to understand the people it does business with.

The Differentiator: A Culture of Analytics
The challenge I see companies facing lies in creating a culture of analytics. Tom Davenport has been a leader in promoting analytics as a means to competitive advantage. In his keynote at Strata Rx in September 2013, Davenport stressed the importance of integration.

In his session at Strata this year, Bill Franks, Chief Analytics Officer at Teradata, put it quite simply, "Big data must be an extension of an existing analytics strategy. It is an illusion that big data can make you an analytics company."

When people return from Strata and roll up their sleeves to get to work, I suspect that many will realize that it’s vital to make use of all the data in every way possible. But one person can only do so much. For data to have the biggest impact, people must want to use it. Implementing any type of analytics provides supply. Leadership and culture create demand. Companies like CapitalOne and Netflix don’t do anything without looking at the data.

I wish there were a shortcut to creating a culture of analytics, but there isn’t, and that’s why it’s such a differentiator. Davenport’s writings are probably the best guide, but every company must figure this out based on its unique situation.

Supporting a Culture of Analytics
If you are a CEO, your job is to create a culture of analytics so that you don’t end up like Geoffrey Moore’s deer on the freeway. But if you have Kurt Brown’s job, you must create a way to use all the data you have, to use the sweet spot of each technology to best effect, and to provide data and analytics to everyone who wants them.

At a company like Netflix or Facebook, creating such a data supply chain is a matter of solving many unique problems connected with scale and advanced analytics. But for most companies, common patterns can combine all the modern capabilities into a coherent whole.

I’ve been spending a lot of time with the thought leaders at Teradata lately and closely studying their Unified Data Architecture. Anyone who is seeking to create a comprehensive data and analytics supply chain of the sort in use at leading companies like Netflix should be able to find inspiration in the UDA, as described in a white paper called “Optimizing the Business Value of All Your Enterprise Data.”

The paper does excellent work in creating a framework for data processing and analytics that unifies all the capabilities by describing four use cases: the file system, batch processing, data discovery, and the enterprise data warehouse. Each of these use cases focuses on extracting value from different types of data and serving different types of users. The paper proposes a framework for understanding how each use case creates data with different business value density. The highest volume interaction takes place with data of the highest business value density. For most companies, this is the enterprise data warehouse, which contains a detailed model of all business operations that is used by hundreds or thousands of people. The data discovery platform is used to explore new questions and extend that model. Batch processing and processing of data in a file system extract valuable signals that can be used for discovery and in the model of the business.

While this structure doesn’t map exactly to that of Netflix or Facebook, for most businesses, it supports the most important food groups of data and analytics and shows how they work together.

The refreshing part of Strata this year is that thorny problems of culture and context are starting to take center stage. While Strata will always be chock full of speeds and feeds, it is even more interesting now that new questions are driving the agenda.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

 

In the Star Trek movies, “the Borg” refers to an alien race that conquers all planets, absorbing the people, technology, and resources into the Borg collective. Even Captain Picard becomes a Borg and chants “We are the Borg. You will be assimilated. Resistance is futile.”

It strikes me that the relational database has behaved similarly since its birth. Over the last thirty years, Teradata and other RDBMS vendors have innovated and modernized, constantly revitalizing what it means to be an RDBMS. But some innovations come from start-up companies that are later assimilated into the RDBMS. And some innovations are reactions to competition. Regardless, many innovations eventually end up in the code base of multiple RDBMS vendor products --with proper respect to patents of course. Here are some examples of cool technologies assimilated into Teradata Database:

• MOLAP cubes storm the market in the late 1990s with Essbase setting the pace and Cognos inventing desktop cubes. MicroStrategy and Teradata team up to build push-down ROLAP SQL into the database for parallel speed. Hyperion Essbase and Teradata also did Hybrid OLAP integration together. Essbase gets acquired, MOLAP cubes fall out of fashion, and in-database ROLAP goes on to provide the best of both worlds as CPUs get faster.

• Early in the 2000s, a startup called Sunopsis shows a distinct advantage of running ELT transformations in-database to get parallel performance with Teradata. ELT takes off in the industry like a rocket. Teradata Labs also collaborates with Informatica to push-down PowerCenter transformation logic into SQL for amazing extract, load, and transform speed. Sunopsis gets acquired. More ETL vendors adopt ELT techniques. Happy DBAs and operations managers meet their nightly batch performance goals. More startups disappear.

• XML and XQuery become the rage in the press -- until most every RDBMS adds a data type for XML, plus shred and unshred operators. XML-only database startups are marginalized.

• The uptick of predictive analytics in the market drives collaboration between Teradata and SAS back in 2007. SAS procs are pushed down into the database to run massively parallel, opening up tremendous performance benefits for SAS users. This leads many RDBMS vendors to copy the technique; SAS is in the limelight, and eventually even Hadoop programmers want to run SAS in parallel. Later we see “R,” Fuzzy Logix, and others run in-database too. Sounds like the proverbial win-win to me.

• In-memory technology from QlikView and TIBCO Spotfire excites the market with order-of-magnitude performance gains. Several RDBMS vendors then adopt in-memory concepts. But in-memory has limitations on memory size and cost vis-à-vis terabytes of data. Consequently, Teradata introduces Teradata Intelligent Memory, which automatically caches hot data in memory while managing many terabytes of hot and cold data on disk. The hottest two to three percent of data -- that is, the data most popular with users -- is managed by temperature, delivering superfast response time. Cool! Or is it hot?

• After reading the Google research paper on MapReduce, a startup called “AsterData” invents SQL-MapReduce (SQL-MR) to add flexible processing to a flexible database engine. This cool innovation leads Teradata to acquire AsterData. Within a year, Aster strikes a nerve across the industry – MapReduce is in-database! This month, Aster earns numerous #1 scores in Ovum’s “Decision Matrix: Selecting an Analytic Database 2013-14” (January 2014). The race is on for MapReduce in-database!

• The NoSQL community grabs headlines with their unique designs and reliance on JSON data and key-value pairs. MongoDB is hot, using JSON data while CouchBase and Cassandra leverage key-value stores. Teradata promptly decides to add JSON data (unstructured data) to the database and goes the extra mile to put JSONPath syntax into SQL. Teradata also adds the name-value-pair SQL operator (NVP) to extract JSON or key-value store data from weblogs. Schema-on-read technology gets assimilated into the Teradata Database. Java programmers are pleased. Customers make plans. More wins.
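To give a flavor of that NVP operator, here is a hedged sketch against a hypothetical weblog table; check the argument order against the Teradata documentation for your release.

```sql
-- Hypothetical weblog table: pull name/value pairs out of a URL query string,
-- e.g. 'campaign=spring&sku=A123&ref=email'.
SELECT NVP(query_string, 'campaign', '&', '=', 1) AS campaign,
       NVP(query_string, 'sku', '&', '=', 1)      AS sku,
       COUNT(*)                                   AS hits
FROM   weblogs.clickstream
GROUP  BY 1, 2;
```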

--------------------------------------------------------------------------------------------------------

“One trend to watch going forward, in addition to the rise of multi-model NoSQL databases, is the integration of NoSQL concepts into relational databases. One of the methods used in the past by relational database vendors to restrict the adoption of new databases to handle new data formats has been to embrace those formats within the relational database. Two prime examples would be support for XML and object-oriented programming.”
- Matt Aslett, The 451 Group, Next-Generation Operational Databases 2012-2016, Sep 17, 2013

--------------------------------------------------------------------------------------------------------

I’ve had conversations with other industry analysts and they’ve confirmed Matt’s opinion: RDBMS vendors will respond to market trends, innovations, and competitive threats by integrating those technologies into their offering. Unlike the Borg, a lot of these assimilations by RDBMS are friendly collaborations (MicroStrategy, Informatica, SAS, Fuzzy Logix, Revolution R, etc.). Others are just the recognition of new data types that need to be in the database (JSON, XML, BLOBs, geospatial, etc.).

Why is it good to have all these innovations inside the major RDBMS’s? Everyone is having fun right now with their science projects because hype is very high for this startup or that startup or this shiny new thing. But when it comes time to deploy production analytic applications to hundreds or thousands of users, all the “ities” become critical all of a sudden – “ities” that the new kids don’t have and the RDBMS does. “ities” like reliability, recoverability, security, and availability. Companies like Google can bury shiny new 1.oh-my-god quality software in an army of brilliant computer scientists. But Main Street and Wall Street companies cannot.

More important, many people are doing new multi-structured data projects in isolation -- such as weblog analysis, sensor data, graph analysis, or social text analysis. Soon enough they discover the highest value comes from combining that data with all the rest of the data that the organization has collected on customers, inventories, campaigns, financials, etc. Great, I found a new segment of buyer preferences. What does that mean to campaigns, sales, and inventory? Integrating new big data into an RDBMS is a huge win going forward – much better than keeping the different data sets isolated in the basement.

Like this year’s new BMW or Lexus, RDBMSs modernize; they define modern. But relational database systems don’t grow old; they don’t rust or wear out. RDBMSs evolve to stay current and constantly introduce new technology.

We are the RDBMS! Technology will be assimilated. Resistance is futile.