Hadoop

The Benefits and Evolution of the Hadoop Appliance

Posted on: July 9th, 2015 by Chris Twogood

 

Running Hadoop on an appliance offers significant benefits, but as Hadoop workloads become more sophisticated, so too must the appliance. That’s exactly why we’re releasing the ‘new’ Teradata Appliance for Hadoop 5. Our new appliance has evolved alongside Hadoop usage scenarios while giving IT organizations more freedom of choice to run diverse workloads. Running Hadoop on an appliance makes more sense than ever before.

If you’re running – or thinking about running – Hadoop on an appliance, you’re not alone. According to an ESG survey reported on by SearchDataCenter.com, 21% of IT organizations are considering dedicated analytics appliances. That’s the same percentage considering public cloud solutions and double the share considering a public/private hybrid deployment. What is driving the adoption of Hadoop appliances?

5 Key Benefits of Running Hadoop on an Appliance

Organizations that choose to deploy Hadoop on an appliance versus rolling out their own solution realize five important benefits.

  1. Hadoop is delivered ready to run.

We’ve heard industry experts say that it can take IT organizations six to eight months to roll out a Hadoop implementation on their own. With a Teradata appliance, we’ve done all the hard work in terms of installing and configuring multiple types of software as well as installing and configuring the operating system, networking and the like. You simply plug it in, and within days you are up and running.

  2. We’ve built high availability into our Hadoop appliances.

The Teradata Vital Infrastructure (TVI) proactively detects and resolves incidents. In fact, up to 72% of all hardware- and software-related incidents are detected and resolved by TVI before the customer even knows about them. We also run BYNET over InfiniBand, which delivers automated network load balancing, automated network failover, redundancy across two active fabrics, and multiple levels of network isolation. These features in Teradata Appliance for Hadoop 5 deliver the high availability IT organizations need in an enterprise-grade solution.

  3. It is Unified Data Architecture ready.

It’s not enough to just efficiently deploy Hadoop. IT organizations must be able to efficiently deploy Hadoop as a seamless part of an interconnected analytics ecosystem. The UDA-ready Hadoop appliance becomes an integral part of the organization’s larger data fabric, with BYNET over InfiniBand interconnect between Hadoop, the Integrated Data Warehouse and Aster big data analytics, and software integration such as QueryGrid, Viewpoint, TDCH, and Smart Loader.

  4. Single vendor support.

An appliance replaces the multiple support contracts IT organizations have with their hardware provider, Hadoop vendor, OS vendor, and various utilities, with a single “hand to shake.” If there’s any problem, one phone call puts you in touch with Teradata’s world-class, 24/7, multi-language support for the entire solution stack. IT organizations are seeing increasing value in this benefit as the Hadoop ecosystem has many moving parts associated with it, and single vendor support provides peace of mind.

  5. Running Hadoop on an appliance lowers your total cost of ownership (TCO).

The cost of a do-it-yourself Hadoop deployment includes much more than the hardware the software runs on. There are also costs associated with configuring the network, installing the OS, configuring the disks, installing the Hadoop environment, tuning the Hadoop environment, and testing. The costs of doing all this work internally add up, making the TCO of an appliance even more attractive.

What’s New with Teradata Appliance for Hadoop 5?

In addition to these five benefits, Teradata Appliance for Hadoop 5 delivers freedom of choice to run a variety of workloads. IT organizations now have more options when they run Hadoop on Teradata Appliance 5.

Recognizing that Hadoop workloads are diverse and evolving, Teradata Appliance for Hadoop 5 is available in three flexible configurations, enabling customers to select the configuration that best fits their workloads.

  • Performance configuration. For real-time processing and other workloads that require significant CPU, I/O, and memory, we offer the performance configuration. This computationally intensive configuration enables organizations to run emerging Hadoop workloads such as streaming, Spark, and SQL on Hadoop. With 24 cores, it has the most cores per node, along with 512GB of RAM and 24 storage disks using 1.2TB drives.
  • Capacity configuration. The capacity configuration allows IT organizations to drive down the cost per terabyte. It is designed for heavy-duty, long-running batch jobs as well as long-term archival and storage. It comes with 128GB to 256GB of RAM and 4TB disk drives.
  • Balance configuration. The balance configuration sits between the performance and capacity configurations, allowing IT organizations to strike the right balance for ETL and analytics jobs. The balance configuration features 24 cores and a 4TB capacity drive.

Learn more about Teradata’s Portfolio for Hadoop.

Hadoop Summit June 2015: 4 Takeaways

Posted on: June 18th, 2015 by Data Analytics Staff

 

For those in data—the developers, architects, administrators and analysts who capture, distill and integrate complex information for their organizations—the Hadoop Summit is one of the most important events of the year. We get to talk, share and learn from each other about how we can make Hadoop key to the enterprise data architecture.

The 2015 conference, held this month in San Jose, Calif., lived up to its billing. As a sponsor, Teradata had a big presence, including a booth that provided real-time demonstrations of our data solutions, as well as a contribution to the dialogue, with experts leading informative talks.

  • Peyman Mohajerian and Bill Kornfeld from Think Big spoke on the new business value of a data lake strategy.
  • Teradata’s Justin Borgman and Chris Rocca explored the future of Hadoop and SQL.

Over the course of the conference some big themes emerged. Here’s our insider look at the top takeaways from the 2015 Hadoop Summit:

1. Have no fear.

Yes, big data is here to stay.  And the opportunities to be gained are too great to let fear of failure guide your organization’s actions. David T. Lin, leader and evangelist of cloud platform engineering for Symantec, summed it up well: “Kill the fear. Haters to the left. Get it started and go.”

2. Take it step by step.

There’s an abundance of paths you can take to use and derive insights from your data.  Start small and scale. Hemal Gandhi, director of data engineering at One Kings Lane, said a good way to do that is to think like a startup, which often runs on innovation and agility. “There are lots of challenges in building highly scalable big data platforms … we took an approach that allows us to build a scalable data platform rapidly.”

 

3. Use predictive analytics.

Predictive analytics are worth the risk because they help uncover an organization’s next-best action to progress toward a goal. Alexander Gray, CTO of Skytree, discussed the benefits of “bigger” data and how those benefits can be quantified—in dollar terms. Because data size is a basic lever for predictive power, Gray said, “increasing business value is achieved by increasing predictive power.”

4. Personalize customer experiences.

Siloed applications combined in the Lambda architecture allow you to give your customers an experience that is tailored to their needs. Russell Foltz-Smith, vice president of data platform at TrueCar, said his system allows his company to accurately identify, assess value, predict and prescribe “who, what and where,” giving customers the transparency they’re increasingly demanding. “We need to make everything easily accessible,” Foltz-Smith said. “We are moving to a contextually aware, intelligent search engine. You have to open it up and let people forage through your data to find what they need.”

Were you able to attend the Hadoop Summit or follow it online? What lessons did you take away from the event? Share your top Hadoop Summit insights in the comments below.

Regulating Data Lake Temperature

Posted on: June 15th, 2015 by Mark Cusack

 

By Mark Cusack, Chief Architect, Teradata RainStor

One of the entertaining aspects of applying physical analogies to data technology is seeing how far you can push the analogy before it falls over or people get annoyed.  In terms of analogical liberties, I’d suggest that the data lake occupies the number one spot right now.  It’s almost mandatory to talk of raw data being pumped into a data lake, of datamarts drawing on filtered data from a lakeside location, and of data scientists plumbing the data depths for statistical insight.

This got me thinking about what other physical processes affecting real lakes I could misappropriate.  I am a physicist, so I’ll readily misuse physical phenomena and processes to help illustrate logical processes if I think I can get away with it.  There are two important processes in real lakes that are worth bending out of shape to fit our illustrative needs. These are stratification and turnover.

Data Stratification

Let’s look at stratification first.  During the summer months, the water at the surface of a proper lake heats up, providing a layer of insulation to the colder waters below, which results in layers of water with quite distinct densities and temperatures.  Right away we can adopt the notion of hot and cold data as stratified layers within our data lake.  This isn’t a completely terrible analogy, as the idea of data temperature based on access frequency is well established, and Teradata has been incorporating hot and cold running data storage into its Integrated Data Warehouse for a while now.

Storing colder data is something we’re focused on at Teradata RainStor too.  One of RainStor’s use cases involves offloading older, colder data from a variety of RDBMS in order to buy back capacity from those source systems.  RainStor archives the low temperature data in a highly compressed – dense – form in a data lake, while still providing full interactive query access to the offloaded data. In this use case, RainStor is deployed in a secondary role behind one or more primary RDBMS.  Users can query this cold layer of data in RainStor directly via RainStor’s own parallel SQL query engine.  In addition, Teradata Integrated Data Warehouse users are able to efficiently query data stored in RainStor running on Hadoop via the Teradata® QueryGrid™.
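
To make the offload pattern concrete, here is a minimal sketch of a query that spans the two tiers. All of the names are hypothetical (a QueryGrid foreign server called rainstor_hdp, a warm table cdr_current in the warehouse, a cold table cdr_archive in RainStor on Hadoop), and the exact QueryGrid setup varies by version.

```sql
-- Hypothetical names throughout: rainstor_hdp is a QueryGrid foreign server pointing at
-- RainStor on Hadoop; cdr_archive holds offloaded cold records; cdr_current holds recent
-- records in the Integrated Data Warehouse. The @server suffix routes that branch of the
-- query through QueryGrid, so the cold data is read in place rather than re-loaded.
SELECT subscriber_id, call_ts, duration_sec
FROM   cdr_current
WHERE  call_ts >= DATE '2015-01-01'
UNION ALL
SELECT subscriber_id, call_ts, duration_sec
FROM   cdr_archive@rainstor_hdp
WHERE  call_ts <  DATE '2015-01-01';
```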

Increasingly, however, RainStor is being deployed on a data lake as more than just an archive for cold data.  It’s being deployed as the system-of-record for structured data – as the primary repository for a mix of data of different temperatures and from different sources, all stored with original fidelity.   The common feature of this mixed data is that it doesn’t change, and so it fits in well with RainStor’s immutable data model, which can store and manage data on Hadoop and also on compliance-oriented WORM devices.

Data Turnover

The mixing of the data layers in the system-of-record use case is analogous to the turnover process in real lakes.  In winter months the upper layers of water cool and descend, displacing deeper waters to cause a mixing or turnover of the lake.  The turnover process is important in a watery lake as it mixes oxygen-poor water lower down with oxygen-rich surface water, supporting the ecosystem at all lake depths.

The lack of data stratification in a data lake is also important since one data scientist’s cold data is another one’s hot data.  By providing the same compression, SQL query, security and data life-cycle management capabilities to all data stored in RainStor, a data scientist pays no penalty for accessing the raw data in whatever way they choose to, be it through RainStor’s own SQL engine, Hive, Pig, MapReduce, HCatalog, or via the QueryGrid.

I’ve stretched the data lake metaphor to its limits in this post. The serious point is that data lakes are no longer seen as being supplied from a single operational source, as per the original definition.  They may be fed from a range of sources, with the data itself varying in structure.  Not only is schema flexibility a requirement for many data scientists, so too is the need for equally fast access to all data in the lake, free from the data temperature prejudices that might exist in upstream systems.

 

Mark Cusack joined Teradata in 2014 as part of its RainStor acquisition. As a founding developer and Chief Architect at RainStor, he has worked on many different aspects of the product since 2004. Most recently, Mark led the efforts to integrate RainStor with Hadoop and with Teradata. He holds a Master’s in computing and a PhD in physics.

 

It happens every few years and it’s happening again. A new technology comes along and a significant segment of the IT and business community want to toss out everything we’ve learned over the past 60 years and start fresh. We “discover” that we’ve been wasting time applying unnecessary rigor and bureaucracy to our projects. No longer should we have to wait three to six months or longer to deliver technical solutions to the business. We can turn these things around in three to six days or even less.

In the mid-1990s, I was part of a team that developed a “pilot” object-oriented, client-server (remember when these were the hot buzzwords?) application to replenish raw materials for a manufacturing function. We were upending the traditional mainframe world by delivering a solution quickly and iteratively with a small team. When the end users started using the application in real life, it was clear they were going to rely on it to do their jobs every day. Wait, was this a pilot or…? I would come into work in the morning, walk into a special room that housed the application and database servers, check the logs, note any errors, make whatever fixes needed to be made, re-run jobs, and so on.

It wasn’t long before this work began to interfere with my next project, and the end users became frustrated when I wasn’t available to fix problems quickly. It took us a while and several conversations with operations to determine that “production” didn’t just mean “the mainframe”. “Production” meant that people were relying on the solution on a regular basis to do their jobs. So we backtracked and started talking about what kind of availability guarantees we could make, how backup and recovery should work, how we could transition monitoring and maintenance to operations, and so on. In other words, we realized what we needed was a traditional IT project that just happened to leverage newer technologies.

This same scenario is happening today with Hadoop and related tools. When I visit client organizations, a frightening number will have at least one serious person saying something like, “I really don’t think ‘data warehousing’ makes sense any more. It takes too long. We should put all our data in Hadoop and let our end users access whatever they want.” It is indeed a great idea to establish an environment that enables exploration and quick-turnaround analysis against raw data and production data. But to position this approach as a core data and analytics strategy is nothing short of professional malpractice.

The problem is that people are confusing experimentation with IT projects. There is a place for both, and there always has been. Experimentation (or discovery, research, ad-hoc analysis, or whatever term you wish to use) should have lightweight processes and data management practices – it requires prioritization of analysis activity, security and privacy policies and implementation, some understanding of available data, and so on, but it should not be overburdened with the typical rigor required of projects that are building solutions destined for production. Once a prototype is ready to be used on a regular basis for important business functions, that solution should be built through a rigorous IT project leveraging an appropriate – dare I say it – solution development life cycle (SDLC), along with a comprehensive enterprise architecture plan including, yes, a data warehouse that provides integrated, shared, and trusted production data.

An experimental prototype should never be “promoted” to a production environment. That’s what a project is for. Experimentation can be accomplished with Hadoop, relational technology, Microsoft Office, and many other technologies. These same technologies can also be used for production solutions. So, it’s not that “things are done differently and more quickly in Hadoop”. Instead, it’s more appropriate to say that experimentation is different than an IT project, regardless of technology.

Yes, we should do everything we can to reduce unnecessary paperwork and to speed up delivery using proper objective setting, scoping, and agile development techniques. But that is different than abandoning rigor altogether. In fact, using newer technologies in IT projects requires more attention to detail, not less, because we have to take the maturity of the technology into consideration. Can it meet the service level needs of a particular solution? This needs to be asked and examined formally within the project.

Attempting to build production solutions using ad-hoc, experimental data preparation and analysis techniques is like building a modern skyscraper with a grass hut mentality. It just doesn’t make any sense.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries. 

Taking Charge of Your Data Lake Destiny

Posted on: April 30th, 2014 by Data Analytics Staff

 

One of the most interesting areas of my job is having the opportunity to take an active role in helping shape the future of big data analytics. A significant part of this is the maturation of the open source offerings available to customers and how they can help address today’s analytic conundrums. Customers are constantly looking for new and effective ways to organize their data and they want to build the systems that will empower them to be successful across their organizations. But with the proliferation of data and the rise of data types and analytical models, solving this challenge is becoming increasingly complex.

One of the solutions that has become popular is the concept of a data lake. The idea emerged when users were creating new types of data that needed to be captured and exploited across the enterprise. The concept is tied quite closely to Apache Hadoop and its ecosystem of open source projects, so, since two of my main focus areas (big data analytics and Hadoop) are being brought together here, this is an area I watch closely. Data lakes are designed to tackle some of the emerging big data challenges by offering a new way to organize and build the next generation of systems, and they provide a cost-effective and technologically refined way to approach and solve big data problems. While data lakes are an important component of the logical data warehouse – designed to give users choices for managing and utilizing data within their analytical ecosystem – many users are also finding that the data lake is an obvious evolution of their current Apache Hadoop ecosystem and their existing data architecture.

Where do we begin? Quite simply, several questions need to be answered before you start down this path. For instance, it’s important to understand how the data lake is related to your existing enterprise data warehouse, how they work together, and quite possibly the most important question is “What best practices should be leveraged to ensure the resulting strategy drives business value?”

A recent white paper written by CITO Research and sponsored by Teradata and Hortonworks takes a close look at the data lake and provides answers to all of the above questions, and then some. Without giving away too much of the detail, I thought I would capture a few of the points that impress me most in this paper.

The data lake has come a long way since its initial entry onto the big data scene. Its first iteration included several limitations, making it slightly daunting for general users. The original data lakes were batch-oriented, offering very limited ability for users to interact with the data, and expertise with MapReduce and other scripting and query tools was absolutely necessary for success. Those factors, among others, limited its adoption. Today, however, the landscape is changing. With the arrival of Hadoop 2, and more specifically the Hortonworks 2.1 release, data lakes are evolving. New Hadoop projects brought better resource management and application multi-tenancy, allowing multiple workloads on the same cluster and enabling users from different business units within an organization to effectively refine, explore, and enrich data. Today, enterprise Hadoop is a full-fledged data lake, with new capabilities being added all the time.

While the capabilities of the data lake evolved over the last few years, so did the world of big data. Companies everywhere started creating data lakes to complement the capabilities of their data warehouses, but they must now also tackle creating a logical data warehouse in which the data lake and the enterprise data warehouse are each maximized individually and yet support each other in the best way possible.

The enterprise data warehouse plays a critical role in solving big data challenges, and together with the data lake, it can deliver real business value. The enterprise data warehouse is a highly engineered, sophisticated system that provides a single version of the truth that can be used over and over again. Like a data lake, it supports batch workloads. Unlike a data lake, it also supports simultaneous use by thousands of concurrent users performing reporting and analytic tasks.

There are several impressive uses for a data lake and several beneficial outcomes can result. It is very worthwhile to learn more about data lakes and how they can help you to store and process data at low cost. You can also learn how to create a distributed form of analytics, or learn how the data lake and the enterprise data warehouse have started to work together as a hybrid, unified system that empowers users to ask questions that can be answered by more data and more analytics with less effort. To start learning about these initiatives, download our whitepaper here.

By Cesar Rojas - bio link 

Take a Giant Step with Teradata QueryGrid

Posted on: April 29th, 2014 by Dan Graham

 

Teradata 15.0 has gotten tremendous interest from customers and the press because it enables SQL access to native JSON data. This heralds the end of the belief that data warehouses can’t handle unstructured data. But there’s an equally momentous innovation in this release called Teradata QueryGrid.

What is Teradata QueryGrid?
In Teradata’s Unified Data Architecture (UDA), there are three primary platforms: the data warehouse, the discovery platform, and the data platform. In the UDA diagram, huge gray arrows represent data flowing between these systems. A year or two ago, those arrows were extract files moved in batch mode.

Teradata QueryGrid is both a vision and a technology. The vision, simply said, is that a business person connected to the Teradata Database or Aster Database can submit a single SQL query that joins data from a second system for analysis. There’s no need to plead with the programmers to extract data and load it into another machine. The business person doesn’t have to care where the data is – they can simply combine relational tables in Teradata with tables or flat files found in Hadoop on demand. Imagine a data scientist working on an Aster discovery problem and needing data from Hadoop. By simply adjusting the queries she is already using, Hadoop data is fetched and combined with tables in the Aster Database. That should be a huge “WOW” all by itself, but let’s look further.

You might be saying “That’s not new. We’ve had data virtualization queries for many years.” Teradata QueryGrid is indeed a form of data virtualization. But Teradata QueryGrid doesn’t suffer from the normal limitations of data virtualization such as slow performance, clogged networks, and security concerns.

Today, the vision is translated into reality as connections between Teradata Database and Hadoop as well as Aster Databases and Hadoop. Teradata QueryGrid also connects the Teradata Data Warehouse to Oracle databases. In the near future, it will extend to all combinations of UDA servers such as Teradata to Aster, Aster to Aster, Teradata to Teradata, and so on.

Seven League Boots for SQL
With QueryGrid, you can add a clause in a SQL statement that says “Call up Hadoop, pass Hive a SQL request, receive the Hive results, and join it to the data warehouse tables.” Running a single SQL statement spanning Hadoop and Teradata is amazing in itself – a giant step forward. Notice too that all the database security, advanced SQL functions, and system management in the Teradata or Aster system is supporting these queries. The only effort required is for the database administrator to set up a “view” that connects the systems. It’s self-service for the business user after that. Score: complexity zero, business users one.
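
A hedged sketch of that division of labor, with made-up object names (hdp_prod for the QueryGrid foreign server, web_clicks for a Hive table): the DBA wraps the remote table in an ordinary view once, and business users then query it like any local table.

```sql
-- One-time DBA setup (hypothetical names): expose a Hive table reached through the
-- QueryGrid foreign server hdp_prod as an ordinary database view.
CREATE VIEW analytics.v_web_clicks AS
SELECT session_id, page_url, click_ts
FROM   web_clicks@hdp_prod;            -- @hdp_prod routes the request through QueryGrid

-- Self-service afterwards: a business user joins Hadoop click data with warehouse data
-- in one SQL statement, with no extract-and-load step in between.
SELECT p.customer_segment,
       COUNT(*) AS clicks
FROM   analytics.v_web_clicks v
JOIN   edw.session_profile    p
  ON   v.session_id = p.session_id
GROUP  BY p.customer_segment;
```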

Parallel Performance, Performance, Performance
Historically, data virtualization tools lack the ability to move data between systems in parallel. Such tools send a request to a remote database and the data comes back serially through an Ethernet wire. Teradata QueryGrid is built to connect to remote systems in parallel and exchange data through many network connections simultaneously. Wanna move a terabyte per minute? With the right configurations it can be done. Parallel processing by both systems makes this incredibly fast. I know of no data virtualization system that does this today.

Inevitably, the Hadoop cluster will have a different number of servers than the Teradata or Aster MPP systems. The Teradata and Aster systems start the parallel data exchange by matching up units of parallelism between the two systems. That is, all the Teradata parallel workers (called AMPs) connect to a buddy Hadoop worker node for maximum throughput. Anytime the configuration changes, the worker match-up changes with it. This is non-trivial, rocket-science class technology. Trust me – you don’t want to do this yourself, and the worst situation would be to expose this to the business users. But Teradata QueryGrid does it all for you, completely invisible to the user.

Put Data in the Data Lake FAST
Imagine that complex predictive analytics using R® or SAS® are run inside the Teradata data warehouse as part of a merger and acquisition project. In this case, we want to pass this data to the Hadoop data lake, where it is combined with temporary data from the company being acquired. With moderately simple SQL stuffed in a database view, the answers calculated by the Teradata Database can be sent to Hadoop to help finish up some reports. Bi-directional data exchange is another breakthrough in Teradata QueryGrid, new in release 15.0. The common thread in all these innovations is that the data moves from the memory of one system to the memory of the other. No extracts, no landing the data on disk until the final processing step – and sometimes not even then.
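
A rough sketch of what that bi-directional flow might look like from the SQL side, again with hypothetical names (hdp_prod, acquisition_scores, churn_model_output) and with the caveat that the exact export syntax depends on the QueryGrid version:

```sql
-- Hedged sketch: push model output computed in the data warehouse out to a Hive table
-- on the Hadoop data lake via QueryGrid, memory to memory, with no intermediate
-- extract file landing on disk. All object names here are hypothetical.
INSERT INTO acquisition_scores@hdp_prod        -- Hive target reached through QueryGrid
SELECT account_id,
       churn_score,
       CURRENT_DATE AS scored_dt
FROM   edw.churn_model_output
WHERE  churn_score > 0.8;
```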

Push-down Processing
What we don’t want to do is transfer terabytes of data from Hadoop and throw away 90% of it since it’s not relevant. To minimize data movement, Teradata QueryGrid sends the remote system SQL filters that eliminate records and columns that aren’t needed. An example constraint could be “We only want records for single women age 30-40 with an average account balance over $5000. Oh, and only send us the account number, account type, and address.” This way, the Hadoop system discards unnecessary data so it doesn’t flood the network with data that will be thrown away. After all the processing is done in Hadoop, data is joined in the data warehouse, summarized, and delivered to the user’s favorite business intelligence tool.
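
Continuing the same example, here is a sketch of how that constraint might be expressed. The filter predicates and the short select list travel with the request, so the Hadoop side discards the other rows and columns before anything crosses the network (table and server names are hypothetical):

```sql
-- Hypothetical names: accounts_hdp is a Hadoop-resident table reached via the QueryGrid
-- foreign server hdp_prod. The WHERE clause and the three-column select list are pushed
-- down to Hadoop, so only qualifying records come back to the warehouse for the join.
SELECT a.account_nbr,
       a.account_type,
       a.address
FROM   accounts_hdp@hdp_prod a
WHERE  a.marital_status = 'S'
  AND  a.gender         = 'F'
  AND  a.age BETWEEN 30 AND 40
  AND  a.avg_balance    > 5000;
```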

Teradata QueryGrid delivers some important benefits:
• It’s easy to use: any user with any BI tool can do it
• Low DBA labor: it’s mostly setting up views and testing them once
• High performance: reducing hours to minutes means more accuracy and faster turnaround for demanding users
• Cross-system data on demand: don’t get stuck in the programmers’ work queue
• Teradata/Aster strengths: security, workload management, system management
• Minimum data movement improves performance and reduces network use
• Move the processing to the data

Big data is now taking giant steps through your analytic architecture: frictionless, invisible, and in parallel. Nice boots!

LA kicks off the 2014 Teradata User Group Season

Posted on: April 22nd, 2014 by Guest Blogger

 

By Rob Armstrong,  Director, Teradata Labs Customer Briefing Team

After presenting for years at the Teradata User Group meetings, it was refreshing to see some changes in this roadshow. While I had my usual spot on the agenda to present Teradata’s latest database release (15.0), the agenda included hot new topics such as Cloud and Hadoop, more business-level folks were in the room, more companies were researching Teradata’s technology (vs. just current users), and a hands-on workshop the following day let the more technically inclined walk through real-world Unified Data Architecture™ (UDA) use cases of a Teradata customer. While LA tends to be a smaller venue than most, the room was packed and we had 40% more attendees than last year.

I would be remiss if I did not give a big thanks to the partner sponsors of the user group meeting. In LA we had Hortonworks and Dot Hill as our gold and silver sponsors. I took a few minutes to chat with them and found out some interesting upcoming items. Most notably, Lisa Sensmeier from Hortonworks talked to me about Hadoop Summit, which is coming up in San Jose, June 3-5. Jim Jonez, from Dot Hill, gave me the latest on their newest “Ultra Large” disk technology, where they’ll have 48 one-terabyte drives in a single 2U rack. It is not in the Teradata lineup yet, but we are certainly intrigued for the proper use case.

Now, I’d like to take a few minutes to toot my own horn about the Teradata Database 15.0 presentation that had some very exciting elements to help change the way users get to and analyze all of their data.  You may have seen the recent news releases, but if not, here is a quick recap:

  • Teradata 15.0 continues our Unified Data Architecture with the new Teradata QueryGrid.  This is the new environment for defining and accessing data from Teradata on other data servers such as Apache Hadoop (Hortonworks), the Teradata Aster Discovery Platform, Oracle, and others, and it lays the foundation for extending to even more foreign data servers.  15.0 simplifies the whole definition and usage and adds bi-directional data movement and predicate pushdown.  In a related session, Cesar Rojas provided some good recent examples of customers taking advantage of the entire UDA ecosystem, where data from all of the Teradata offerings was used together to generate new actions.
  • The other big news in 15.0 is the inclusion of the JSON data type.  This allows customers to store JSON documents directly in a column and then apply “schema on read” for much greater flexibility with greatly reduced IT effort.  As the JSON documents change, no table or database changes are necessary to absorb the new content (see the sketch after this list).
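
As a rough illustration of what that looks like in practice (the table and JSON field names below are made up; see the Teradata 15.0 documentation for the full JSON syntax):

```sql
-- Hypothetical example of the Teradata 15.0 JSON data type: the document is stored
-- as-is in a JSON column, with an optional maximum length.
CREATE TABLE web_events (
    event_id INTEGER,
    payload  JSON(32000)
);

-- "Schema on read": fields are referenced with dot notation at query time, so a new
-- attribute appearing in the documents requires no ALTER TABLE or reload.
SELECT payload.device.os   AS device_os,
       payload.geo.country AS country,
       COUNT(*)            AS events
FROM   web_events
GROUP  BY 1, 2;
```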

Keep your eyes and ears open for the next Teradata User Group event coming your way, or better yet, just go to the webpage: http://www.teradata.com/user-groups/ to see where the bus stops next and to register.  The TUGs are free of charge.  Perhaps we’ll cross paths as I make the circuit? Until then, ‘Keep Calm and Analyze On’ (as the cool kids say).

 Since joining Teradata in 1987, Rob Armstrong has worked in all areas of the data warehousing arena.  He has gone from writing and supporting the database code to implementing and managing systems at some of Teradata’s largest and most innovative customers.  Currently Rob provides sales and marketing support by traveling the globe and evangelizing the Teradata solutions.

 

The best Strata session that I attended was the overview Kurt Brown gave of the Netflix data platform, which contained hype-deflating lessons and many chestnuts of tech advice straight from one of the most intense computing environments on the planet.

Brown, who as a director leads the design and implementation of the data platform, had a cheerful demeanor but demonstrated ruthless judgment and keen insight in his assessment of how various technologies serve the goals of Netflix. It was interesting to me how dedicated he was to both MPP SQL technology and to Apache™ Hadoop.

I attended the session with Daniel Graham, Technical Marketing Specialist of Teradata, who spoke with me afterward about the implications of the Netflix architecture and Brown’s point of view.

SQL vs. Hadoop
Brown rejected the notion that it was possible to build a complete data platform exclusively using either SQL technology or Hadoop alone. In his presentation, Brown explained how Netflix made great use of Hadoop, used Hive for various purposes, and had an eye on Presto, but also couldn’t live without Teradata and MicroStrategy.

Brown recalled a conversation in which another leader of a data platform explained that he was discarding all his data warehouse technology and going to put everything on Hive. Brown’s response, “Why would you ever want to do that?”

While Brown said he enjoyed the pressure that open source puts on commercial vendors to improve, he was dedicated to using whatever technology could provide answers to questions in the most cost-effective manner. Brown said he was especially pleased that Teradata was going to be able to support a cloud-based implementation that could run at scale. Brown said that Netflix had upwards of 5 petabytes of data in the cloud, all stored on Amazon S3.

After the session, I pointed out to Graham that the pattern in evidence at Netflix, and at most of the companies acknowledged as leaders in big data, mimics the recommendation of the white paper “Optimize the Value of All Your Enterprise Data,” which provides an overview of the Teradata Unified Data Architecture™.

The Unified Data Architecture recommends that the data with the most “business value density” be stored in an enterprise data warehouse powered by MPP SQL. This data is used most often by the most users. Hadoop is used as a data refinery to process flat files or NoSQL data in batch mode.

Netflix is a big data company that arrived at this pattern by adding SQL to a Hadoop infrastructure. Many well-known users of huge MPP SQL installations have added Hadoop.

“Data doesn’t stay unstructured for long. Once you have distilled it, it usually has a structure that is well-represented by flat files,” said Teradata's Graham. “This is the way that the canonical model of most enterprise activity is stored. Then the question is: How you ask questions of that data? There are numerous ways to make this easy for users, but almost all of those ways pump out SQL that then is used to grab the data that is needed.”

Replacing MPP SQL with Hive or Presto is a non-starter because to really support hundreds or thousands of users who are pounding away at a lot of data, you need a way to provide speedy and optimized queries and also to manage the consumption of the shared resources.

“For over 35 years, Teradata has been working on making SQL work at scale for hundreds or thousands of people at a time,” said Graham. “It makes perfect sense to add SQL capability to Hadoop, but it will be a long time, perhaps a decade or more, before you will get the kind of query optimization and performance that Teradata provides. The big data companies use Teradata and other MPP SQL systems because they are the best tool for the job for making huge datasets of high business value density available to an entire company.”

Efforts such as Tez and Impala will clearly move Hive’s capability forward. The question is how far forward and how fast. We will know that victory has been achieved when Netflix, which uses Teradata in a huge cloud implementation, is able to support their analytical workloads with other technology.

Graham predicts that in 5 years, Hadoop will be a good data mart but will still have trouble with complex parallel queries.

“It is common for a product like Microstrategy to pump out SQL statements that may be 10, 20, or even 50 pages long,” said Graham. “When you have 5 tables, the complexity of the queries could be 5 factorial. With 50 tables, that grows to 50 factorial. Handling such queries is a 10- or 20-year journey. Handling them at scale is a feat that many companies can never pull off.”

Graham acknowledges that an MPP SQL data warehouse extended to support data discovery (e.g., the Teradata Aster Discovery Platform), along with extensions for using Hadoop and graph analytics through enhanced SQL, is needed by most businesses.

Teradata is working to demonstrate that the power of this collection of technology can address some of the unrealistic enthusiasm surrounding Hadoop.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

 

In years past, Strata has celebrated the power of raw technology, so it was interesting to note how much the keynotes on Wednesday focused on applications, models, and how to learn and change rather than on speeds and feeds.

After attending the keynotes and some fascinating sessions, it seems clear that the blinders are off. Big data and data science have been proven in practice by many innovators and early adopters. The value of new forms of data and methods of analysis are so well established that there’s no need for exaggerated claims. Hadoop can do so many cool things that it doesn’t have to pretend to do everything, now or in the future. Indeed, the pattern in place at Facebook, Netflix, the Obama Campaign, and many other organizations with muscular data science and engineering departments is that MPP SQL and Hadoop sit side by side, each doing what they do best.

In his excellent session, Kurt Brown, Director, Data Platform at Netflix, recalled someone explaining that his company was discarding its data warehouse and putting everything on Hive. Brown responded, “Why would you want to do that?” What was obvious to Brown, and what he explained at length, is that the most important thing any company can do is assemble technologies and methods that serve its business needs. Brown demonstrated the logic of creating a broad portfolio that serves many different purposes.

Real Value for Real People
The keynotes almost all celebrated applications and models. Vendors didn’t talk about raw power, but about specific use cases and ease-of-use. Farrah Bostic, a marketing and product design consultant, recommended ways to challenge assumptions and create real customer intimacy. This was a key theme: Use the data to understand a person in their terms not yours. Bostic says you will be more successful if you focus on creating value for the real people who are your customers instead of extracting value from some stilted and limited model of a consumer. A skateboarding expert and a sports journalist each explained models and practices for improving performance. This is a long way from the days when a keynote would show a computer chewing through a trillion records.

Geoffrey Moore, the technology and business philosopher, was in true provocative form. He asserted that big data and data science are well on their way to crossing the chasm because so many upstarts pose existential threats to established businesses. This pressure will force big data to cross the chasm and achieve mass adoption. His money quote: "Without big data analytics, companies are blind and deaf, wandering out onto the Web like deer on the freeway.”

An excellent quote to be sure, but it goes too far. Moore would have been more accurate and less sensational if he said, “Without analytics,” not “Without big data analytics.” The reason that MPP SQL and Hadoop have made such a perfect pair is because more than one type of data and method of analysis is needed. Every business needs all the relevant data it can get to understand the people it does business with.

The Differentiator: A Culture of Analytics
The challenge I see companies facing lies in creating a culture of analytics. Tom Davenport has been a leader in promoting analytics as a means to competitive advantage. In his keynote at Strata Rx in September 2013, Davenport stressed the importance of integration.

In his session at Strata this year, Bill Franks, Chief Analytics Officer at Teradata, put it quite simply, "Big data must be an extension of an existing analytics strategy. It is an illusion that big data can make you an analytics company."

When people return from Strata and roll up their sleeves to get to work, I suspect that many will realize that it’s vital to make use of all the data in every way possible. But one person can only do so much. For data to have the biggest impact, people must want to use it. Implementing any type of analytics provides supply. Leadership and culture create demand. Companies like CapitalOne and Netflix don’t do anything without looking at the data.

I wish there were a shortcut to creating a culture of analytics, but there isn’t, and that’s why it’s such a differentiator. Davenport’s writings are probably the best guide, but every company must figure this out based on its unique situation.

Supporting a Culture of Analytics
If you are a CEO, your job is to create a culture of analytics so that you don’t end up like Geoffrey Moore’s deer on the freeway. But if you have Kurt Brown’s job, you must create a way to use all the data you have, to use the sweet spot of each technology to best effect, and to provide data and analytics to everyone who wants them.

At a company like Netflix or Facebook, creating such a data supply chain is a matter of solving many unique problems connected with scale and advanced analytics. But for most companies, common patterns can combine all the modern capabilities into a coherent whole.

I’ve been spending a lot of time with the thought leaders at Teradata lately and closely studying their Unified Data Architecture. Anyone who is seeking to create a comprehensive data and analytics supply chain of the sort in use at leading companies like Netflix should be able to find inspiration in the UDA, as described in a white paper called “Optimizing the Business Value of All Your Enterprise Data.”

The paper does excellent work in creating a framework for data processing and analytics that unifies all the capabilities by describing four use cases: the file system, batch processing, data discovery, and the enterprise data warehouse. Each of these use cases focuses on extracting value from different types of data and serving different types of users. The paper proposes a framework for understanding how each use case creates data with different business value density. The highest volume interaction takes place with data of the highest business value density. For most companies, this is the enterprise data warehouse, which contains a detailed model of all business operations that is used by hundreds or thousands of people. The data discovery platform is used to explore new questions and extend that model. Batch processing and processing of data in a file system extract valuable signals that can be used for discovery and in the model of the business.

While this structure doesn’t map exactly to that of Netflix or Facebook, for most businesses, it supports the most important food groups of data and analytics and shows how they work together.

The refreshing part of Strata this year is that thorny problems of culture and context are starting to take center stage. While Strata will always be chock full of speeds and feeds, it is even more interesting now that new questions are driving the agenda.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media