Unified Data Architecture

The Benefits and Evolution of the Hadoop Appliance

Posted on: July 9th, 2015 by Chris Twogood

 

Running Hadoop on an appliance offers significant benefits, but as Hadoop workloads become more sophisticated, so too must the appliance. That’s exactly why we’re releasing the ‘new’ Teradata Appliance for Hadoop 5. Our new appliance has evolved alongside Hadoop usage scenarios while giving IT organizations more freedom of choice to run diverse workloads. Running Hadoop on an appliance makes more sense than ever before.

If you’re running – or thinking about running – Hadoop on an appliance, you’re not alone. According to an ESG survey reported on by SearchDataCenter.com, 21% of IT organizations are considering dedicated analytics appliances. That’s the same percentage of organizations that are considering public cloud solutions and double those considering a public/private hybrid deployment. What is driving the adoption of Hadoop appliances?

5 Key Benefits of Running Hadoop on an Appliance

Organizations that choose to deploy Hadoop on an appliance versus rolling out their own solution realize five important benefits.

  1. Hadoop is delivered ready to run.

We’ve heard industry experts say that it can take IT organizations six to eight months to roll out a Hadoop implementation on their own. With a Teradata appliance, we’ve done the hard work of installing and configuring the software stack, the operating system, the network, and the like. You simply plug it in, and within days you are up and running.

  2. We’ve built high availability into our Hadoop appliances.

The Teradata Vital Infrastructure (TVI) proactively detects and resolves incidents. In fact, up to 72% of all hardware- and software-related incidents are detected and resolved by TVI before the customer even knows about them. We also run BYNET over InfiniBand, which delivers automated network load balancing, automated network failover, redundancy across two active fabrics, and multiple levels of network isolation. These features in Teradata Appliance for Hadoop 5 deliver the high availability IT organizations need in an enterprise-grade solution.

  3. It is Unified Data Architecture ready.

It’s not enough to just efficiently deploy Hadoop. IT organizations must be able to efficiently deploy Hadoop as a seamless part of an interconnected analytics ecosystem. The UDA-ready Hadoop appliance becomes an integral part of the organization’s larger data fabric, with BYNET over InfiniBand interconnect between Hadoop, the Integrated Data Warehouse and Aster big data analytics, and software integration such as QueryGrid, Viewpoint, TDCH, and Smart Loader.

  4. Single vendor support.

An appliance replaces the multiple support contracts IT organizations have with their hardware provider, Hadoop vendor, OS vendor, and various utilities with a single “hand to shake.” If there’s any problem, one phone call puts you in touch with Teradata’s world-class, 24/7, multi-language support for the entire solution stack. IT organizations see increasing value in this benefit because the Hadoop ecosystem has many moving parts, and single vendor support provides peace of mind.

  5. Running Hadoop on an appliance lowers your total cost of ownership (TCO).

The cost of a Hadoop deployment includes much more than the hardware the software runs on. There are also costs for configuring the network, installing the OS, configuring the disks, and installing, tuning, and testing the Hadoop environment. The costs of doing all this work internally add up, making the TCO of an appliance even more attractive.

What’s New with Teradata Appliance for Hadoop 5?

In addition to these five benefits, Teradata Appliance for Hadoop 5 delivers freedom of choice to run a variety of workloads. IT organizations now have more options when they run Hadoop on the Teradata Appliance for Hadoop 5.

Recognizing that Hadoop workloads are diverse and evolving, Teradata Appliance for Hadoop 5 is available in three flexible configurations, enabling customers to select the configuration that best fits their workloads.

  • Performance configuration. For real-time processing and other workloads that require significant CPU, I/O, and memory, we offer the performance configuration. This computationally intensive configuration enables organizations to run emerging Hadoop workloads such as streaming, Spark, and SQL on Hadoop. It offers more cores per node (24), 512GB of RAM, and 24 storage disks with 1.2TB drives.
  • Capacity configuration. The capacity configuration allows IT organizations to drive down the cost per terabyte. It is designed for heavy-duty, long-running batch jobs as well as long-term archival and storage. It comes with 128GB to 256GB of RAM and 4TB disk drives.
  • Balance configuration. The balance configuration sits between the performance and capacity configurations, allowing IT organizations to strike the right balance for ETL and analytics jobs. It features 24 cores per node and 4TB capacity drives.

Learn more about Teradata’s Portfolio for Hadoop.

 

Zoomed-in view of Data Analytics Graph
(Healthcare Example)


In the first part of this two part blog series, I discussed the competitive importance of cross-functional analytics [1]. I also proposed that by treating Data and Analytics as a network of interconnected nodes in Gephi [2], we can examine a statistical metric for analytics called Degree Centrality [3]. In this second part of the series I will now examine parts of the sample Healthcare industry graph animation in detail and draw some high level conclusions from the Degree Centrality measurement for analytics.

In this sample graph [4], link analysis was performed on a network of 3428 nodes and 8313 directed edges. The majority of the nodes represent either Analytics or Source Data Elements. Many Analytics in this graph require data from multiple source systems, resulting in cross-functional Degree Centrality (connectedness). Some Analytics in this study display more Degree Centrality than others.
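The study's graph was built in Gephi, but the Degree Centrality idea is easy to reproduce on a toy example. The sketch below uses Python and networkx (an assumption of this illustration, not the tool used for the study), with hypothetical node names standing in for the actual analytics and data elements; edges point from an analytic to each source data element it consumes.

```python
# Illustrative sketch only: hypothetical analytics and data elements, not the study data.
import networkx as nx

G = nx.DiGraph()

# Edges point from an analytic to the source data elements it consumes.
edges = [
    ("Readmission risk analytic", "Clinical*PRODUCT-Product Id"),
    ("Readmission risk analytic", "Medical Claims*CLAIM-Claim Num"),
    ("Cost of care PMPM analytic", "Medical Claims*CLAIM-Claim Num"),
    ("Cost of care PMPM analytic", "Membership*MEMBER-Agreement Id"),
    ("Cost of care PMPM analytic", "Pharmacy*RX-Fill Id"),
]
G.add_edges_from(edges)

# Degree = number of links incident on a node (see footnote [3]).
for node, degree in sorted(G.degree, key=lambda kv: kv[1], reverse=True):
    print(f"{degree:2d}  {node}")

# In-degree of a data element = number of analytics that use it; data elements
# with high in-degree are the candidates for tighter coupling discussed below.
for node, indeg in sorted(G.in_degree, key=lambda kv: kv[1], reverse=True):
    if indeg > 0:
        print(f"{indeg:2d}  {node}")
```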

The zoomed-in visualization starts with a single source system (green) and its data elements (cyan). Basic function-specific analytics (red) can be performed with this single Clinical source system's data. Even advanced analytics (Text Analysis) can be applied to this single source of data to yield function-specific insights.

But data and business functions never exist in isolation. Cross-functional analytics usually emerge as users look to gain additional value by combining data from various source systems. Notice how these new analytics use data from source systems in multiple functional areas, such as Claims and Membership. Such cross-functional data combination, or data coupling, can now be supported at various levels of sophistication. For instance, data can be loosely coupled for analysis with data virtualization, or, if requirements dictate, it can be tightly coupled within a relational Integrated Data Warehouse.

As shown in the graph, even advanced analytics such as Time Series and Naïve Bayes can utilize data from multiple source systems. A data platform that can loosely couple or combine data for such cross-functional advanced analytics can be critical for efficiently discovering insights from new sources of data (see discovery platform). More importantly, as specific advanced analytics are selected for operationalization, a data platform needs to integrate the results easily and support easy access regardless of where the advanced analytics are performed.

Degree Ranking for sample Analytics from the Healthcare Industry Graph

Degree   Analytic Label
3        How can we reduce the manual effort required to evaluate physician notes and medical records in conjunction with billing procedure codes?
10       How can the number of complaints to Medicare be reduced in an effort to improve the overall STAR rating?
22       What is the ratio of surgical errors to hospital patients? And total medical or surgical errors? (Provider, Payer)
47       Which providers are active in which networks and products? What is the utilization, in total, by network, and by product?
83       What are the trends over time in utilization for patients who use certain channels?
104      What is the cost of care PMPM, for medical, for pharmacy, and combined? How have clinical interventions impacted this cost over time?

The sample analytics listed above demonstrate varying degrees of cross-functional Degree Centrality and should be supported with varying levels of data coupling, ranging from non-coupled to loosely coupled to tightly coupled data. As Analytics with cross-functional Degree Centrality cluster together, it may indicate a need for tighter data coupling or data integration to drive consistency in the results being obtained. The clustering of Analytics may also indicate an emerging need for a data mart or an extension of the Integrated Data Warehouse that can be utilized by a broader audience.

In-Degree Ranking for sample Data Elements from the Healthcare Industry Graph

In-Degree   Source Element
46          Accounts Receivable*PROVIDER BILL-Bill Payer Party Id
31          Clinical*APPLICATION PRODUCT-Product Id
25          Medical Claims*CLAIM-Claim Num
25          Membership*MEMBER-Agreement Id

Similarly, if Data elements start to show high Degree Centrality, it may be worth re-assessing whether tighter coupling is needed to drive consistency and enable broader data reuse. When the In-Degree metric is applied, Data being used by more Analytics appears larger on the graph and is a likely candidate for tighter coupling. To support data design for tighter coupling from a cross-functional and even a cross-industry perspective, Teradata offers reference data model blueprints by industry. (See Teradata Data Models.)

This calls for a data management ecosystem with data analytics platforms that can easily harvest this cross-functional Degree Centrality of Analytics and Data. Such a data management ecosystem would support varying degrees of data coupling, varying types of analytics, and varying types of data access based on data users. (Learn more about Teradata’s Unified Data Architecture.)

The analysis described above is exploratory and by no means a replacement for a thorough architectural assessment. Ultimately, the decision on the right degree of data coupling should rest on the full set of architectural requirements, including but not limited to data integrity, security, and business value.

In conclusion, what our experiences have taught us in the past will still hold true for the future:
• Data sources are exponentially more valuable when combined or integrated with other data sets
• To maintain sustained competitive advantage, businesses have to continue searching for insights that build on the cross-functional centrality of data
• Unified data management ecosystems can now harvest this cross-functional centrality of data at a lower cost with efficient support for varying levels of data integration, analytic types, and users

Contact Teradata to learn more about how Teradata technology, architecture, and industry expertise can efficiently and effectively harvest this centrality of Data and Analytics.

[1] https://hbr.org/2014/11/how-smart-connected-products-are-transforming-competition

[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).

[4] This specific industry example is illustrative and subject to the limitations of assumptions and quality of the sample data mappings used for this study.


 

 

Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.

--

 

High Level Data Analytics Graph
(Healthcare Example)


Michael Porter, in an excellent article in the November 2014 issue of the Harvard Business Review[1], points out that smart connected products are broadening competitive boundaries to encompass related products that meet a broader underlying need. Porter elaborates that the boundary shift is not only from the functionality of discrete products to cross-functionality of product systems, but in many cases expanding to a system of systems such as a smart home or smart city.

So what does all this mean from a data perspective? In that same article, Porter mentions that companies seeking leadership need to invest in capturing, coordinating, and analyzing more extensive data across multiple products and systems (including external information). The key takeaway is that the movement toward gaining competitive advantage by searching for cross-functional or cross-system insights from data is only going to accelerate, not slow down. Exploiting cross-functional or cross-system centrality of data better than anyone else will remain critical to achieving a sustainable competitive advantage.

Understandably, as technology changes, the mechanisms and architecture used to exploit this cross-system centrality of data will evolve. Current technology trends point to a need for a data- and analytic-centric approach that leverages the right tool for the right job and orchestrates these technologies to mask complexity for end users, while also managing complexity for IT in a hybrid environment. (See this article published in Teradata Magazine.)

As businesses embrace the data & analytic-centric approach, the following types of questions will need to be addressed: How can business and IT decide on when to combine which data and to what degree? What should be the degree of data integration (tight, loose, non-coupled)? Where should the data reside and what is the best data modeling approach (full, partial, need based)? What type of analytics should be applied on what data?

Of course, to properly address these questions, an architecture assessment is called for. But for the sake of going beyond the obvious, one exploratory data point in addressing such questions could be to measure and analyze the cross-functional/cross-system centrality of data.

By treating data and analytics as a network of interconnected nodes in Gephi[2], the connectedness between data and analytics can be measured and visualized for such exploration. We can examine a statistical metric called Degree Centrality[3] which is calculated based on how well an analytic node is connected.

The high level sample data analytics graph demonstrates the cross-functional Degree Centrality of analytics from an Industry specific perspective (Healthcare). It also amplifies, from an industry perspective, the need for organizations to build an analytical ecosystem that can easily harness this cross-functional Degree Centrality of data analytics. (Learn more about Teradata’s Unified Data Architecture.)

In the second part of this blog post series we will walk through a zoomed-in view of the graph, analyze the Degree Centrality measurements for sample analytics, and draw some high-level data architecture implications.

[1] https://hbr.org/2014/11/how-smart-connected-products-are-transforming-competition

[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).


Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.

LA kicks off the 2014 Teradata User Group Season

Posted on: April 22nd, 2014 by Guest Blogger

 

By Rob Armstrong,  Director, Teradata Labs Customer Briefing Team

After presenting for years at the Teradata User Group meetings, it was refreshing to see some changes in this roadshow. While I had my usual spot on the agenda to present Teradata’s latest database release (15.0), we had some hot new topics, including Cloud and Hadoop; more business-level folks were there; more companies were researching Teradata’s technology (vs. just current users); and there was a hands-on workshop the following day for the more technically inclined, walking through real-world Unified Data Architecture™ (UDA) use cases from a Teradata customer. While LA tends to be a smaller venue than most, the room was packed, and we had 40% more attendees than last year.

I would be remiss if I did not give a big thanks to the partner sponsors of the user group meeting. In LA we had Hortonworks and Dot Hill as our gold and silver sponsors. I took a few minutes to chat with them and found out about some interesting upcoming items. Most notably, Lisa Sensmeier from Hortonworks talked to me about Hadoop Summit, which is coming up in San Jose, June 3-5. Jim Jonez, from Dot Hill, gave me the latest on their newest “Ultra Large” disk technology, where they’ll have 48 1TB drives in a single 2U rack. It is not in the Teradata lineup yet, but we are certainly intrigued for the proper use case.

Now, I’d like to take a few minutes to toot my own horn about the Teradata Database 15.0 presentation, which had some very exciting elements that help change the way users get to and analyze all of their data. You may have seen the recent news releases, but if not, here is a quick recap:

  • Teradata 15.0 continues our Unified Data Architecture with the new Teradata QueryGrid. This is the new environment for defining and accessing data from Teradata to other data servers such as Apache Hadoop (Hortonworks), the Teradata Aster Discovery Platform, Oracle, and others, and it lays the foundation for extension to even more foreign data servers. 15.0 simplifies definition and usage and adds bi-directional data movement and predicate pushdown. In a related session, Cesar Rojas provided some good recent examples of customers taking advantage of the entire UDA ecosystem, where data from all of the Teradata offerings was used together to generate new actions.
  • The other big news in 15.0 is the inclusion of the JSON data type. This allows customers to store JSON documents directly in a column and then apply “schema on read” for much greater flexibility with greatly reduced IT effort. As the JSON documents change, no table or database changes are necessary to absorb the new content (a rough illustration of the idea follows below).
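The sketch below is a rough, non-Teradata illustration of what “schema on read” means in practice: documents of different shapes are stored as-is, and each read decides which fields to extract, so a new attribute requires no table change. The field names are hypothetical examples, not Teradata syntax.

```python
# Schema applied at read time: new attributes need no schema change to be stored.
import json

stored_documents = [
    '{"customer": 101, "items": ["A12"], "total": 19.99}',
    # A newer document adds a "coupon" field; nothing had to be altered to store it.
    '{"customer": 102, "items": ["B07", "C33"], "total": 54.50, "coupon": "SPRING10"}',
]

for raw in stored_documents:
    doc = json.loads(raw)               # the "schema" is imposed here, at read time
    coupon = doc.get("coupon", "none")  # absent fields simply get a default
    print(doc["customer"], doc["total"], coupon)
```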

Keep your eyes and ears open for the next Teradata User Group event coming your way, or better yet, just go to the webpage: http://www.teradata.com/user-groups/ to see where the bus stops next and to register.  The TUGs are free of charge.  Perhaps we’ll cross paths as I make the circuit? Until then, ‘Keep Calm and Analyze On’ (as the cool kids say).

 Since joining Teradata in 1987, Rob Armstrong has worked in all areas of the data warehousing arena.  He has gone from writing and supporting the database code to implementing and managing systems at some of Teradata’s largest and most innovative customers.  Currently Rob provides sales and marketing support by traveling the globe and evangelizing the Teradata solutions.

 

The best Strata session that I attended was the overview Kurt Brown gave of the Netflix data platform, which contained hype-deflating lessons and many chestnuts of tech advice straight from one of the most intense computing environments on the planet.

Brown, who as a director leads the design and implementation of the data platform, had a cheerful demeanor but demonstrated ruthless judgment and keen insight in his assessment of how various technologies serve the goals of Netflix. It was interesting to me how dedicated he was to both MPP SQL technology and to Apache™ Hadoop.

I attended the session with Daniel Graham, Technical Marketing Specialist of Teradata, who spoke with me afterward about the implications of the Netflix architecture and Brown’s point of view.

SQL vs. Hadoop
Brown rejected the notion that it was possible to build a complete data platform using either SQL technology or Hadoop alone. In his presentation, Brown explained how Netflix made great use of Hadoop, used Hive for various purposes, and had an eye on Presto, but also couldn’t live without Teradata and Microstrategy as well.

Brown recalled a conversation in which another leader of a data platform explained that he was discarding all his data warehouse technology and going to put everything on Hive. Brown’s response, “Why would you ever want to do that?”

While Brown said he enjoyed the pressure that open source puts on commercial vendors to improve, he was dedicated to using whatever technology could provide answers to questions in the most cost-effective manner. Brown said he was especially pleased that Teradata was going to be able to support a cloud-based implementation that could run at scale. Brown said that Netflix had upwards of 5 petabytes of data in the cloud, all stored on Amazon S3.

After the session, I pointed out to Graham that the pattern in evidence at Netflix and most of the companies acknowledged as leaders in big data mimics the recommendation of the white paper “Optimize the Value of All Your Enterprise Data,” which provides an overview of the Teradata Unified Data Architecture™.

The Unified Data Architecture recommends that the data with the most “business value density” be stored in an enterprise data warehouse powered by MPP SQL. This data is used most often by the most users. Hadoop is used as a data refinery to process flat files or NoSQL data in batch mode.

Netflix is a big data company that arrived at this pattern by adding SQL to a Hadoop infrastructure. Many well-known users of huge MPP SQL installations have added Hadoop.

“Data doesn’t stay unstructured for long. Once you have distilled it, it usually has a structure that is well-represented by flat files,” said Teradata's Graham. “This is the way that the canonical model of most enterprise activity is stored. Then the question is: How do you ask questions of that data? There are numerous ways to make this easy for users, but almost all of those ways pump out SQL that is then used to grab the data that is needed.”

Replacing MPP SQL with Hive or Presto is a non-starter because to really support hundreds or thousands of users who are pounding away at a lot of data, you need a way to provide speedy and optimized queries and also to manage the consumption of the shared resources.

“For over 35 years, Teradata has been working on making SQL work at scale for hundreds or thousands of people at a time,” said Graham. “It makes perfect sense to add SQL capability to Hadoop, but it will be a long time, perhaps a decade or more, before you will get the kind of query optimization and performance that Teradata provides. The big data companies use Teradata and other MPP SQL systems because they are the best tool for the job for making huge datasets of high business value density available to an entire company.”

Efforts such as Tez and Impala will clearly move Hive’s capability forward. The question is how far forward and how fast. We will know that victory has been achieved when Netflix, which uses Teradata in a huge cloud implementation, is able to support their analytical workloads with other technology.

Graham predicts that in 5 years, Hadoop will be a good data mart but will still have trouble with complex parallel queries.

“It is common for a product like Microstrategy to pump out SQL statements that may be 10, 20, or even 50 pages long,” said Graham. “When you have 5 tables, the complexity of the queries could be 5 factorial. With 50 tables, that grows to 50 factorial. Handling such queries is a 10- or 20-year journey. Handling them at scale is a feat that many companies can never pull off.”
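Graham's factorial figures are easy to check: the number of possible join orderings over n tables grows roughly like n!, which is what makes optimizing queries that touch many tables so hard. A quick arithmetic sketch:

```python
# Rough arithmetic behind the factorial claim: possible join orderings of n tables.
from math import factorial

print(factorial(5))    # 120 possible orderings for 5 tables
print(factorial(50))   # roughly 3.0e64 orderings for 50 tables
```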

Graham acknowledges that most businesses need an MPP SQL data warehouse extended to support data discovery (e.g., the Teradata Aster Discovery Platform), along with extensions for using Hadoop and graph analytics through enhanced SQL.

Teradata is working to demonstrate that the power of this collection of technology can address some of the unrealistic enthusiasm surrounding Hadoop.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

 

In years past, Strata has celebrated the power of raw technology, so it was interesting to note how much the keynotes on Wednesday focused on applications, models, and how to learn and change rather than on speeds and feeds.

After attending the keynotes and some fascinating sessions, it seems clear that the blinders are off. Big data and data science have been proven in practice by many innovators and early adopters. The value of new forms of data and methods of analysis is so well established that there’s no need for exaggerated claims. Hadoop can do so many cool things that it doesn’t have to pretend to do everything, now or in the future. Indeed, the pattern in place at Facebook, Netflix, the Obama Campaign, and many other organizations with muscular data science and engineering departments is that MPP SQL and Hadoop sit side by side, each doing what they do best.

In his excellent session, Kurt Brown, Director, Data Platform at Netflix, recalled someone explaining that his company was discarding its data warehouse and putting everything on Hive. Brown responded, “Why would you want to do that?” What was obvious to Brown, and what he explained at length, is that the most important thing any company can do is assemble technologies and methods that serve its business needs. Brown demonstrated the logic of creating a broad portfolio that serves many different purposes.

Real Value for Real People
The keynotes almost all celebrated applications and models. Vendors didn’t talk about raw power, but about specific use cases and ease-of-use. Farrah Bostic, a marketing and product design consultant, recommended ways to challenge assumptions and create real customer intimacy. This was a key theme: Use the data to understand a person in their terms not yours. Bostic says you will be more successful if you focus on creating value for the real people who are your customers instead of extracting value from some stilted and limited model of a consumer. A skateboarding expert and a sports journalist each explained models and practices for improving performance. This is a long way from the days when a keynote would show a computer chewing through a trillion records.

Geoffrey Moore, the technology and business philosopher, was in true provocative form. He asserted that big data and data science are well on their way to crossing the chasm because so many upstarts pose existential threats to established businesses. This pressure will force big data to cross the chasm and achieve mass adoption. His money quote: "Without big data analytics, companies are blind and deaf, wandering out onto the Web like deer on the freeway.”

An excellent quote to be sure, but it goes too far. Moore would have been more accurate and less sensational if he said, “Without analytics,” not “Without big data analytics.” The reason that MPP SQL and Hadoop have made such a perfect pair is because more than one type of data and method of analysis is needed. Every business needs all the relevant data it can get to understand the people it does business with.

The Differentiator: A Culture of Analytics
The challenge I see companies facing lies in creating a culture of analytics. Tom Davenport has been a leader in promoting analytics as a means to competitive advantage. In his keynote at Strata Rx in September 2013, Davenport stressed the importance of integration.

In his session at Strata this year, Bill Franks, Chief Analytics Officer at Teradata, put it quite simply, "Big data must be an extension of an existing analytics strategy. It is an illusion that big data can make you an analytics company."

When people return from Strata and roll up their sleeves to get to work, I suspect that many will realize that it’s vital to make use of all the data in every way possible. But one person can only do so much. For data to have the biggest impact, people must want to use it. Implementing any type of analytics provides supply. Leadership and culture create demand. Companies like CapitalOne and Netflix don’t do anything without looking at the data.

I wish there were a shortcut to creating a culture of analytics, but there isn’t, and that’s why it’s such a differentiator. Davenport’s writings are probably the best guide, but every company must figure this out based on its unique situation.

Supporting a Culture of Analytics
If you are a CEO, your job is to create a culture of analytics so that you don’t end up like Geoffrey Moore’s deer on the freeway. But if you have Kurt Brown’s job, you must create a way to use all the data you have, to use the sweet spot of each technology to best effect, and to provide data and analytics to everyone who wants them.

At a company like Netflix or Facebook, creating such a data supply chain is a matter of solving many unique problems connected with scale and advanced analytics. But for most companies, common patterns can combine all the modern capabilities into a coherent whole.

I’ve been spending a lot of time with the thought leaders at Teradata lately and closely studying their Unified Data Architecture. Anyone who is seeking to create a comprehensive data and analytics supply chain of the sort in use at leading companies like Netflix should be able to find inspiration in the UDA, as described in a white paper called “Optimizing the Business Value of All Your Enterprise Data.”

The paper does excellent work in creating a framework for data processing and analytics that unifies all the capabilities by describing four use cases: the file system, batch processing, data discovery, and the enterprise data warehouse. Each of these use cases focuses on extracting value from different types of data and serving different types of users. The paper proposes a framework for understanding how each use case creates data with different business value density. The highest volume interaction takes place with data of the highest business value density. For most companies, this is the enterprise data warehouse, which contains a detailed model of all business operations that is used by hundreds or thousands of people. The data discovery platform is used to explore new questions and extend that model. Batch processing and processing of data in a file system extract valuable signals that can be used for discovery and in the model of the business.

While this structure doesn’t map exactly to that of Netflix or Facebook, for most businesses, it supports the most important food groups of data and analytics and shows how they work together.

The refreshing part of Strata this year is that thorny problems of culture and context are starting to take center stage. While Strata will always be chock full of speeds and feeds, it is even more interesting now that new questions are driving the agenda.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

Open Your Mind to all the Data

Posted on: February 14th, 2014 by Guest Blogger

 

In technology discussions, we often forget who the big winners are. When you look at almost any technology market, if a vendor makes a certain amount, the systems integrator may make 4 or 5 times that amount selling services to configure and adapt the solution. Of course, the biggest winner is the company using the technology, with a return of 10X, 20X, or even 100X. This year, I’m attending Strata and trying to understand how to hunt down and capture that 100X return.

I’m not alone in this quest. In the past Strata was mostly a technology conference, focused on vendor or systems integrator return. But this year, it is clear that the organizers realized they must find a way to focus on the 100X return, as evidenced by the Data-driven Business track. If Strata becomes an event where people can identify and celebrate the pathway to the 100X return, attendance will skyrocket.

Most of the time the 100X return comes from using data that only applies to a specific business. If you realize those kinds of returns, are you really going to come to Strata to talk about it? I don’t think so. So we won’t find sessions that provide a complete answer for your business. You will only get parts. The challenge when attending a conference like Strata is how to find all the parts, put them together, and come home with ideas for getting that 100X return.

My answer is to focus on questions, then on data, and then on technology that can find answers to the questions. There are many ways to run this playbook, but here’s one way to make it work.

Questions Achieve Focus
The problem that anyone attending Strata or any large conference faces is the huge array of vendors and sessions. One way to guide exploration is through curiosity. This maintains enthusiasm during the conference but may not give you a 100X idea. Another way is to begin with a predetermined problem in mind. This is a good approach, but it may cut off avenues that lead to a 100X result.

Remember: 100X results are usually found through experimentation. Everything with an obvious huge return is usually already being done. You have to find a way to be open to new ideas that are focused on business value. Here’s one way I’ve thought of, called the Question Game.

The Question Game is a method for aligning technology with the needs of a business. It is also a great way to organize a trip to a conference like Strata, where the ratio of information to marketing spin is high.

I came up with the Question Game while reading Walter Isaacson’s biography of Steve Jobs. Two things struck me. First was the way that Jobs and Tim Cook cared a lot about focus. Jobs held an annual offsite attended by about 100 of Apple’s top people. At the end of the session, Jobs would write up the 10 most important initiatives and then cross out 7 so the team could focus on executing just 3. Both Jobs and Cook were proud of saying no to ideas that detracted from their focus.

Here’s how Tim Cook describes it: “We are the most focused company that I know of, or have read of, or have any knowledge of. We say no to good ideas every day. We say no to great ideas in order to keep the amount of things we focus on very small in number, so that we can put enormous energy behind the ones we do choose, so that we can deliver the best products in the world.”

Because I focus most of my time figuring out how technology leaders can make magic in business, I was eager to find a way to empower people to say no. At the same time, I wanted room for invention and innovation. Is it possible to explore all the technology and data out there in an efficient way focused on business needs? Yes, using the Question Game. Here’s how it works:

  • The CIO, CTO, or whoever is leading innovation surveys each business leader, asking for questions they would like to have answered and the business value of answering them
  • Questions are then ranked by value, whether monetary or otherwise (a minimal sketch of this ranking step follows below)
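A minimal sketch of the ranking step, with hypothetical questions and value estimates standing in for a real survey of business leaders:

```python
# Hypothetical questions gathered from business leaders, each with an estimated
# value of answering it; sorting puts the highest-value questions first.
questions = [
    ("Which customers are likely to churn next quarter?", 5_000_000),
    ("Can we predict warranty claims from sensor data?", 2_000_000),
    ("Which promotions cannibalize full-price sales?", 750_000),
]

for question, value in sorted(questions, key=lambda qv: qv[1], reverse=True):
    print(f"${value:>12,}  {question}")
```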

The Question Game provides a clear way for the business to express its desires. With these questions in mind, Strata becomes a hunting ground far more likely to yield a path to 100X ideas. To prepare for this hunt, list all the questions and look at the agenda to find relevant sessions.

It’s the Data, All the Data
With your questions in hand, keep reminding yourself that all the technology in the world won’t answer the questions on your list. Usually, your most valuable questions will be answered by acquiring or analyzing data. The most important thing you can do is look for sources of data to answer the high value questions.

The first place to look for data is inside your company. In most companies, it is not easy to find all the available data and even harder to find additional data that could be available with a modest effort.

Too often the search for data stops inside the company. As I pointed out in “Do You Suffer from the Data Not Invented Here Syndrome?” many companies have blinders to external data sources, which can be an excellent way to shed light on high value questions. Someone should be looking for external data as well.

Remember, it is vital to keep an open mind. The important thing is to find data that can answer high value questions, not to find any particular type of data. Strata is an excellent place to do this. Some sessions shed light on particular types of data, but I’ve found that working the room, asking people what kinds of data they use, and showing them your questions can lead to great ideas.

Finding Relevant Technology
Once you have an idea about the important questions and the relevant data, you will be empowered to focus. When you tour the vendors, you can say no with confidence when a technology doesn’t have a hope of processing the relevant data to answer a high value question. Of course, it is impossible to tell from a session description or a vendor’s website if they will be relevant to a specific question. But by having the questions in mind and showing them to people you talk to, you will greatly speed the path to finding what you need. When others see your questions, they will have suggestions you didn’t expect.

Will this method guarantee you a 100X solution every time you attend Strata? I wouldn’t make that claim. But by following this plan, you have a much better chance for a victory.

While I'm at Strata, I'll be playing the question game so I can quickly and effectively learn a huge amount about the fastest path to a data-driven business.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

To Learn More:

Optimize the Business Value of All Your Enterprise Data (white paper)
Bring Dark Data into the Light (infographic)
Benefit from Any and All Data (article)
To Succeed with Big Data, Begin with the Decision in Mind (article)

Anna Littick and the Unified Data Architecture — Part 2

Posted on: October 16th, 2013 by Dan Graham

 

Ring ring ringtone.
Dan: “Hello. This is Dan at Teradata. How can I help you today?”

Anna: “Hi Dan. It’s Anna Littick from Sunshine-Stores calling again. Can we finish our conversation?”

Dan: “Oh yeah, hi Anna. Sure. Where did we leave off?”

Anna: “Well, you remember our new CFO – Xavier Money -- wants us to move everything to Hadoop because he thinks it’s all free. You and I were ticking through his perceptions.”

Dan: “Yes. I think we got through the first two but not numbers 3 and 4. Here’s what I remember:
1. Hadoop replaces the data warehouse
2. Hadoop is a landing zone and archive
3. Hadoop is a database
4. Hadoop does deep analytics.”

Anna: “Yep. So how do I respond to Xavier about those two?”

Dan: “Well, I guess we should start with ‘what is a database?’ I’ll try to keep this simple. A database has these characteristics:
• High performance data access
• Robust high availability
• A data model that isolates the schema from the application
• ACID properties

There’s a lot more to a database, but these are the minimums. High speed is the name of the game for databases. Data has to be restructured and indexed, with a cost-based optimizer, to be fast. Hive and Impala do a little restructuring of data but are a long way off from sophisticated indexes, partitioning, and optimizers. Those things take many years – each. For example, Teradata Database has multiple kinds of indexes: join indexes, aggregate indexes, hash indexes, and sparse indexes.”

Anna: “Ouch. What about the other stuff? Does Hive or Impala have that?”

Dan: “Well, high performance isn’t interesting if the data is not available. Between planned and unplanned downtime, a database has to hit 99.99% uptime or better to be mission critical. That’s roughly 53 minutes of downtime a year (0.01% of the year’s 525,600 minutes). Hundreds of hardware, software, and installation features have to mature to get there. I’m guessing a well-built Hadoop cluster is around 99% uptime. But just running out of memory in an application can cause the cluster to crash. There’s a lot of work to be done in Hadoop.”

“Second, isolating the application programs from the schema is the opposite of Hadoop’s strategic direction of schema-on-read. They don’t want fixed data types and enforcement of data rules. On the upside, this means Hadoop has a lot of flexibility – especially with complex data that changes a lot. On the downside, we have to trust every programmer to validate and transform every data field correctly at runtime. It’s dangerous and exciting at the same time. Schema-on-read works great with some kinds of data, but the majority of data works better with a fixed schema.”

Anna: “I’ll have to think about that one. I like the ‘no rules’ flexibility but I don’t like having to scrub the incoming data every time. I already spend too much time preparing data for predictive analytics.”

Dan: “Last are the ACID properties. It’s a complex topic you should look at on Wikipedia. It boils down to trusting the data as it’s updated. If a change is made to an account balance, ACID ensures that all of the changes are applied or none of them are, that no one else can change it at the same time you do, and that the changes are 100% recoverable across any kind of failure. Imagine you and your spouse at ATMs, each withdrawing $500 when there’s only $600 in the account. The database can’t give both of you $500 – that’s ACID at work. Neither Hadoop, Hive, Impala, nor any other project has plans to build the huge ACID infrastructure and become a true database. The Hadoop system isn’t so good at updating data in place.”

“According to Curt Monash, ‘Developing a good DBMS requires 5-7 years and tens of millions of dollars. That’s if things go extremely well.’[1]”
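To make the ATM scenario above concrete, here is a minimal Python sketch (not Teradata or Hadoop code) of two concurrent $500 withdrawals against a $600 balance. The account values are made up, and the lock is only a crude stand-in for the atomicity and isolation a database's ACID machinery provides.

```python
# Two concurrent withdrawals from one account: without isolation, both threads
# could read the old balance and both succeed. The lock serializes them.
import threading

balance = 600
lock = threading.Lock()

def withdraw(amount):
    global balance
    with lock:  # remove this lock and both withdrawals can slip through
        if balance >= amount:
            balance -= amount
            print(f"dispensed {amount}, balance now {balance}")
        else:
            print(f"declined {amount}, only {balance} available")

threads = [threading.Thread(target=withdraw, args=(500,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```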

Anna: “OK, Hadoop and Hive and Impala aren’t a database. So what? Who cares what you call it?”

Dan: “Well, a lot of end users, BI tools, ETL tools, and skills are expecting Hadoop to behave like a database. That’s not fair. It was not built for that purpose. Not being a database means Hadoop lacks a lot of functionality, but it also forces Hadoop to innovate and differentiate its strengths. Let’s not forget Hadoop’s progress in basic search indexing, archival of cold data, simple reporting at scale, and image processing. We’re at the beginning of a lot of innovation and it’s exciting.”

Anna: “OK. I’ll trust you on that. What about deep analytics? That’s what I care about most.”

Dan: “So Anna, off the record, you being a data scientist and all that. Do people tease you about your name? I mean Anna Littick the data scientist? I Googled you and you’re not the only one. ”

Anna: “Yes. Some guys around here think it’s funny. Apparently childishness isn’t limited to children. So during meetings I throw words at them like Markov Chains, Neural Networks, and edges in graph partitions. They pretend to understand – they nod a lot. Those guys never tease me again. [laugh]”

Dan: “Hey, those advanced analytics you mentioned are powerful stuff. You should hear David Simmen talk at our PARTNERS conference on Sunday. He’s teaching about our new graph engine that handles millions of vertices and billions of edges. It sounds like you would enjoy it.”

Anna: “Well, it looks like I have approval to go, especially since PARTNERS is here in Dallas. Enough about me. What about deep analytics in Hadoop?”

Dan: “Right. OK, well first I have to tell you we do a lot of predictive and prescriptive analytics in-database with Teradata. I suspect you’ve been using SAS algorithms in-database already. The parallelism makes a huge difference in accuracy. What you probably haven’t seen is our Aster Database, where you can run map-reduce algorithms under the control of SQL for fast, iterative discovery. It can run dozens of complex analytic algorithms, including map-reduce algorithms, in parallel. And we just added the graph engine I mentioned in our 6.0 release. And one thing it does that Hadoop doesn’t: you can use your BI tools, SAS procs, and map-reduce all in one SQL statement. It’s ultra cool.”

Anna: “OK. I think I’ll go to David’s session. But what about Hadoop? Can it do deep analytics?”

Dan: “Yes. Both Aster and Hadoop can run complex predictive and prescriptive analytics in parallel. They can both do statistics, random forests, Markov Chains, and all the basics like naïve Bayes and regressions. If an algorithm is hard to do in SQL, these platforms can handle it.”

Anna [impatient]: “OK. I’ll take the bait. What’s the difference between Aster and Hadoop?”

Dan: “Well, Aster has a database underneath its SQL-MapReduce, so you can use the BI tools interactively. There is also a lot of emphasis on behavioral analysis, so the product has things like Teradata Aster nPath time-series analysis to visualize patterns of behavior and detect many kinds of consumer churn events or fraud. Aster has more than 80 algorithms packaged with it, as well as SAS support. Sorry, I had to slip that Aster commercial in. It’s in my contract – sort of. Maybe. If I had a contract.”

Anna: “And what about Hadoop?”

Dan: “Hadoop is more of a do-it-yourself platform. There are tools like Apache Mahout for data mining. It doesn’t have as many algorithms as Aster, so you often find yourself getting algorithms from university research or GitHub and implementing them yourself. Some Teradata customers have implemented Markov Chains on Hadoop because it’s much easier to work with than SQL for that kind of algorithm. So data scientists have more tools than ever: Teradata in-database algorithms, Aster SQL-MapReduce, SAS, Hadoop/Mahout, and others. That’s what our Unified Data Architecture does for you – it matches workloads to the best platform for that task.”
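As an illustration of the "implement it yourself" style Dan describes, here is a minimal sketch of estimating first-order Markov Chain transition probabilities from event sequences. The session data is hypothetical, and in practice the counting step is what would be rewritten as a parallel MapReduce job over Hadoop rather than run in plain Python.

```python
# Estimate first-order Markov Chain transition probabilities from hypothetical
# clickstream sessions by counting observed state-to-state transitions.
from collections import Counter, defaultdict

sessions = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "product", "cart", "home"],
    ["search", "product", "product", "cart", "checkout"],
]

counts = defaultdict(Counter)
for events in sessions:
    for current, nxt in zip(events, events[1:]):
        counts[current][nxt] += 1

for state, followers in counts.items():
    total = sum(followers.values())
    probs = {nxt: round(n / total, 2) for nxt, n in followers.items()}
    print(state, "->", probs)
```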

Anna: “OK. I think I’ve got enough information to help our new CFO. He may not like me bursting his ‘free-free-free’ monastic chant. But just because we can eliminate some initial software costs doesn't mean we will save any money. I’ve got to get him thinking of the big picture for big data. You called it UDA, right?”

Dan: “Right. Anna, I’m glad I could help, if only just a little. And I’ll send you a list of sessions at Teradata PARTNERS where you can hear from experts about their Hadoop implementations – and Aster. See you at PARTNERS.”

Title | Company | Day | Time | Comment
Aster Analytics: Delivering results with R Desktop | Teradata | Sun | 9:30 | RevolutionR
Do’s and Don’ts of using Hadoop in practice | Otto | Sun | 1:00 | Hadoop
Graph Analysis with Teradata Aster Discovery Platform | Teradata | Sun | 2:30 | Graph
Hadoop and the Data Warehouse: When to use Which | Teradata | Sun | 4:00 | Hadoop
The Voices of Experience: A Big Data Panel of Experts | Otto, Wells Fargo | Wed | 9:30 | Hadoop
An Integrated Approach to Big Data Analytics using Teradata and Hadoop | PayPal | Wed | 11:00 | Hadoop
TCOD: A Framework for the Total Cost of Big Data | WinterCorp | Wed | 11:00 | Costs

[1] Curt Monash, DBMS development and other subjects, March 18, 2013

 

About one year ago, Teradata Aster launched a powerful new way of integrating a database with Hadoop. With Aster SQL-H™, users of the Teradata Aster Discovery Platform got the ability to issue SQL and SQL-MapReduce® queries directly on Hadoop data as if that data had been in Aster all along. This level of simplicity and performance was unprecedented, and it enabled BI and SQL analysts who knew nothing about Hadoop to access Hadoop data and discover new information through Teradata Aster.

This innovation was not a one-off. Teradata has put forward the most complete vision for a data and analytics architecture in the 21st century. We call that the Unified Data Architecture™. The UDA combines Teradata, Teradata Aster & Hadoop into a best-of-breed, tightly integrated ecosystem of workload-specific platforms that provide customers the most powerful and cost-effective environment for their analytical needs. With Aster SQL-H™, Teradata provided a level of software integration between Aster & Hadoop that was, and still is, unchallenged in the industry.

 


Teradata Unified Data Architecture™

Today, Teradata makes another leap in making its Unified Data Architecture™ vision a reality. We are announcing SQL-H™ for Teradata, bringing the best SQL engine for data warehousing and analytics to Hadoop. From now on, Enterprises that use Hadoop to store large amounts of data will be able to utilize Teradata's analytics and data warehousing capabilities to directly query Hadoop data securely through ANSI standard SQL and BI tools by leveraging the open source Hortonworks HCatalog project. This is fundamentally the best and tightest integration between a data warehouse engine and Hadoop that exists in the market today. Let me explain why.

It is interesting to consider Teradata's approach versus the alternatives. If one wants to execute SQL on Hadoop, with the intent of building Data Warehouses out of Hadoop data, there are not many realistic options. Most databases have very poor integration with Hadoop and require Hadoop experts to manage the overall system - not a viable option for most Enterprises due to cost. SQL-H™ removes this requirement for Teradata/Hadoop deployments. Another "option" is the SQL-on-Hadoop tools that have started to emerge; unfortunately, they are about a decade away from becoming sufficiently mature to handle true Data Warehousing workloads. Finally, the approach of taking a database and shoving it inside Hadoop has significant issues, since it suffers from the worst of both worlds – Hadoop activity has to be limited so that it doesn't disrupt the database, data is duplicated between HDFS and the database store, and the performance of the database is lower than that of a stand-alone version.

In contrast, a Teradata/Hadoop deployment with SQL-H™ offers the best of both worlds: unprecedented performance and reliability in the Teradata layer; seamless BI & SQL access to Hadoop data via SQL-H™; and it frees up Hadoop to perform data processing tasks at full efficiency.

Teradata is committed to being the strategic advisor of the Enterprise when it comes to Data Warehousing and Big Data. Through its Unified Data Architecture™ and today's announcement on Teradata SQL-H™, it provides even more performance, flexibility and cost-effective options to Enterprises eager to use data as a competitive advantage.

Teradata and Informatica Data Integration Optimization

Posted on: March 6th, 2013 by Data Analytics Staff

 

Teradata and Informatica created a joint offering to optimize data integration processes by leveraging the right platform for the right transformation job. The offering is based on Teradata's powerful analytical infrastructure, the Teradata Unified Data Architecture (UDA), which includes Teradata, Teradata Aster, and Hadoop, combined with Informatica PowerCenter Big Data Edition.

Today, customers are creating new processes to manage and integrate Big Data for deeper and richer analytics. Customers are also evaluating options to optimize their data integration processes for faster deployment and scalable performance at reduced costs. And because we are talking about Big Data, Hadoop is increasingly being considered for data integration processing.

But as is always the case when deploying or evaluating new technologies, it is important to understand the current business and technical challenges in order to ensure the appropriate solution is deployed. Business challenges include the increasing cost of data processing as data volumes grow and new types of data emerge, as well as the risk of deploying the wrong platform. Technical challenges related to big data analytics include finding the right skills, the complexity of deploying new technologies into production data centers, the need for more real-time data, and the lack of enterprise capabilities such as security.

The new Teradata and Informatica offering helps customers optimize their data integration processes and deploy a single Unified Data Architecture. It allows companies to design data integration processes once and deploy them on any Teradata platform, with up to 5x productivity gains, using the resource skills they have today. In this architecture, customers can process data from traditional transaction systems as well as unstructured text and interaction data, such as social media and web logs. Customers may choose to run data integration processes on Teradata platforms, Hadoop, or Informatica servers.

Teradata Unified Architecture with Data Integration Optimization

The key benefits of deploying the Teradata Unified Data Architecture with Informatica PowerCenter Big Data Edition are increased productivity (up to 5x over hand-coding, at up to half the labor cost), reduced risk for Big Data projects, and protection against poor and sub-optimal decisions. The offering allows staff to focus on meeting business requirements rather than hand-coding transformation logic, so customers can innovate faster and generate new revenue streams.

Teradata and Informatica provide assessment service offerings to help customers get started quickly and make smart decisions about data integration optimization, reducing the risk of offloading the wrong work onto the wrong platform while lowering labor costs and delivering faster performance.

See the joint Informatica and Teradata press release.

Sam Tawfik