Unified Data Architecture


One way to look at progress in technology is to recognize that each new generation provides a better version of what we’ve always wanted. If you look back at the claims for Hollerith punch card-based computing or the first generation of IBM mainframes, you find that the language is recognizable and can be found in marketing material for modern technology.

This year’s model of technology (and those from 50 or 100 years ago) will provide more efficiency, transparency, automation, and productivity. Yeehaw! I can’t wait. Oh, by the way, the current generation of big data technology will provide the same thing.

And, in fact, every generation of technology has fulfilled these enduring promises, improving on what was achieved in the past. What is important to understand is how. It is often the case that in emphasizing the “new newness” of what is coming down the pike, we forget about essential elements of value in the generation of technology that is being surpassed.

This pattern is alive and well in the current transformation taking place in the world of IT with the arrival of big data technology, which is changing so many things for the better. The problem is that exaggerations about one aspect of what is new in big data processing, “schema on read” — the ability to add structure at the last minute — are obscuring the need for a process to design and communicate a standard structure for your data, known as “schema on write.”

Here’s the problem in a nutshell:
• In the past, the entire structure of a database was designed at the beginning of a project. The questions that needed to be answered determined the data that needed to be provided, and well-understood methods were created to model that data, that is, to provide structure so that the questions could be answered. The idea of “schema on write” is that you couldn’t really store the data until you had determined its structure.
• Relational database technology and the SQL language were used to answer those questions, which was a huge improvement over having to write a custom program to process each query.
• But as time passed, more data arrived and more questions needed to be answered. It became challenging to manage and change the model in an orderly fashion. People wanted to use new data and answer new questions faster than they could by waiting to get the model changed.

Okay, let’s stop and look at the good and the bad so far. The good is that structure allowed data to be used more efficiently. The more people who used the structure, the more value it created. So, when you have thousands of users asking questions and getting answers from thousands of tables, everything is super great. Taking the time to manage the structure and get it right is worth it. Schema on write is, after all, what drives business fundamentals, such as finance.

But the world is changing fast and new data is arriving all the time, which is not the strength of schema on write. If a department wants to use a new dataset, staff can’t wait for a long process where the central model is changed and the new data arrives. It’s not even clear whether every new source of data should be added to the central model. Unless a large number of people are going to use it, why bother? For discovery, schema on read makes excellent sense.

Self-service technologies such as spreadsheets and other data discovery tools are used to find answers in this new data. What is lost in this process is the fact that almost all of this data has structure that must be described in some way before the data is used. In a spreadsheet, you need to parse most data into columns. The end user or analyst does this sort of modeling, not the central keeper of the database, the database administrator, or some other specialist. One thing to note about this sort of modeling is that it is done to support a particular purpose, not thousands of users. In fact, adding this sort of structure to data is not generally thought of as modeling, but it is.

Schema on write drives the business forward. So, for big data, for any data, structure must be captured and managed. The most profound evidence of this is the way that all of the “born digital” companies such as Facebook, Netflix, LinkedIn, and Twitter have added large scale SQL databases to their data platforms. These companies were forced to implement schema on write by the needs and scale of their businesses.

Schema on read leads to massive discoveries. Schema on write operationalizes them. They are not at odds; both contribute to the process of understanding data and making it useful. To make the most of all their data, businesses need both schema on read and schema on write.
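To make the contrast concrete, here is a minimal sketch of the two approaches. It is illustrative only (the table, fields, and JSON records are invented for this example), but it shows structure declared before load, schema on write, versus structure imposed at query time, schema on read.

```python
import json
import sqlite3

# Schema on write: the structure is declared up front, and data must fit
# the model before it can be loaded.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
db.execute("INSERT INTO orders VALUES (?, ?, ?)", (1, "Acme", 250.0))
print(db.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer").fetchall())

# Schema on read: raw records are stored as-is (here, JSON strings), and
# structure is imposed only at query time by whoever needs the answer.
raw_records = [
    '{"order_id": 2, "customer": "Acme", "amount": 99.5, "coupon": "SPRING"}',
    '{"order_id": 3, "customer": "Globex", "amount": 410.0}',
]
totals = {}
for rec in (json.loads(r) for r in raw_records):
    # The "model" lives in this loop, not in the storage layer.
    totals[rec["customer"]] = totals.get(rec["customer"], 0.0) + rec["amount"]
print(totals)
```

The first form pays a design cost up front so that many users can share one model; the second defers that cost, which is exactly why it suits discovery.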


Dan Woods is CTO and founder of CITO Research. He has written more than 20 books about the strategic intersection of business and technology. Dan writes about data science, cloud computing, mobility, and IT management in articles, books, and blogs, as well as in his popular column on Forbes.com.

How Analytics Turns IoT Data into Dollars

Posted on: October 19th, 2015 by Chris Twogood


The buzz around the term “Internet of Things” (IoT) amplifies with each passing day. It’s taking some time, however, for everyone to fully comprehend just how valuable this phenomenon has become for our world and our economy. Part of this has to do with the learning curve in understanding the sophisticated technologies and analytics involved. But part of it is the sheer, staggering scope of value that’s possible worldwide. A comprehensive study in June 2015 by the McKinsey Global Institute, in fact, concluded that IoT is one of those rare technology trends where the “hype may actually understate the full potential.”

The Internet of Things is our constantly growing universe of sensors and devices that create a flood of granular data about our world. The “things” include everything from environmental sensors monitoring weather, traffic or energy usage; to “smart” household appliances and telemetry from production-line machines and car engines. These sensors are constantly getting smarter, cheaper and smaller (many sensors today are smaller than a dime, and we’ll eventually see smart dust: thousands of small processors that look like dust and are sprinkled on surfaces, swallowed or poured.)

Smart Analytics Drive IoT Value

As the volume and variety of sensors and other telemetry sources grow, the connections among them and the analytic possibilities grow as well, creating an IoT value curve that rises exponentially over time. IDC predicts the installed base of IoT connected things will reach more than 29.5 billion in 2020, with economic value-add across sectors by then topping $1.9 trillion. For all the focus on sensors and connections, however, the key driver of value is the analytics we can apply to reap insights and competitive advantage.

As we build better algorithms for the burgeoning IoT digital infrastructure, we are learning to use connection-based “smart analytics” to become far more proactive in predicting future performance and conditions, and even in prescribing future actions. Consider a machine or device failure: what if we could predict it before it ever happens? With advanced smart analytics today, we can. The approach is called predictive maintenance, and it uses the probability-based Weibull distribution and other techniques to estimate time-to-failure rates so that a breakdown can be predicted before it occurs.
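The post does not show the underlying math, but a small sketch conveys the idea. The failure times below are invented, and the use of scipy's Weibull fit is an illustrative choice rather than a description of any vendor's implementation.

```python
import numpy as np
from scipy.stats import weibull_min

# Hypothetical run-to-failure times (operating hours) for one component type;
# in practice these come from maintenance records and sensor history.
failure_hours = np.array([812, 945, 1033, 1102, 1180, 1264, 1347, 1429, 1518])

# Fit a two-parameter Weibull distribution (location fixed at zero).
shape, loc, scale = weibull_min.fit(failure_hours, floc=0)

# Probability that a unit fails before the next scheduled service at 900 hours.
next_service = 900
p_fail = weibull_min.cdf(next_service, shape, loc=loc, scale=scale)
print(f"shape={shape:.2f}, scale={scale:.0f}h, "
      f"P(failure by {next_service}h)={p_fail:.1%}")

# One simple maintenance rule: intervene before the 10% failure quantile.
b10_life = weibull_min.ppf(0.10, shape, loc=loc, scale=scale)
print(f"Inspect or replace by roughly {b10_life:.0f} operating hours")
```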

One major provider of medical diagnostic and treatment machines has leveraged predictive maintenance to create “wearout models” for component parts in its products. This enabled early detection and identification of problems, as well as proactive root cause analysis to prevent down time and unplanned outages. A large European train manufacturer, meanwhile, is leveraging similar techniques to prevent train engine failure. It’s a key capability that has enabled the firm to expand into the leasing market – a line of business that’s profitable only if your trains remain operational.

Building IoT Architectures

There is really no limit to how far we can take this alchemy of sensors, connections and algorithms to create more and more complex systems and solutions to the problems facing businesses. But success remains impossible without the right analytics architectures in place. Most companies today still struggle to capitalize on and make use of all this IoT data.

Indeed, McKinsey’s June 2015 IoT report found that less than one percent of IoT data is currently used; and those uses tend to be straightforward things like alarm activation or real-time controls rather than advanced analytics that can help optimize business processes or make predictions.

Even the most tech-savvy businesses are now realizing that extracting value from the data is a difficult and skills-intensive process. Top priorities include intelligent “listening” to massive streams of IoT data to uncover distinctive patterns that may be signposts to valuable insights. We must ingest and propagate that data into an analytical ecosystem where advanced machine learning algorithms, operating at scale, can reap sophisticated, actionable insights.

Agility is key: Architectures need to follow multiple streams of sensor and IoT data in real-time and deploy an agile central ingestion platform to economically and reliably listen to all relevant data. Architectures also should be configured to deploy advanced analytics – including machine learning, path, pattern, time series, statistics, graph, and text analytics – against massive volumes of data. The entire environment should be thoroughly self-service to enable rapid innovation of any new data set and avoid bogging down IT personnel with costly, requirements-driven custom projects.
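As a toy illustration of that kind of stream “listening” (not a description of any particular platform), the sketch below flags sensor readings that drift well outside a rolling baseline; the data and thresholds are invented.

```python
from collections import deque
from statistics import mean, stdev

def flag_outliers(readings, window=20, threshold=3.0):
    """Yield readings that sit more than `threshold` standard deviations
    away from the rolling window of recent values."""
    recent = deque(maxlen=window)
    for timestamp, value in readings:
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield timestamp, value, mu
        recent.append(value)

# Invented vibration readings; a real platform would consume these from a
# message bus rather than an in-memory list.
stream = [(t, 0.42 + 0.01 * (t % 5)) for t in range(100)] + [(100, 1.90)]
for ts, value, baseline in flag_outliers(stream):
    print(f"t={ts}: reading {value} deviates from rolling mean {baseline:.2f}")
```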

These are the kind of capabilities companies must pursue to economically spot and act upon new business opportunities made possible by the Internet of Things. It takes a good deal of investment and strategic planning, but the payoff in terms of analytic insights, competitive advantage and future revenue is well worth it.

Teradata Uses Open Source to Expand Access to Big Data for the Enterprise

Posted on: September 30th, 2015 by Data Analytics Staff


By Mark Shainman, Global Program Director, Competitive Programs

Teradata’s announcement of the accelerated release of enterprise-grade ODBC/JDBC drivers for Presto opens up an ocean of big data on Hadoop to the existing SQL-based infrastructure. For companies seeking to add big data to their analytical mix, easy access through Presto can solve a variety of problems that have slowed big data adoption. It also opens up new ways of querying data that were not possible with some other SQL on Hadoop tools. Here’s why.

One of the big questions facing those who toil to create business value out of data is how the worlds of SQL and big data come together. After the first wave of excitement about the power of Hadoop, the community quickly realized that because of SQL’s deep and wide adoption, Hadoop must speak SQL. And so the race began. Hive was first out of the gate, followed by Impala and many others. The goal of all of these initiatives was to make the repository of big data that was growing inside Hadoop accessible through SQL or SQL-like languages.

In the fall of 2012, Facebook determined that none of these solutions would meet its needs, so it created Presto as a high-performance way to run SQL queries against data in Hadoop. By 2013, Presto was in production, and it was released as open source in November of that year.

In 2013, Facebook found that Presto was faster than Hive/MapReduce for certain workloads, although there are many efforts underway in the Hive community to increase its speed. Facebook achieved these gains by bypassing the conventional MapReduce programming paradigm and creating a way to interact with data in HDFS, the Hadoop file system, directly. This and other optimizations at the Java Virtual Machine level allow Presto not only to execute queries faster, but also to use other stores for data. This extensibility allows Presto to query data stored in Cassandra, MySQL, or other repositories. In other words, Presto can become a query aggregation point, that is, a query processor that can bring data from many repositories together in one query.
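To illustrate the query aggregation idea, here is a sketch of a single Presto query that joins data in Hadoop with data in MySQL. It assumes the open-source presto-python-client (prestodb), a placeholder coordinator host, and catalog, schema, and table names that are purely illustrative; the corresponding connectors would need to be configured on the Presto cluster.

```python
import prestodb  # presto-python-client

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # placeholder coordinator host
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# One query spans two repositories: clickstream data in Hadoop (hive catalog)
# joined to customer records in MySQL (mysql catalog).
cur.execute("""
    SELECT c.region, COUNT(*) AS page_views
    FROM hive.web.page_views AS v
    JOIN mysql.crm.customers AS c ON v.user_id = c.user_id
    GROUP BY c.region
    ORDER BY page_views DESC
""")
for region, views in cur.fetchall():
    print(region, views)
```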

In June 2015, Teradata announced a full embrace of Presto. Teradata would add developers to the project, add missing features both as open source and as proprietary extensions, and provide enterprise-grade support. This move was the next step in Teradata’s effort to bring open source into its ecosystem. The Teradata Unified Data Architecture provides a model for how traditional data warehouses and big data repositories can work together. Teradata has supported integration of open source first through partnerships with open source Hadoop vendors such as Hortonworks, Cloudera, and MapR, and now through participation in an ongoing open source project.

Teradata’s embrace of Presto provided its customers with a powerful combination. Through Teradata QueryGrid, analysts can use the Teradata Data Warehouse as a query aggregation point and gather data from Hadoop systems, other SQL systems, and Presto. The queries in Presto can aggregate data from Hadoop, but also from Cassandra and other systems. This capability allows the Teradata Unified Data Architecture to provide data access across a broad spectrum of big data platforms.

To provide Presto support for mainstream BI tools required two things: ANSI SQL support and ODBC/JDBC drivers. Much of the world of BI access works through BI toolsets that understand ANSI SQL. A tool like QlikView, MicroStrategy, or Tableau allows a user to easily query large datasets as well as visualize the data without having to hand-write SQL statements, opening up the world of data access and data analysis to a larger number of users. Having robust BI tool support is critical for broader adoption of Presto within the enterprise.

For this reason, ANSI SQL support is crucial to making the integration and use of BI tools easy. Many of the other SQL on Hadoop projects offer limited SQL support or rely on proprietary SQL-like languages. Presto is not one of them. To meet the needs of Facebook, SQL support had to be strong and conform to ANSI standards, and Teradata’s joining the project will make the scope and quality of Presto’s SQL support stronger still.

The main way that BI tools connect and interact with databases and query engines is through ODBC/JDBC drivers. For the tools to communicate well and perform well, these drivers have to be solid and enterprise class. That’s what yesterday’s announcement is all about.

Teradata has listened to the needs of the Presto community and accelerated its plans for adding enterprise-grade ODBC/JDBC support to Presto. In December, Teradata will make available a free, enterprise class, fully supported ODBC driver, with a JDBC driver to follow in Q1 2016. Both will be available for download on Teradata.com.

With ODBC/JDBC drivers in place and the ANSI SQL support that Presto offers, anyone using modern BI tools can access data in Hadoop through Presto. Of course, certification of the tools will be necessary for full functionality to be available, but with the drivers in place, access is possible. Existing users of Presto, such as Netflix, are extremely happy with the announcement. As Kurt Brown, Director, Data Platform at Netflix put it, “Presto is a key technology in the Netflix big data platform. One big challenge has been the absence of enterprise-grade ODBC and JDBC drivers. We think it’s great that Teradata has decided to accelerate their plans and deliver this feature this year.”
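For instance, once the ODBC driver is installed and a data source is configured, any ODBC-capable client can issue ANSI SQL against Presto. The snippet below uses Python's pyodbc as a stand-in for a BI tool; the DSN name and table are assumptions for illustration.

```python
import pyodbc

# "Presto" here is a locally configured ODBC data source name (DSN) that
# points at the Presto ODBC driver and coordinator.
conn = pyodbc.connect("DSN=Presto", autocommit=True)
cursor = conn.cursor()
cursor.execute("""
    SELECT event_date, COUNT(*) AS events
    FROM hive.web.page_views
    GROUP BY event_date
    ORDER BY event_date
""")
for row in cursor.fetchall():
    print(row.event_date, row.events)
```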

The Benefits and Evolution of the Hadoop Appliance

Posted on: July 9th, 2015 by Chris Twogood


Running Hadoop on an appliance offers significant benefits, but as Hadoop workloads become more sophisticated, so too must the appliance. That’s exactly why we’re releasing the ‘new’ Teradata Appliance for Hadoop 5. Our new appliance has evolved alongside Hadoop usage scenarios while giving IT organizations more freedom of choice to run diverse workloads. Running Hadoop on an appliance makes more sense than ever before.

If you’re running – or thinking about running – Hadoop on an appliance, you’re not alone. According to an ESG survey reported on by SearchDataCenter.com, 21% of IT organizations are considering dedicated analytics appliances. That’s the same percentage of organizations that are considering public cloud solutions and double those considering a public/private hybrid deployment. What is driving the adoption of Hadoop appliances?

5 Key Benefits of Running Hadoop on an Appliance

Organizations that choose to deploy Hadoop on an appliance versus rolling out their own solution realize five important benefits.

  1. Hadoop is delivered ready to run.

We’ve heard industry experts say that it can take IT organizations six to eight months to roll out a Hadoop implementation on their own. With a Teradata appliance, we’ve done all the hard work in terms of installing and configuring multiple types of software as well as installing and configuring the operating system, networking and the like. You simply plug it in, and within days you are up and running.

  2. We’ve built high availability into our Hadoop appliances.

The Teradata Vital Infrastructure (TVI) proactively detects and resolves incidents. In fact, up to 72% of all hardware- and software-related incidents are detected and resolved by TVI before the customer even knows about them. We also run BYNET over InfiniBand, which delivers automated network load balancing, automated network failover, redundancy across two active fabrics, and multiple levels of network isolation. These features in Teradata Appliance for Hadoop 5 deliver the high availability IT organizations need in an enterprise-grade solution.

  3. It is Unified Data Architecture ready.

It’s not enough to just efficiently deploy Hadoop. IT organizations must be able to efficiently deploy Hadoop as a seamless part of an interconnected analytics ecosystem. The UDA-ready Hadoop appliance becomes an integral part of the organization’s larger data fabric, with BYNET over InfiniBand interconnect between Hadoop, the Integrated Data Warehouse and Aster big data analytics, and software integration such as QueryGrid, Viewpoint, TDCH, and Smart Loader.

  4. Single vendor support.

An appliance replaces the multiple support contracts IT organizations have with their hardware provider, Hadoop vendor, OS vendor, and various utilities, with a single “hand to shake.” If there’s any problem, one phone call puts you in touch with Teradata’s world-class, 24/7, multi-language support for the entire solution stack. IT organizations are seeing increasing value in this benefit as the Hadoop ecosystem has many moving parts associated with it, and single vendor support provides peace of mind.

  5. Running Hadoop on an appliance lowers your total cost of ownership (TCO).

The cost of deploying Hadoop includes much more than the hardware the software runs on. There are also costs associated with configuring the network, installing the OS, configuring the disks, installing the Hadoop environment, tuning the Hadoop environment, and testing. The costs for doing all this work internally add up, making the TCO of an appliance even more attractive.

What’s New with Teradata Appliance for Hadoop 5?

In addition to these five benefits, Teradata Appliance for Hadoop 5 delivers freedom of choice to run a variety of workloads. IT organizations now have more options when they run Hadoop on Teradata Appliance for Hadoop 5.

Recognizing that Hadoop workloads are diverse and evolving, Teradata Appliance for Hadoop 5 is available in three flexible configurations, enabling customers to select the configuration that best fits their workloads.

  • Performance configuration. For real-time processing and other workloads that require significant CPU, IO, and memory, we offer the performance configuration. This computationally intensive configuration enables organizations to run emerging Hadoop workloads such as streaming, Spark, and SQL on Hadoop. With 24 cores, this configuration has more cores per node, along with 512GB of RAM and 24 storage disks with 1.2TB drives.
  • Capacity configuration. The capacity configuration allows IT organizations to drive down the cost per terabyte. It is designed for heavy-duty, long-running batch jobs as well as long-term archival and storage. It comes with 128GB to 256GB of RAM and 4TB disk drives.
  • Balance configuration. The balance configuration sits between the performance and capacity configurations, allowing IT organizations to strike the right balance for ETL and analytics jobs. The balance configuration features 24 cores and a 4TB capacity drive.

Learn more about Teradata’s Portfolio for Hadoop.


Zoomed-in view of Data Analytics Graph
(Healthcare Example)


In the first part of this two-part blog series, I discussed the competitive importance of cross-functional analytics [1]. I also proposed that by treating Data and Analytics as a network of interconnected nodes in Gephi [2], we can examine a statistical metric for analytics called Degree Centrality [3]. In this second part of the series, I will examine parts of the sample Healthcare industry graph animation in detail and draw some high-level conclusions from the Degree Centrality measurement for analytics.

In this sample graph [4], link analysis was performed on a network of 3,428 nodes and 8,313 directed edges. The majority of the nodes represent either Analytics or Source Data Elements. Many Analytics in this graph require data from multiple source systems, resulting in cross-functional Degree Centrality (connectedness). Some of the Analytics in this study display more Degree Centrality than others.
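The same measurement can be reproduced programmatically. The sketch below uses networkx rather than Gephi, with a handful of invented nodes that echo the article's examples: edges point from each analytic to the source data elements it consumes, so an analytic's out-degree is its cross-functional breadth and a data element's in-degree is its reuse.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    # (analytic, source data element it requires)
    ("Physician notes text analysis", "Clinical*Product Id"),
    ("STAR rating complaint analysis", "Clinical*Product Id"),
    ("STAR rating complaint analysis", "Medical Claims*Claim Num"),
    ("Cost of care PMPM", "Medical Claims*Claim Num"),
    ("Cost of care PMPM", "Membership*Agreement Id"),
    ("Cost of care PMPM", "Accounts Receivable*Bill Payer Party Id"),
])

analytics = [n for n in g if g.out_degree(n) > 0]
elements = [n for n in g if g.in_degree(n) > 0]

# Degree ranking for analytics (cross-functional connectedness).
for name in sorted(analytics, key=g.out_degree, reverse=True):
    print(g.out_degree(name), name)

# In-degree ranking for data elements (how many analytics reuse them).
for name in sorted(elements, key=g.in_degree, reverse=True):
    print(g.in_degree(name), name)
```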

The zoomed-in visualization starts with a single source system (green) and its data elements (cyan). Basic function-specific analytics (red) can be performed with data from this single Clinical source system. Even advanced analytics (Text Analysis) can be applied to this single source of data to yield function-specific insights.

But data and business functions never exist in isolation. Cross-functional analytics usually emerge as users look to gain additional value by combining data from various source systems. Notice how these new analytics use data from source systems in multiple functional areas such as Claims and Membership. Such cross-functional data combination, or data coupling, can now be supported at various levels of sophistication. For instance, data can be loosely coupled for analysis with data virtualization, or, if requirements dictate, it can be tightly coupled within a relational Integrated Data Warehouse.

As shown in the graph, even advanced analytics such as Time Series and Naïve Bayes can utilize data from multiple source systems. A data platform that can loosely couple or combine data for such cross-functional advanced analytics is critical for efficiently discovering insights from new sources of data (see discovery platform). More importantly, as specific advanced analytics are eventually selected for operationalization, the data platform needs to integrate results easily and support easy access regardless of where the advanced analytics are performed.

Degree Ranking for sample Analytics from the Healthcare Industry Graph

Degree | Analytic Label
3 | How can we reduce the manual effort required to evaluate physician notes and medical records in conjunction with billing procedure codes?
10 | How can the number of complaints to Medicare be reduced in an effort to improve the overall STAR rating?
22 | What is the ratio of surgical errors to hospital patients? And total medical or surgical errors? (Provider, Payer)
47 | What providers are active in what networks and products? What is the utilization, in total, by network, and by product?
83 | What are the trends over time in utilization for patients who use certain channels?
104 | What is the cost of care PMPM, for medical, for pharmacy, and combined? How have clinical interventions impacted this cost over time?

The sample analytics listed above demonstrate varying degrees of cross-functional Degree Centrality and should be supported with varying levels of data coupling, ranging from non-coupled to loosely coupled to tightly coupled data. When Analytics with cross-functional Degree Centrality cluster together, it may indicate a need to employ tighter data coupling or data integration to drive consistency in the results being obtained. The clustering of Analytics may also indicate an emerging need for a data mart or an extension of the Integrated Data Warehouse that can be utilized by a broader audience.

In-Degree Ranking for sample Data Elements from the Healthcare Industry Graph

In-Degree | Source Element
46 | Accounts Receivable*PROVIDER BILL-Bill Payer Party Id
31 | Clinical*APPLICATION PRODUCT-Product Id
25 | Medical Claims*CLAIM-Claim Num
25 | Membership*MEMBER-Agreement Id

Similarly, if data elements start to show high Degree Centrality, it may be time to reassess whether tighter coupling is needed to drive consistency and enable broader data reuse. When the In-Degree metric is applied, data used by more Analytics appears larger on the graph and is a likely candidate for tighter coupling. To support data design for tighter coupling from a cross-functional and even a cross-industry perspective, Teradata offers reference data model blueprints by industry. (See Teradata Data Models.)

This calls for a data management ecosystem with data analytics platforms that can easily harvest this cross-functional Degree Centrality of Analytics and Data. Such a data management ecosystem would support varying degrees of data coupling, varying types of analytics, and varying types of data access based on data users. (Learn more about Teradata’s Unified Data Architecture.)

The analysis described above is exploratory and by no means a replacement for a thorough architectural assessment. Eventually the decision to employ the right degree of data coupling should rest on the full architecture requirements including but not limited to data integrity, security, or business value.

In conclusion, what our experiences have taught us in the past will still hold true for the future:
• Data sources are exponentially more valuable when combined or integrated with other data sets
• To maintain a sustained competitive advantage, businesses have to keep searching for insights that build on the cross-functional centrality of data
• Unified data management ecosystems can now harvest this cross-functional centrality of data at a lower cost with efficient support for varying levels of data integration, analytic types, and users

Contact Teradata to learn more about how Teradata technology, architecture, and industry expertise can efficiently and effectively harvest this centrality of Data and Analytics.

[1] https://hbr.org/2014/11/how-smart-connected-products-are-transforming-competition

[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).

[4] This specific industry example is illustrative and subject to the limitations of assumptions and quality of the sample data mappings used for this study.




Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.



High Level Data Analytics Graph
(Healthcare Example)


Michael Porter, in an excellent article in the November 2014 issue of the Harvard Business Review[1], points out that smart connected products are broadening competitive boundaries to encompass related products that meet a broader underlying need. Porter elaborates that the boundary shift is not only from the functionality of discrete products to cross-functionality of product systems, but in many cases expanding to a system of systems such as a smart home or smart city.

So what does all this mean from a data perspective? In that same article, Porter mentions that companies seeking leadership need to invest in capturing, coordinating, and analyzing more extensive data across multiple products and systems (including external information). The key take-away is that the movement of gaining competitive advantage by searching for cross-functional or cross-system insights from data is only going to accelerate and not slow down. Exploiting cross-functional or cross-system centrality of data better than anyone else will continue to remain critical to achieving a sustainable competitive advantage.

Understandably, as technology changes, the mechanisms and architecture used to exploit this cross-system centrality of data will evolve. Current technology trends point to a need for a data & analytic-centric approach that leverages the right tool for the right job and orchestrates these technologies to mask complexity for the end users; while also managing complexity for IT in a hybrid environment. (See this article published in Teradata Magazine.)

As businesses embrace the data & analytic-centric approach, the following types of questions will need to be addressed: How can business and IT decide on when to combine which data and to what degree? What should be the degree of data integration (tight, loose, non-coupled)? Where should the data reside and what is the best data modeling approach (full, partial, need based)? What type of analytics should be applied on what data?

Of course, to properly address these questions, an architecture assessment is called for. But for the sake of going beyond the obvious, one exploratory data point in addressing such questions could be to measure and analyze the cross-functional/cross-system centrality of data.

By treating data and analytics as a network of interconnected nodes in Gephi[2], the connectedness between data and analytics can be measured and visualized for such exploration. We can examine a statistical metric called Degree Centrality[3] which is calculated based on how well an analytic node is connected.

The high level sample data analytics graph demonstrates the cross-functional Degree Centrality of analytics from an Industry specific perspective (Healthcare). It also amplifies, from an industry perspective, the need for organizations to build an analytical ecosystem that can easily harness this cross-functional Degree Centrality of data analytics. (Learn more about Teradata’s Unified Data Architecture.)

In the second part of this blog post series we will walk through a zoomed-in view of the graph, analyze the Degree Centrality measurements for sample analytics, and draw some high-level data architecture implications.

[1] https://hbr.org/2014/11/how-smart-connected-products-are-transforming-competition

[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).


Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.

LA kicks off the 2014 Teradata User Group Season

Posted on: April 22nd, 2014 by Guest Blogger


By Rob Armstrong,  Director, Teradata Labs Customer Briefing Team

After presenting for years at the Teradata User Group meetings, it was refreshing to see some changes in this roadshow. While I had my usual spot on the agenda to present Teradata’s latest database release (15.0), we had some hot new topics, including Cloud and Hadoop; more business-level folks were there; more companies were researching Teradata’s technology (vs. just current users); and there was a hands-on workshop the following day for the more technically inclined, walking through real-world Unified Data Architecture™ (UDA) use cases from a Teradata customer. While LA tends to be a smaller venue than most, the room was packed and we had 40% more attendees than last year.

I would be remiss if I did not give a big thank-you to the partner sponsors of the user group meeting. In LA we had Hortonworks and Dot Hill as our gold and silver sponsors. I took a few minutes to chat with them and found out about some interesting upcoming items. Most notably, Lisa Sensmeier from Hortonworks talked to me about Hadoop Summit, which is coming up in San Jose, June 3-5. Jim Jonez, from Dot Hill, gave me the latest on their newest “Ultra Large” disk technology, which will pack 48 1TB drives into a single 2U rack. It is not in the Teradata lineup yet, but we are certainly intrigued for the proper use case.

Now, I’d like to take a few minutes to toot my own horn about the Teradata Database 15.0 presentation that had some very exciting elements to help change the way users get to and analyze all of their data.  You may have seen the recent news releases, but if not, here is a quick recap:

  • Teradata 15.0 continues our Unified Data Architecture with the new Teradata QueryGrid. This is the new environment for defining and accessing data from Teradata to other data servers such as Apache Hadoop (Hortonworks), the Teradata Aster Discovery Platform, Oracle, and others, and it lays the foundation for extension to even more foreign data servers. 15.0 simplifies the whole definition and usage, and adds bi-directional and predicate pushdown capabilities. In a related session, Cesar Rojas provided some good recent examples of customers taking advantage of the entire UDA ecosystem, where data from all of the Teradata offerings was used together to generate new actions.
  • The other big news in 15.0 is the inclusion of the JSON data type. This allows customers to store JSON documents directly in a column and then apply “schema on read” for much greater flexibility with greatly reduced IT resources. As the JSON documents change, no table or database changes are necessary to absorb the new content.

Keep your eyes and ears open for the next Teradata User Group event coming your way, or better yet, just go to the webpage: http://www.teradata.com/user-groups/ to see where the bus stops next and to register.  The TUGs are free of charge.  Perhaps we’ll cross paths as I make the circuit? Until then, ‘Keep Calm and Analyze On’ (as the cool kids say).

 Since joining Teradata in 1987, Rob Armstrong has worked in all areas of the data warehousing arena.  He has gone from writing and supporting the database code to implementing and managing systems at some of Teradata’s largest and most innovative customers.  Currently Rob provides sales and marketing support by traveling the globe and evangelizing the Teradata solutions.


The best Strata session that I attended was the overview Kurt Brown gave of the Netflix data platform, which contained hype-deflating lessons and many chestnuts of tech advice straight from one of the most intense computing environments on the planet.

Brown, who as a director leads the design and implementation of the data platform, had a cheerful demeanor but demonstrated ruthless judgment and keen insight in his assessment of how various technologies serve the goals of Netflix. It was interesting to me how dedicated he was to both MPP SQL technology and to Apache™ Hadoop.

I attended the session with Daniel Graham, Technical Marketing Specialist of Teradata, who spoke with me afterward about the implications of the Netflix architecture and Brown’s point of view.

SQL vs. Hadoop
Brown rejected the notion that it was possible to build a complete data platform exclusively using either SQL technology or Hadoop alone. In his presentation, Brown explained how Netflix made great use of Hadoop, used Hive for various purposes, and had an eye on Presto, but also couldn’t live without Teradata and Microstrategy as well.

Brown recalled a conversation in which another leader of a data platform explained that he was discarding all his data warehouse technology and going to put everything on Hive. Brown’s response, “Why would you ever want to do that?”

While Brown said he enjoyed the pressure that open source puts on commercial vendors to improve, he was dedicated to using whatever technology could provide answers to questions in the most cost-effective manner. Brown said he was especially pleased that Teradata was going to be able to support a cloud-based implementation that could run at scale. Brown said that Netflix had upwards of 5 petabytes of data in the cloud, all stored on Amazon S3.

After the session, I pointed out to Graham that the pattern in evidence at Netflix, and at most of the companies acknowledged as leaders in big data, mimics the recommendation of the white paper “Optimize the Value of All Your Enterprise Data,” which provides an overview of the Teradata Unified Data Architecture™.

The Unified Data Architecture recommends that the data with the most “business value density” be stored in an enterprise data warehouse powered by MPP SQL. This data is used most often by the most users. Hadoop is used as a data refinery to process flat files or NoSQL data in batch mode.

Netflix is a big data company that arrived at this pattern by adding SQL to a Hadoop infrastructure. Many well-known users of huge MPP SQL installations have added Hadoop, arriving at the same pattern from the opposite direction.

“Data doesn’t stay unstructured for long. Once you have distilled it, it usually has a structure that is well-represented by flat files,” said Teradata's Graham. “This is the way that the canonical model of most enterprise activity is stored. Then the question is: How you ask questions of that data? There are numerous ways to make this easy for users, but almost all of those ways pump out SQL that then is used to grab the data that is needed.”

Replacing MPP SQL with Hive or Presto is a non-starter because to really support hundreds or thousands of users who are pounding away at a lot of data, you need a way to provide speedy and optimized queries and also to manage the consumption of the shared resources.

“For over 35 years, Teradata has been working on making SQL work at scale for hundreds or thousands of people at a time,” said Graham. “It makes perfect sense to add SQL capability to Hadoop, but it will be a long time, perhaps a decade or more, before you will get the kind of query optimization and performance that Teradata provides. The big data companies use Teradata and other MPP SQL systems because they are the best tool for the job for making huge datasets of high business value density available to an entire company.”

Efforts such as Tez and Impala will clearly move Hive’s capability forward. The question is how far forward and how fast. We will know that victory has been achieved when Netflix, which uses Teradata in a huge cloud implementation, is able to support their analytical workloads with other technology.

Graham predicts that in 5 years, Hadoop will be a good data mart but will still have trouble with complex parallel queries.

“It is common for a product like Microstrategy to pump out SQL statements that may be 10, 20, or even 50 pages long,” said Graham. “When you have 5 tables, the complexity of the queries could be 5 factorial. With 50 tables, that grows to 50 factorial. Handling such queries is a 10- or 20-year journey. Handling them at scale is a feat that many companies can never pull off.”
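To put rough numbers on that remark (an illustrative calculation, not something from the interview), the count of possible join orderings grows factorially with the number of tables:

```python
import math

# Join-ordering counts behind the "factorial" remark: N tables can be
# joined in N! possible orders, a space the optimizer must search or prune.
for n_tables in (5, 10, 50):
    orderings = math.factorial(n_tables)
    print(n_tables, "tables:", f"{float(orderings):.2e}", "possible join orders")
```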

Graham acknowledges that most businesses need an MPP SQL data warehouse extended to support data discovery, e.g., the Teradata Aster Discovery Platform, along with extensions for using Hadoop and graph analytics through enhanced SQL.

Teradata is working to demonstrate that the power of this collection of technology can address some of the unrealistic enthusiasm surrounding Hadoop.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media


In years past, Strata has celebrated the power of raw technology, so it was interesting to note how much the keynotes on Wednesday focused on applications, models, and how to learn and change rather than on speeds and feeds.

After attending the keynotes and some fascinating sessions, it seems clear that the blinders are off. Big data and data science have been proven in practice by many innovators and early adopters. The value of new forms of data and methods of analysis are so well established that there’s no need for exaggerated claims. Hadoop can do so many cool things that it doesn’t have to pretend to do everything, now or in the future. Indeed, the pattern in place at Facebook, Netflix, the Obama Campaign, and many other organizations with muscular data science and engineering departments is that MPP SQL and Hadoop sit side by side, each doing what they do best.

In his excellent session, Kurt Brown, Director, Data Platform at Netflix, recalled someone explaining that his company was discarding its data warehouse and putting everything on Hive. Brown responded, “Why would you want to do that?” What was obvious to Brown, and what he explained at length, is that the most important thing any company can do is assemble technologies and methods that serve its business needs. Brown demonstrated the logic of creating a broad portfolio that serves many different purposes.

Real Value for Real People
The keynotes almost all celebrated applications and models. Vendors didn’t talk about raw power, but about specific use cases and ease-of-use. Farrah Bostic, a marketing and product design consultant, recommended ways to challenge assumptions and create real customer intimacy. This was a key theme: Use the data to understand a person in their terms not yours. Bostic says you will be more successful if you focus on creating value for the real people who are your customers instead of extracting value from some stilted and limited model of a consumer. A skateboarding expert and a sports journalist each explained models and practices for improving performance. This is a long way from the days when a keynote would show a computer chewing through a trillion records.

Geoffrey Moore, the technology and business philosopher, was in true provocative form. He asserted that big data and data science are well on their way to crossing the chasm because so many upstarts pose existential threats to established businesses. This pressure will force big data to cross the chasm and achieve mass adoption. His money quote: "Without big data analytics, companies are blind and deaf, wandering out onto the Web like deer on the freeway.”

An excellent quote to be sure, but it goes too far. Moore would have been more accurate and less sensational if he said, “Without analytics,” not “Without big data analytics.” The reason that MPP SQL and Hadoop have made such a perfect pair is because more than one type of data and method of analysis is needed. Every business needs all the relevant data it can get to understand the people it does business with.

The Differentiator: A Culture of Analytics
The challenge I see companies facing lies in creating a culture of analytics. Tom Davenport has been a leader in promoting analytics as a means to competitive advantage. In his keynote at Strata Rx in September 2013, Davenport stressed the importance of integration.

In his session at Strata this year, Bill Franks, Chief Analytics Officer at Teradata, put it quite simply, "Big data must be an extension of an existing analytics strategy. It is an illusion that big data can make you an analytics company."

When people return from Strata and roll up their sleeves to get to work, I suspect that many will realize that it’s vital to make use of all the data in every way possible. But one person can only do so much. For data to have the biggest impact, people must want to use it. Implementing any type of analytics provides supply. Leadership and culture create demand. Companies like CapitalOne and Netflix don’t do anything without looking at the data.

I wish there were a shortcut to creating a culture of analytics, but there isn’t, and that’s why it’s such a differentiator. Davenport’s writings are probably the best guide, but every company must figure this out based on its unique situation.

Supporting a Culture of Analytics
If you are a CEO, your job is to create a culture of analytics so that you don’t end up like Geoffrey Moore’s deer on the freeway. But if you have Kurt Brown’s job, you must create a way to use all the data you have, to use the sweet spot of each technology to best effect, and to provide data and analytics to everyone who wants them.

At a company like Netflix or Facebook, creating such a data supply chain is a matter of solving many unique problems connected with scale and advanced analytics. But for most companies, common patterns can combine all the modern capabilities into a coherent whole.

I’ve been spending a lot of time with the thought leaders at Teradata lately and closely studying their Unified Data Architecture. Anyone who is seeking to create a comprehensive data and analytics supply chain of the sort in use at leading companies like Netflix should be able to find inspiration in the UDA, as described in a white paper called “Optimizing the Business Value of All Your Enterprise Data.”

The paper does excellent work in creating a framework for data processing and analytics that unifies all the capabilities by describing four use cases: the file system, batch processing, data discovery, and the enterprise data warehouse. Each of these use cases focuses on extracting value from different types of data and serving different types of users. The paper proposes a framework for understanding how each use case creates data with different business value density. The highest volume interaction takes place with data of the highest business value density. For most companies, this is the enterprise data warehouse, which contains a detailed model of all business operations that is used by hundreds or thousands of people. The data discovery platform is used to explore new questions and extend that model. Batch processing and processing of data in a file system extract valuable signals that can be used for discovery and in the model of the business.

While this structure doesn’t map exactly to that of Netflix or Facebook, for most businesses, it supports the most important food groups of data and analytics and shows how they work together.

The refreshing part of Strata this year is that thorny problems of culture and context are starting to take center stage. While Strata will always be chock full of speeds and feeds, it is even more interesting now that new questions are driving the agenda.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

Open Your Mind to all the Data

Posted on: February 14th, 2014 by Guest Blogger


In technology discussions, we often forget who the big winners are. When you look at almost any technology market, if a vendor makes a certain amount, the systems integrator may make 4 or 5 times that amount selling services to configure and adapt the solution. Of course, the biggest winner is the company using the technology, with a return of 10X, 20X, or even 100X. This year, I’m attending Strata and trying to understand how to hunt down and capture that 100X return.

I’m not alone in this quest. In the past Strata was mostly a technology conference, focused on vendor or systems integrator return. But this year, it is clear that the organizers realized they must find a way to focus on the 100X return, as evidenced by the Data-driven Business track. If Strata becomes an event where people can identify and celebrate the pathway to the 100X return, attendance will skyrocket.

Most of the time the 100X return comes from using data that applies only to a specific business. If you realize those kinds of returns, are you really going to come to Strata to talk about it? I don’t think so. So we won’t find sessions that provide a complete answer for your business; you will only get parts. The challenge when attending a conference like Strata is how to find all the parts, put them together, and come home with ideas for getting that 100X return.

My answer is to focus on questions, then on data, and then on technology that can find answers to the questions. There are many ways to run this playbook, but here’s one way to make it work.

Questions Achieve Focus
The problem that anyone attending Strata or any large conference faces is the huge array of vendors and sessions. One way to guide exploration is through curiosity. This maintains enthusiasm during the conference but may not give you a 100X idea. Another way is to begin with a predetermined problem in mind. This is a good approach, but it may cut off avenues that lead to a 100X result.

Remember: 100X results are usually found through experimentation. Everything with an obvious huge return is usually already being done. You have to find a way to be open to new ideas that are focused on business value. Here’s one way I’ve thought of, called the Question Game.

The Question Game is a method for aligning technology with the needs of a business. It is also a great way to organize a trip to a conference like Strata, where the ratio of information to marketing spin is high.

I came up with the Question Game while reading Walter Isaacson’s biography of Steve Jobs. Two things struck me. First was the way that Jobs and Tim Cook cared a lot about focus. Jobs held an annual offsite attended by about 100 of Apple’s top people. At the end of the session, Jobs would write up the 10 most important initiatives and then cross out 7 so the team could focus on executing just 3. Both Jobs and Cook were proud of saying no to ideas that detracted from their focus.

Here’s how Tim Cook describes it: “We are the most focused company that I know of, or have read of, or have any knowledge of. We say no to good ideas every day. We say no to great ideas in order to keep the amount of things we focus on very small in number, so that we can put enormous energy behind the ones we do choose, so that we can deliver the best products in the world.”

Because I focus most of my time on figuring out how technology leaders can make magic in business, I was eager to find a way to empower people to say no. At the same time, I wanted room for invention and innovation. Is it possible to explore all the technology and data out there in an efficient way focused on business needs? Yes, using the Question Game. Here’s how it works:

  • The CIO, CTO, or whoever is leading innovation surveys each business leader, asking for questions they would like to have answered and the business value of answering them
  • Questions are then ranked by value, whether monetary or otherwise

The Question Game provides a clear way for the business to express its desires. With these questions in mind, Strata becomes a hunting ground far more likely to yield a path to 100X ideas. To prepare for this hunt, list all the questions and look at the agenda to find relevant sessions.

It’s the Data, All the Data
With your questions in hand, keep reminding yourself that all the technology in the world won’t answer the questions on your list. Usually, your most valuable questions will be answered by acquiring or analyzing data. The most important thing you can do is look for sources of data to answer the high value questions.

The first place to look for data is inside your company. In most companies, it is not easy to find all the available data and even harder to find additional data that could be available with a modest effort.

Too often the search for data stops inside the company. As I pointed out in “Do You Suffer from the Data Not Invented Here Syndrome?” many companies have blinders to external data sources, which can be an excellent way to shed light on high value questions. Someone should be looking for external data as well.

Remember, it is vital to keep an open mind. The important thing is to find data that can answer high value questions, not to find any particular type of data. Strata is an excellent place to do this. Some sessions shed light on particular types of data, but I’ve found that working the room, asking people what kinds of data they use, and showing them your questions can lead to great ideas.

Finding Relevant Technology
Once you have an idea about the important questions and the relevant data, you will be empowered to focus. When you tour the vendors, you can say no with confidence when a technology doesn’t have a hope of processing the relevant data to answer a high value question. Of course, it is impossible to tell from a session description or a vendor’s website if they will be relevant to a specific question. But by having the questions in mind and showing them to people you talk to, you will greatly speed the path to finding what you need. When others see your questions, they will have suggestions you didn’t expect.

Will this method guarantee you a 100X solution every time you attend Strata? I wouldn’t make that claim. But by following this plan, you have a much better chance for a victory.

While I'm at Strata, I'll be playing the question game so I can quickly and effectively learn a huge amount about the fastest path to a data-driven business.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

To Learn More:

Optimize the Business Value of All Your Enterprise Data (white paper)
Bring Dark Data into the Light (infographic)
Benefit from Any and All Data (article)
To Succeed with Big Data, Begin with the Decision in Mind (article)