Why Should Big Data Matter to You?

Posted on: September 15th, 2015 by Marc Clark


With all the attention given to big data, it is no surprise that more companies feel pressure to explore the possibilities for themselves. The challenge for many has been the high barriers to entry. Put simply, big data has cost big bucks. Maybe even more perplexing has been uncertainty about just what big data might deliver for a given company. How do you know if big data matters to your business?

The advent of cloud-based data warehouse and analytics systems can eliminate much of that uncertainty. For the first time, it is possible to explore the value proposition of big data without the danger of drowning the business in the costs and expertise needed to get big data infrastructure up and running.


Subscription-based models replace the need to purchase expensive hardware and software with the possibility of a one-stop-shopping experience where everything—from data integration and modeling tools to security, maintenance and support—is available as a service. Best of all, the cloud makes it feasible to evaluate big data regardless of whether your infrastructure is large and well-established with a robust data warehouse, or virtually nonexistent and dependent on numerous Excel worksheets for analysis.

Relying on a cloud analytics solution to get you started lets your company test use cases, find what works best, and grow at its own pace.

Why Big Data May Matter

Without the risk and commitment of building out your own big data infrastructure, your organization is free to explore the more fundamental question of how your data can influence your business. To figure out if big data analytics matters to you, ask yourself and your company a few questions:

  • Are you able to take advantage of the data available to you in tangible ways that impact your business?
  • Can you get answers quickly to questions about your business?
  • Is your current data environment well integrated, or a convoluted and expensive headache?

For many organizations, the answer to one or more of these questions is almost certainly a sore point. This is where cloud analytics offers alternatives, giving you the opportunity to galvanize operations around data instead of treating data and your day-to-day business as two separate things. The ultimate promise of big data is not one massive insight that changes everything. The goal is to create a ceaseless conveyor belt of insights that impact decisions, strategies, and practices up, down, and across the operational matrix.

The Agile Philosophy for Cloud Analytics

We use the word agile a lot, and cloud analytics embraces that philosophy in important new ways. In the past, companies invested a lot of time, effort, and money building infrastructure to integrate their data and create models, only to find themselves trapped in an environment that doesn't suit their requirements.

Cloud analytics provides a significant new path: a manageable approach that enables companies to get to the important questions without bogging down in technology, and to really figure out what value is lurking in their data and what its impact might be.

To learn more, download our free Enterprise Analytics in the Cloud eBook.

Big Data Success Starts With Empowerment: Learn Why and How

Posted on: September 1st, 2015 by Chris Twogood


As my colleague Bill Franks recently pointed out on his blog, there is often the perception that being data-driven is all about technology. While technology is indeed important, being data-driven actually spans a lot of different areas, including people, big data processes, access, a data-driven culture and more. In order to be successful with big data and analytics, companies need to fundamentally embed it into their DNA.

To be blunt, that level of commitment simply must stem from the top rungs of any organization. This was evident when Teradata recently surveyed 316 senior data and IT executives. The commitment to big data was far more apparent at companies where CEOs personally focus on big data initiatives, as over half of those respondents indicated it as the single most important way to gain a competitive advantage.

Indeed, industries with the most competitive environments are the ones leading the analytics push. These companies simply must find improvements, even if the needle moves only in the single digits on things like operational costs and revenue.

Those improvements don’t happen without proper leadership, especially since a data-driven focus impacts just about all facets of the business -- from experimentation to decision-making to rewarding employees. Employees must have access to big data, feel empowered to apply it, and be confident in their data-driven decisions.

In organizations where being data-driven isn’t embedded in the DNA, someone may make a decision and attempt to leverage a little data. But if they don’t feel empowered by the data’s prospects and aren’t confident in the data, they will spend a lot of cycles seeking validation: making sure they have the right data, that the data is accurate, that they are actually making the right decision based on it, and that they will be backed up once that decision is made.

There is a lot of nuance to being data-driven, of course. While all data has value, there are many levels to that value – the challenge generally lies in recognizing that value and extracting it. Our survey confirmed, for instance, just how hot location data is right now, as organizations work to understand the navigation of their customers in order to deliver relevant communication.

Other applications of data, according to the survey, include the creation of new business models, the discovery of new product offers, and the monetization of data to external companies. But that’s just the tip of the iceberg. Healthcare, for example, is an up-and-coming industry with regard to data usage. One example is better understanding the path to surgery -- breaking down the four or five steps most important to achieving a better patient outcome.

But whether you’re working in a hospital or a hot startup, and working to carve out more market share or improve outcomes for patients, the fundamentals we’ve been discussing here remain the same. Users must be empowered and confident in order to truly be data-driven -- and they’re not going to feel that way unless those at the top are leading the way.



By Imad Birouty, Teradata Product Marketing Manager

In-memory database processing is a hot topic in the market today. It promises to bring high performance to OLTP and Data Warehouse environments.  As such, many vendors are working hard to develop in-memory database technology.

Memory is fast, but still expensive when compared to disk storage. As such, it should be treated as a precious resource and used wisely for the best return on your investment.

Teradata Intelligent Memory does just that. Through advanced engineering techniques, the Teradata Database automatically places the most frequently accessed data in memory, delivering in-memory performance with the cost economics of disk storage. The 80/20 rule and proven real-world data warehouse usage patterns show that a small percentage of the data accounts for the vast majority of data access. Teradata Database’s unique multi-temperature data management infrastructure leverages this by keeping only the most frequently used data in memory, achieving in-memory performance for the entire database. This is cutting-edge technology and does not require a separate dedicated in-memory database to manage. And because it's built into the Teradata Database, companies get the scalability, manageability, and robust features associated with the Teradata Database.
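The mechanics of Teradata's implementation are proprietary, but the 80/20 idea itself is easy to illustrate. Below is a minimal Python sketch (not Teradata's actual placement algorithm; the block sizes and access log are hypothetical) of ranking data blocks by access frequency and keeping only the hottest ones within a memory budget:

```python
# Minimal sketch of the 80/20 idea: rank data blocks by access frequency and
# keep the hottest blocks in memory until a budget is exhausted.
# Illustration only, not Teradata's actual placement algorithm.
from collections import Counter

BLOCK_SIZE = 1  # hypothetical: treat every block as the same size

def plan_memory_placement(access_log, memory_budget):
    freq = Counter(access_log)                    # block_id -> access count
    in_memory, used = set(), 0
    for block_id, _count in freq.most_common():   # hottest blocks first
        if used + BLOCK_SIZE > memory_budget:
            break
        in_memory.add(block_id)
        used += BLOCK_SIZE
    return in_memory

# A skewed access pattern: block "A" accounts for most of the reads.
log = ["A"] * 80 + ["B"] * 10 + ["C"] * 5 + ["D"] * 5
print(plan_memory_placement(log, memory_budget=2))   # {'A', 'B'}
```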

Forrester Research just released its inaugural Wave dedicated to in-memory databases, The Forrester Wave™: In-Memory Database Platforms, Q3 2015, naming Teradata a leader. Teradata has always been a pioneer in scalable, disk-based, shared-nothing RDBMS. Because it has continued to evolve, change, and incorporate the latest technologies, the Teradata Database is now a leader in in-memory database processing too.

While the Forrester Wave evaluated Teradata Database 15.0, we are even more excited about Teradata Database 15.10, which uses even more advanced in-memory techniques. New in-memory accelerators such as pipelining, vectorization, bulk qualification, and columnar storage are integrated into the Teradata Database and bring in-memory performance to all data in the warehouse, including multi-structured data types such as JSON and weblogs that are associated with Big Data.

A free copy of the Forrester Wave report is available here, as well as today’s news release here. 

We’ll be announcing availability of Teradata Database 15.10 in a few weeks, so look for that announcement.


Why We Love Presto

Posted on: June 24th, 2015 by Daniel Abadi


Concurrent with acquiring Hadoop companies Hadapt and Revelytix last year, Teradata opened the Teradata Center for Hadoop in Boston. Teradata recently announced that a major new initiative of this Hadoop development center will include open-source contributions to a distributed SQL query engine called Presto. Presto was originally developed at Facebook, and is designed to run high performance, interactive queries against Big Data wherever it may live --- Hadoop, Cassandra, or traditional relational database systems.
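For a sense of what that looks like in practice, here is a small sketch using the open-source presto-python-client; the host, catalogs, schemas, and table names are hypothetical placeholders, and the cross-catalog join simply illustrates querying data where it lives:

```python
# Sketch: submitting an interactive Presto query from Python with the
# open-source presto-python-client (pip install presto-python-client).
# Host, catalogs, schemas, and table names are hypothetical placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",      # default catalog for unqualified table names
    schema="default",
)
cur = conn.cursor()

# One query spanning two storage systems: a Hive table on HDFS and a customer
# table in a relational database exposed through a second catalog.
cur.execute("""
    SELECT c.region, count(*) AS clicks
    FROM hive.web.clickstream AS e
    JOIN mysql.crm.customers AS c ON e.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY clicks DESC
""")
for region, clicks in cur.fetchall():
    print(region, clicks)
```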

Among those who will be part of this initiative and contributing code to Presto is a subset of the Hadapt team that joined Teradata last year. In the following, we will dive deeper into the thinking behind this new initiative from the perspective of the Hadapt team. It is important to note upfront that Teradata’s interest in Presto, and the people contributing to the Presto codebase, extend beyond the Hadapt team that joined Teradata last year. Nonetheless, it is worthwhile to understand the technical reasoning behind Teradata’s embrace of Presto, even if it presents a localized view of the overall initiative.

Around seven years ago, Ashish Thusoo and his team at Facebook built the first SQL layer over Hadoop as part of a project called Hive. At its essence, Hive was a query translation layer over Hadoop: it received queries in a SQL-like language called Hive-QL, and transformed them into a set of MapReduce jobs over data stored in HDFS on a Hadoop cluster. Hive was truly the first project of its kind. However, since its focus was on query translation into the existing MapReduce query execution engine of Hadoop, it achieved tremendous scalability, but poor efficiency and performance, and ultimately led to a series of subsequent SQL-on-Hadoop solutions that claimed 100X speed-ups over Hive.

Hadapt was the first such SQL-on-Hadoop solution that claimed a 100X speed-up over Hive on certain types of queries. Hadapt was spun out of the HadoopDB research project from my team at Yale and was founded by a group of Yale graduates. The basic idea was to develop a hybrid system that is able to achieve the fault-tolerant scalability of the Hive MapReduce query execution engine while leveraging techniques from the parallel database system community to achieve high performance query processing.

The intention of HadoopDB/Hadapt was never to build its own query execution layer. The first version of Hadapt used a combination of PostgreSQL and MapReduce for distributed query execution. In particular, the query operators that could be run locally, without reliance on data located on other nodes in the cluster, were run using PostgreSQL’s query operator set (although Hadapt was written such that PostgreSQL could be replaced by any performant single-node database system). Meanwhile, query operators that required data exchange between multiple nodes in the cluster were run using Hadoop’s MapReduce engine.

Although Hadapt was 100X faster than Hive for long, complicated queries that involved hundreds of nodes, its reliance on Hadoop MapReduce for parts of query execution precluded sub-second response time for small, simple queries. Therefore, in 2012, Hadapt started to build a secondary query execution engine called “IQ” which was intended to be used for smaller queries. The idea was that all queries would be fed through a query-analyzer layer before execution. If a query was predicted to be long and complex, it would be fed through Hadapt’s original fault-tolerant MapReduce-based engine; if it was predicted to complete in a few seconds or less, it would be fed to the IQ execution engine.
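A minimal sketch of that routing idea (the cost model, threshold, and engine names below are hypothetical, not Hadapt's actual analyzer):

```python
# Sketch of a query-analyzer routing layer: estimate a query's runtime, then
# pick an execution engine. The cost model and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class QueryPlan:
    scanned_rows: int      # estimated rows read
    join_count: int        # distributed joins in the plan
    nodes_involved: int    # cluster nodes touched

def estimate_runtime_seconds(plan: QueryPlan) -> float:
    # Toy cost model: row volume dominates; joins and node fan-out add overhead.
    return (plan.scanned_rows / 5_000_000) * (1 + plan.join_count) * (1 + plan.nodes_involved / 100)

def route(plan: QueryPlan) -> str:
    # Short, simple queries go to the interactive engine; long, complex queries
    # go to the fault-tolerant MapReduce-based engine.
    return "interactive-engine" if estimate_runtime_seconds(plan) <= 5.0 else "mapreduce-engine"

print(route(QueryPlan(scanned_rows=1_000_000, join_count=1, nodes_involved=10)))     # interactive-engine
print(route(QueryPlan(scanned_rows=500_000_000, join_count=4, nodes_involved=200)))  # mapreduce-engine
```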

In 2013 Hadapt integrated IQ with Apache Tez in order to avoid redundant programming efforts, since the primary goals of IQ and Tez were aligned. In particular, Tez was designed as an alternative to MapReduce that can achieve interactive performance for general data processing applications. Indeed, Hadapt was able to achieve interactive performance on a much wider range of queries with Tez than it had previously.

Figure 1: Intertwined Histories of SQL-on-Hadoop Technology

Unfortunately Tez was not quite a perfect fit as a query execution engine for Hadapt’s needs. The largest issue was that before shipping data over the network during distributed operators, Tez first writes this data to local disk. The overhead of writing this data to disk (especially when the size of the intermediate result set was large) precluded interactivity for a non-trivial subset of Hadapt’s query workload. A second problem is that the Hive query operators that are implemented over Tez use (by default) traditional Volcano-style row-by-row iteration. In other words, a single function-invocation for a query operator would process just a single database record. This resulted in a larger number of function calls required to process a large dataset, and poor instruction cache locality as the instructions associated with a particular operator were repeatedly reloaded into the instruction cache for each function invocation. Although Hive and Tez have started to alleviate this issue with the recent introduction of vectorized operators, Hadapt still found that query plans involving joins or SQL functions would fall back to row-by-row iteration.
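To make the contrast concrete, here is a toy Python illustration (real engines do this in Java or C++ over columnar batches) of Volcano-style row-at-a-time iteration versus batch-at-a-time processing for a simple filter and projection:

```python
# Toy contrast between row-at-a-time (Volcano-style) iteration and vectorized
# (batch-at-a-time) execution for a simple filter and projection.
rows = [{"price": p, "qty": q} for p, q in zip(range(1000), range(1000))]

def volcano_filter_project(rows):
    # One operator invocation per record: many function calls, poor cache locality.
    out = []
    for row in rows:
        if row["price"] > 500:
            out.append(row["price"] * row["qty"])
    return out

def vectorized_filter_project(prices, qtys, batch_size=256):
    # One operator invocation per batch of values: far fewer calls, better locality.
    out = []
    for start in range(0, len(prices), batch_size):
        p_batch = prices[start:start + batch_size]
        q_batch = qtys[start:start + batch_size]
        out.extend(p * q for p, q in zip(p_batch, q_batch) if p > 500)
    return out

prices = [r["price"] for r in rows]
qtys = [r["qty"] for r in rows]
assert volcano_filter_project(rows) == vectorized_filter_project(prices, qtys)
```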

The Hadapt team therefore decided to refocus its query execution strategy (for the interactive query part of Hadapt’s engine) on Presto, which presented several advantages over Tez. First, Presto pipelines data between distributed query operators directly, without writing to local disk, significantly improving performance for network-intensive queries. Second, Presto query operators are vectorized by default, thereby improving CPU efficiency and instruction cache locality. Third, Presto dynamically compiles selective query operators to byte code, which lets the JVM optimize and generate native machine code. Fourth, it uses direct memory management, avoiding Java object allocations, their heap memory overhead, and garbage collection pauses. Overall, Presto is a very advanced piece of software, and very much in line with Hadapt’s goal of leveraging as many techniques from modern parallel database system architecture as possible.

The Teradata Center for Hadoop has thus fully embraced Presto as the core part of its technology strategy for the execution of interactive queries over Hadoop. Consequently, it made logical sense for Teradata to take its involvement in Presto to the next level. Furthermore, Hadoop is fundamentally an open source project, and in order to become a significant player in the Hadoop ecosystem, Teradata needs to contribute meaningful and important code to the open source community. Teradata’s recent acquisition of Think Big serves as further motivation for such contributions.

Therefore Teradata has announced that it is committed to making open source contributions to Presto, and has allocated substantial resources to doing so. Presto is already used by Silicon Valley stalwarts Facebook, AirBnB, NetFlix, DropBox, and Groupon. However, Presto’s enterprise adoption outside of Silicon Valley remains small. Part of the reason for this is that the ease-of-use and enterprise features typically associated with modern commercial database systems are not fully available with Presto. Missing are an out-of-the-box, simple-to-use installer, database monitoring and administration tools, and third-party integrations. Therefore, Teradata’s initial contributions will focus on these areas, with the goal of bridging the gap to getting Presto widely deployed in traditional enterprise applications. This will hopefully lead to more contributors and momentum for Presto.

For now, Teradata’s new commitments to open source contributions in the Hadoop ecosystem are focused on Presto. Teradata’s commitment to Presto and its commitment to making meaningful contributions to an open source project is an exciting development. It will likely have a significant impact on enterprise-adoption of Presto. Hopefully, Presto will become a widely used open source parallel query execution engine --- not just within the Hadoop community, but due to the generality of its design and its storage layer agnosticism, for relational data stored anywhere.


Learn more or download Presto now.


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). Follow Daniel on Twitter @Daniel_Abadi


I recently participated in a business analytics project for non-profits that, as the planning progressed, seemed like a perfect opportunity to implement an agile approach, except that the work was to be completed in two days! But all the developers would be co-located. We had three objectives that fit the profile of user stories: we would cleanse, analyze, and report on the data and, hopefully, discover some insights. We would have the business stakeholders in the room with us the whole time. But doing all this in two days seemed like agile on steroids to me. And it reminded me of an old Steven Wright joke, “I put instant coffee in the microwave and almost went back in time!”

So, if you put agile on steroids, can you go back in time? Well, maybe not, but we did accomplish a lot in those two days! The project was a DataDive, a collaboration between the non-profit DataKind and Teradata, held during the two days before the Teradata Partners 2014 conference.

I was a Data Ambassador paired with another Data Ambassador to work with a non-governmental organization (NGO) to prepare for the DataDive and make sure we reached our goals. The NGO that DataKind assigned us to was iCouldBe, an organization that provides online mentoring to at-risk kids at over 250 schools in the U.S. Since I am not a data scientist or analyst, I found my role to be gathering requirements from the business stakeholders at iCouldBe. I worked with them to prioritize the requirements and identify the expected business value. Sounds like the product owner role in “Scrum” -- right? My partner Data Ambassador worked with the head of IT at iCouldBe to identify the data we needed and worked to prepare it for the DataDive. This is similar to a Scrum project, where preparatory work must be completed to be ready for the first sprint.

DataKind wanted us to identify the tasks to accomplish each user story, so I immediately thought about using a task board for the actual DataDive. I created one ahead of time in Excel that identified the tasks for each user story as well as the development and handoff phases for each story. I didn’t realize it at the time, but I was creating a Kanban board (a portion of the board is shown in the picture) that allowed us to track workflow.

Once I got to the DataDive, I recreated the Kanban board using flip chart paper and used sticky notes for the tasks, much the way it might be done for a real project. The user stories were listed in priority order from top to bottom. The tasks represented the metrics, dimensions, text and other analysis required to address the user stories. Some tasks supported multiple user stories, so we noted those and used that “re-use” to help prioritize. We placed these reusable tasks at the top of the board in the swimlane with the highest priority user story. (DataDive Kanban Board - Partial Workflow)


For example, the number of posts and the words per post that mentors and mentees made in the online mentoring program were important metrics that iCouldBe wanted to calculate to help identify successful mentee completion of the program. Are mentees who write more posts and more words per post more likely to complete the program? This question addresses the first user story. But the number of posts and words per post can also be used to analyze the amount of engagement between mentors and their mentees and what areas of the curriculum need to be improved.
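A small sketch of how those two metrics might be computed; the post records and field names are hypothetical, not iCouldBe's actual schema:

```python
# Sketch: posts per mentee and average words per post, the two engagement
# metrics discussed above. Records and field names are hypothetical.
from collections import defaultdict

posts = [
    {"mentee_id": "m1", "text": "I finished my essay draft today"},
    {"mentee_id": "m1", "text": "Thanks for the feedback"},
    {"mentee_id": "m2", "text": "ok"},
]

def engagement_metrics(posts):
    counts = defaultdict(int)   # mentee_id -> number of posts
    words = defaultdict(int)    # mentee_id -> total words written
    for post in posts:
        counts[post["mentee_id"]] += 1
        words[post["mentee_id"]] += len(post["text"].split())
    return {
        mentee: {"posts": counts[mentee],
                 "avg_words_per_post": words[mentee] / counts[mentee]}
        for mentee in counts
    }

print(engagement_metrics(posts))
# {'m1': {'posts': 2, 'avg_words_per_post': 5.0}, 'm2': {'posts': 1, 'avg_words_per_post': 1.0}}
```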

As the volunteers arrived, they chose tasks, focusing on the high priority tasks first, wrote their name on the sticky notes, and moved the note to the first development column, which was to review the available data.

At different times during the day, DataKind asked each team to review what they had done so far and what they planned on doing next, similar to the daily standup in Scrum (and we actually did stand).

As the DataDive progressed to day two, only tasks for user stories 1 and 2 progressed across the board, but I reminded the team that some of the tasks we completed for the first two user stories also helped address the third user story. At the end of the DataDive, to better visually show this, I moved some of the sticky notes from user story 1 into the user story 3 swimlane. This way, we could show the business stakeholders from iCouldBe that, although we focused on the higher priority user stories 1 and 2, we had also partially addressed user story 3.

Although this project did not check all the boxes of a standard agile implementation, it served as a great opportunity for me to put some agile practices in motion in a real project and learn from it. One of the most important aspects was the close collaboration between the developers and stakeholders. It was great to see how thrilled the stakeholders were with the work we had accomplished in just two days!

While I wish I could go back in time and do the DataDive all over again, as it was a great personal experience for me, instead I’ll look to the future and apply what I’ve learned from this project to my next agile project.

Elisia Getts is a Sr. Product Manager, Certified Scrum Master (CSM), and member of the Teradata Agile COE. She has been with Teradata for 15 years and has over 25 years of experience in IT as a product manager, business/IT consultant, programmer/analyst, and technical writer supporting industries such as travel and hospitality, transportation and logistics, and defense. She is the team’s expert on Scrum.

Your Big Data Initiative may not Require Logical Modeling

Posted on: May 12th, 2015 by Guest Blogger


By: Don Tonner

Logical modeling may not be required on your next big data initiative. From experience, I know that when building things from scratch, a model reduces development costs, improves quality, and gets me to market quicker. So why would I say you may not require logical modeling?

Most data modelers are employed in forward engineering activities in which the ultimate goal is to create a database or an application used by companies to manage their businesses.  The process is generally:

  • Obtain an understanding of the business concepts that the database will serve.
  • Organize the business information into structured data components and constraints—a logical model.
  • Create data stores based on the logical model and let the data population and manipulation begin.

Forward engineering is the act of going from requirements to a finished product. For databases that means starting with a detailed understanding of the information of the business, which is found largely in the minds and practices of the employees of the enterprise. This detailed understanding may be thought of as a conceptual model. Various methods have evolved to document this conceptual richness; one example is the Object Role Model.

The conceptual model (detailed understanding of the enterprise; not to be confused with a conceptual high level E/R diagram) is transformed into a logical data model, which organizes data into structures upon which relational algebra may be performed. The thinking here is very mathematical. Data can be manipulated mathematically the same way we can manipulate anything else mathematically. Just like you may write an equation that expresses how much material it might take for a 3D printer to create a lamp, you may write an equation to show the difference between the employee populations of two different corporate regions.
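To make the second example concrete, here is a toy Python version of that "employee populations of two regions" calculation, with hypothetical employee records; the point is simply that relational selection and aggregation are ordinary, well-defined operations:

```python
# Toy version of "the difference between the employee populations of two
# corporate regions": relational selection and aggregation as plain computation.
# Employee records and region names are hypothetical.
employees = [
    {"emp_id": 1, "region": "East"},
    {"emp_id": 2, "region": "East"},
    {"emp_id": 3, "region": "West"},
    {"emp_id": 4, "region": "East"},
]

def headcount(employees, region):
    # Selection (filter on region) followed by aggregation (count).
    return sum(1 for e in employees if e["region"] == region)

difference = headcount(employees, "East") - headcount(employees, "West")
print(difference)   # 3 - 1 = 2
```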

The image that most of us have of a data model is not equations, variables or valid operations, but is the visual representation of the structures that represent the variables. Below you can see structures as well as relationships which are a kind of constraint.

Data Structures and Relationships

Ultimately these structures and constraints will be converted into data stores, such as tables, columns, indexes and data types, which will be populated with data that may be constrained by some business rules.

Massively parallel data storage architectures are becoming increasingly popular as they address the challenges of storing and manipulating almost unimaginable amounts of data. The ability to ingest data quickly is critical as volumes increase. One approach is to receive the data without prior verification of the structure; HDFS files or JSON data types are examples of storage that do not require knowledge of the structure prior to loading.

OK, imagine a project where millions of readings from hundreds of sensors from scores of machines are collected every shift, possibly into a data lake. Engineers discover that certain analytics performed on the machine data can potentially alert us to conditions that may warrant operator intervention. Data scientists will create several analytic metrics based on hourly aggregates of the sensor data. What’s the modeler’s role in all this?
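Before getting to the modeler's role, here is a minimal sketch of the pipeline the data scientists describe: JSON sensor readings ingested without a predeclared schema, then rolled up into hourly aggregates per machine and sensor (field names and values are hypothetical):

```python
# Sketch: schema-on-read ingestion of raw JSON sensor readings, followed by an
# hourly aggregate per machine and sensor. Field names and values are hypothetical.
import json
from collections import defaultdict
from statistics import mean

raw_lines = [
    '{"machine": "M7", "sensor": "temp", "ts": "2015-05-12T10:05:00", "value": 71.5}',
    '{"machine": "M7", "sensor": "temp", "ts": "2015-05-12T10:40:00", "value": 74.5}',
    '{"machine": "M7", "sensor": "temp", "ts": "2015-05-12T11:10:00", "value": 90.3}',
]

# Ingest without verifying structure up front: each line is simply parsed JSON.
readings = [json.loads(line) for line in raw_lines]

# Hourly rollup: truncate the timestamp to the hour and bucket the values.
buckets = defaultdict(list)
for r in readings:
    hour = r["ts"][:13]                    # "2015-05-12T10"
    buckets[(r["machine"], r["sensor"], hour)].append(r["value"])

hourly = {key: {"avg": mean(vals), "max": max(vals)} for key, vals in buckets.items()}
print(hourly[("M7", "temp", "2015-05-12T10")])   # {'avg': 73.0, 'max': 74.5}
```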

The models you are going to use on your big data initiative likely already exist.  All you have to do is find them.

One approach would be to reverse engineer a model of the structures of the big data, which can provide visual clues to the meaning of the data. Keep in mind that big data sources may have rapidly changing schemas, so reverse engineering may have to occur periodically on the same source to gather potential new attributes. Also remember that a database of any kind is an imperfect representation of the logical model, which is itself an imperfect representation of the business. So there is much interpretation required to go from the reverse engineered model to a business understanding of the data.

One would also start by reviewing an enterprise data model or the forward-engineered data warehouse model. After all, while the big data analytic can point out which engines are experiencing conditions that need attention, when you can match those engine analytics to the workload that day, the experience level of the operator, and the time since the last maintenance, you greatly expand the value of that analytic.

So how do you combine data from disparate platforms? A logical modeler in a forward engineering environment assures that all the common things have the same identifiers and data types, and this is built into the system. That same skill set needs to be leveraged if there is going to be any success performing cross-platform analytics. The identifiers of the same things on the different platforms need to be cross-validated in order to make apples-to-apples comparisons. If analytics are going to be captured and stored in the existing Equipment Scores section of the warehouse, the data will need to be transformed to the appropriate identifiers and data types. If the data is going to be joined on the fly via Teradata QueryGrid™, knowledge of these IDs and data types is essential for success and performance.
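A small sketch of that cross-validation step, with hypothetical engine identifiers and column names, shown as plain Python rather than any particular platform's tooling:

```python
# Sketch: cross-validating identifiers and data types before combining engine
# analytics from a data lake with maintenance history from the warehouse.
# IDs, column names, and values are hypothetical.
lake_scores = [
    {"engine_id": "e-0042", "anomaly_score": "0.87"},   # id as prefixed string, score as text
    {"engine_id": "e-0077", "anomaly_score": "0.12"},
]
warehouse_maintenance = [
    {"ENGINE_ID": 42, "DAYS_SINCE_MAINT": 210},          # id as an integer key
    {"ENGINE_ID": 77, "DAYS_SINCE_MAINT": 15},
]

def normalize_lake_id(raw_id: str) -> int:
    # Agree on one identifier convention so comparisons are apples-to-apples.
    return int(raw_id.removeprefix("e-"))

maint_by_id = {row["ENGINE_ID"]: row["DAYS_SINCE_MAINT"] for row in warehouse_maintenance}

joined = [
    {
        "engine_id": normalize_lake_id(row["engine_id"]),
        "anomaly_score": float(row["anomaly_score"]),    # cast text to a numeric type
        "days_since_maintenance": maint_by_id[normalize_lake_id(row["engine_id"])],
    }
    for row in lake_scores
]
print(joined[0])   # {'engine_id': 42, 'anomaly_score': 0.87, 'days_since_maintenance': 210}
```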

There are many other modern modeling challenges, let me know what has your attention.

Don Tonner is a member of the Architecture and Modeling Solutions team, and has worked on several cool projects such as Teradata Mapping Manager, the unification modules, and Solution Modeling Building Blocks. He is currently creating an Industry Dimensions development kit and working out how models might be useful when combining information from disparate platforms. You can also reach him on Twitter, @BigDataDon.


In advance of the upcoming webinar Achieving Pervasive Analytics through Data & Analytic Centricity, Dan Woods, CTO and editor of CITO Research, sat down with Clarke Patterson, senior director, Product Marketing, Cloudera, and Chris Twogood, vice president of Product and Services Marketing, Teradata, to discuss some of the ideas and concepts that will be shared in more detail on May 14, 2015.


Having been briefed by Cloudera and Teradata on Pervasive Analytics and Data & Analytic Centricity, I have to say it’s refreshing to hear vendors talk about WHY and HOW big data is important in a constructive way, rather than offering platitudes and jumping straight into the technical details of the WHAT, which is so often the case.

Let me start by asking you both in your own words to describe Pervasive Analytics and Data & Analytic Centricity, and why this an important concept for enterprises to understand?


During eras of global economic shifts, there is always a key resource discovered that becomes the spark of transformation for organizations that can effectively harness it. Today, that resource is unquestionably ‘data’. Forward-looking companies realize that to be successful, they must leverage analytics in order to provide value to their customers and shareholders. In some cases they must package data in a way that adds value and informs employees, or their customers, by deploying analytics into decisions making processes everywhere. This idea is referred to as pervasive analytics.

I would point to the success that Teradata’s customers have had over the past decades in terms of making analytics pervasive throughout enterprises. The spectrum across which their customers have gained value is comprehensive, from business intelligence reporting and executive dashboards, to advanced analytics, to enabling front line decision makers, and embedding analytics into key operational processes. And while those opportunities remain, the explosion of new data types and the breadth of new analytic capabilities are leading successful companies to recognize the need to evolve the way they think about data management and processes in order to harness the value of all their data.


I couldn’t agree more. It’s interesting now that we’re several years into the era of big data to see how different companies have approached this opportunity, which really boils down to two approaches. Some companies have taken the approach of what can we do with this newer technology that has emerged, while others take the approach of defining a strategic vision for the role of the data and analytics to support their business objectives and then map the technology to the strategy. The former, which we refer to as an application centric approach, can result in some benefits, but typically runs out of steam as agility slows and new costs and complexities emerge; while the latter is proving to create substantially more competitive advantage as organizations put data and analytics – not a new piece of technology – at the center of their operations. Ultimately, these companies that take a data and analytic centric approach are coming to a conclusion that there are multiple technologies required, and their acumen on applying the-right-tool-to-the-right-job naturally progresses, and the usual traps and pitfalls are avoided.


Would you elaborate on what is meant by “companies need to evolve the way they think about data management?”


Pre “big data,” there was a single approach to data integration whereby data is made to look the same, or normalized, in some sort of persistence layer such as a database, and only then can value be created. The idea is that by absorbing the costs of data integration up front, the costs of extracting insights decrease. We call this approach “tightly coupled.” This is still an extremely valuable methodology, but it is no longer sufficient as the sole approach to managing all data in the enterprise.

Post “big data,” using the same tightly coupled approach to integration undermines the value of newer data sets that have unknown or under-appreciated value. Here, new methodologies to “loosely couple” the data, or not couple it at all, are essential to cost-effectively manage and integrate it. These distinctions are incredibly helpful for understanding the value of Big Data, deciding where best to invest, and highlighting the challenges that remain a fundamental hindrance to most enterprises.

But regardless of how the data is most appropriately managed, the most important thing is to ensure that organizations retain the ability to connect-the-dots for all their data, in order to draw correlations between multiple subject areas and sources and foster peak agility.


I’d also cite that leading companies are evolving the way they approach analytics. We can analyze any kind of data now - numerical, text, audio, video. We are now able to discover insights in this complex data. Further, new forms of procedural analytics have emerged in the era of big data, such as graph, time-series, machine learning, and text analytics.

This allows us to expand our understanding of the problems at hand. Key business imperatives like churn reduction, fraud detection, increasing sales and marketing effectiveness, and operational efficiencies are not new, and have been skillfully leveraged by data-driven businesses with tightly coupled methods and SQL-based analytics – that’s not going away. But when organizations harness newer forms of data that add to the picture, and new complementary analytic techniques, they realize better churn and fraud models, greater sales and marketing effectiveness, and more efficient business operations.

To learn more, please join the Achieving Pervasive Analytics through Data & Analytic Centricity webinar on Thursday, May 14, from 10:00 to 11:00 a.m. PT.

Making SAP data relevant in the world of big data

Posted on: May 4th, 2015 by Patrick Teunissen


Part one of a series about an old “SAP” dog who learns a new trick

Reflecting back on the key messages from Teradata Universe 2015 in April, it was impossible to escape the theme of deriving differentiated business value by leveraging the latest data sources and analytic techniques. I heard from several customers how they improved their business by combining their traditional warehouse data (or ‘SAP data’ for us old dogs) with other data from across the enterprise and applying advanced analytic techniques. A special interest group dedicated a whole morning to exploring the value of integrating ‘SAP data’ with ‘other data’. As I sat through these sessions, I found it funny that companies that run SAP ERP always speak about their data in terms of SAP data and other data. It made me wonder: what is ‘other data’ and what makes it so special?

In most cases, ‘other data’ is referred to as ‘Big Data’. The term is quite ubiquitous and was used to describe just about every data source. But it’s important to note that, throughout the sessions I attended, none of the companies referred to their SAP data as Big Data. Big Data was a term reserved for the (relatively) new sources of data like machine-generated data from the Internet of Things, call center details, POS-related data, and social media/web logs.

Although not “big”, customers told me they considered their SAP ERP applications to be complex fortresses of data. In comparison to traditional data warehouses or big data stores, SAP data is very difficult to extract and integrate with their ‘other data’. Even SAP acknowledges this shortcoming, as evidenced by its recent programs to ‘Simplify’ its complex applications. But I’d argue that while SAP ERPs may be complex to run, the data that is processed in these applications is quite simple. SAP experts would agree that if you know where to look, the data is both simple and reliable.

Unfortunately, these experts live in a world of their own, focused entirely on data that flows through SAP. But as evidenced by the customers at Teradata Universe, the lion’s share of new IT projects and business initiatives are focused on leveraging ‘big data’. This means the folks who know SAP are rarely involved in the IT projects involving ‘big data’, and vice versa, which explains the chasm between SAP and ‘other data’. The ‘Big Data’ folks don’t understand the valuable context that SAP brings, and the ‘SAP data’ folks don’t understand the new insights that analytics on the ‘other data’ can deliver.

However, the tides are turning and the general consensus now agrees that there is value in bringing SAP data together with big data. SAP ERP is used primarily for managing the transactional processes in the financial, logistics, manufacturing, and administration functions. This means it houses high quality master data, attribute data, and detailed facts about the business. Combining this structured and reliable data with multi-structured big data can add valuable confidence and context to the analytics that matter most to businesses today!

Here’s a recent example where a customer integrated the results of advanced text analytics with their SAP ERP data within their Teradata warehouse. The data science team was experimenting with a number of Aster machine learning and natural language processing techniques to find meaning and insight in field technician reports. Using one of Aster’s text analytic methods, Latent Dirichlet Allocation, they were able to identify common related word groups within the reports that flagged quality events such as “broken again” or “running as expected”. However, they also discovered unexpected insights regarding equipment suppliers and 3rd-party service providers hidden in the field reports, such as “Supplier XYZ is causing problems” or “ABC is easy to work with”. They were then able to integrate all of these relatable word groups with context from the SAP ERP purchasing history data stored in the warehouse to provide additional insight and enrichment to their supplier scores.
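As a rough illustration of the general technique (using scikit-learn's LDA in Python as a stand-in for the Aster functions the team actually used; the reports and supplier names are invented):

```python
# Rough illustration of topic extraction on field-technician text, using
# scikit-learn's LDA as a stand-in for the Aster workflow described above.
# The reports and supplier names are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reports = [
    "pump broken again, supplier XYZ is causing problems",
    "compressor running as expected after service by ABC",
    "XYZ valve broken again, leaking badly",
    "ABC is easy to work with, unit running as expected",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(reports)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(doc_term)       # per-report topic weights

# The "related word groups": top words per topic.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {topic_idx}: {top_words}")

# In the scenario above, the per-report topic weights would then be joined back
# to SAP ERP purchasing history (e.g., on a supplier or equipment key) to
# enrich supplier scores.
```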



Zoomed-in view of Data Analytics Graph
(Healthcare Example)

In the first part of this two part blog series, I discussed the competitive importance of cross-functional analytics [1]. I also proposed that by treating Data and Analytics as a network of interconnected nodes in Gephi [2], we can examine a statistical metric for analytics called Degree Centrality [3]. In this second part of the series I will now examine parts of the sample Healthcare industry graph animation in detail and draw some high level conclusions from the Degree Centrality measurement for analytics.

In this sample graph [4], link analysis was performed on a network of 3,428 nodes and 8,313 directed edges. The majority of the nodes represent either Analytics or Source Data Elements. Many Analytics in this graph require data from multiple source systems, resulting in cross-functional Degree Centrality (connectedness). Some of the Analytics in this study display more Degree Centrality than others.
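The same kind of measurement can be reproduced outside Gephi. Here is a small sketch with the networkx Python library on a toy graph; the nodes and edges are hypothetical, not the actual study data, with edges pointing from each analytic to the source data elements it consumes:

```python
# Sketch: degree and in-degree on a toy analytics/data graph using networkx,
# mirroring the Gephi measurements described above. Nodes are hypothetical;
# each edge points from an analytic to a source data element it consumes.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("Readmission analysis", "Clinical*PRODUCT-Product Id"),
    ("Readmission analysis", "Medical Claims*CLAIM-Claim Num"),
    ("Cost of care PMPM", "Medical Claims*CLAIM-Claim Num"),
    ("Cost of care PMPM", "Membership*MEMBER-Agreement Id"),
    ("Cost of care PMPM", "Clinical*PRODUCT-Product Id"),
])

# Normalized connectedness of every node (Degree Centrality).
centrality = nx.degree_centrality(G)
print(sorted(centrality.items(), key=lambda kv: -kv[1])[:2])

# Degree of an analytic = how many data elements it uses (cross-functional reach);
# in-degree of a data element = how many analytics consume it (reuse).
print(dict(G.out_degree(["Readmission analysis", "Cost of care PMPM"])))
# {'Readmission analysis': 2, 'Cost of care PMPM': 3}
print(dict(G.in_degree(["Medical Claims*CLAIM-Claim Num", "Membership*MEMBER-Agreement Id"])))
# {'Medical Claims*CLAIM-Claim Num': 2, 'Membership*MEMBER-Agreement Id': 1}
```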

The zoomed-in visualization starts with a single source system (green) with its data elements (cyan). Basic function specific analytics (red) can be performed with this single Clinical source system data. Even advanced analytics (Text Analysis) can be applied to this single source of data to yield function specific insights.

But data and business never exist in isolation. Usually, cross-functional analytics emerge as users look to gain additional value from combining data from various source systems. Notice how these new analytics use data from source systems in multiple functional areas such as Claims and Membership. Such cross-functional data combination, or data coupling, can now be supported at various levels of sophistication. For instance, data can be loosely coupled for analysis with data virtualization, or, if requirements dictate, it can be tightly coupled within a relational Integrated Data Warehouse.

As shown in the graph, even advanced analytics such as Time Series and Naïve Bayes can utilize data from multiple source systems. A data platform that can loosely couple or combine data for such cross-functional advanced analytics can be critical for efficiently discovering insights from new sources of data (see discovery platform). More importantly, as specific advanced analytics are eventually selected for operationalization, a data platform needs to easily integrate results and support easy access regardless of where the advanced analytics are performed.

Degree Ranking for sample Analytics from the Healthcare Industry Graph

Degree 3: How can we reduce manual effort required to evaluate physician notes and medical records in conjunction with billing procedure codes?
Degree 10: How can the number of complaints to Medicare be reduced in an effort to improve the overall STAR rating?
Degree 22: What is the ratio of surgical errors to hospital patients? And total medical or surgical errors? (Provider, Payer)
Degree 47: What providers are active in what networks and products? What is the utilization, in total, by network, and by product?
Degree 83: What are the trends over time for utilization for patients who use certain channels?
Degree 104: What is the cost of care PMPM, for medical, for pharmacy, and combined? How have clinical interventions impacted this cost over time?

The sample analytics listed above demonstrate varying degree of cross-functional Degree Centrality and should be supported with varying level of data coupling. This can range from non-coupled data to loosely coupled data to tightly coupled data. As the number of Analytics with cross-functional Degree Centrality cluster together it may indicate a need to employ tighter data coupling or data integration to drive consistency in the results being obtained. The clustering of Analytics may also be an indication of an emerging need for a data mart or extension of Integrated Data Warehouse that can be utilized by a broader audience.

In-Degree Ranking for sample Data Elements from the Healthcare Industry Graph

In-Degree 46: Accounts Receivable*PROVIDER BILL-Bill Payer Party Id
In-Degree 31: Clinical*APPLICATION PRODUCT-Product Id
In-Degree 25: Medical Claims*CLAIM-Claim Num
In-Degree 25: Membership*MEMBER-Agreement Id

Similarly, if data elements start to show high Degree Centrality, it may be an indication to re-assess whether there is a need for tighter coupling to drive consistency and enable broader data reuse. When the In-Degree metric is applied, data being used by more Analytics appears larger on the graph and is a likely candidate for tighter coupling. To support data design for tighter coupling from a cross-functional and even a cross-industry perspective, Teradata offers reference data model blueprints by industry. (See Teradata Data Models)

This calls for a data management ecosystem with data analytics platforms that can easily harvest this cross-functional Degree Centrality of Analytics and Data. Such a data management ecosystem would support varying degrees of data coupling, varying types of analytics, and varying types of data access based on data users. (Learn more about Teradata’s Unified Data Architecture.)

The analysis described above is exploratory and by no means a replacement for a thorough architectural assessment. Eventually the decision to employ the right degree of data coupling should rest on the full architecture requirements including but not limited to data integrity, security, or business value.

In conclusion, what our experiences have taught us in the past will still hold true for the future:
• Data sources are exponentially more valuable when combined or integrated with other data sets
• To maintain sustained competitive advantage business has to continue to search for insights building on the cross-functional centrality of data
• Unified data management ecosystems can now harvest this cross-functional centrality of data at a lower cost with efficient support for varying levels of data integration, analytic types, and users

Contact Teradata to learn more about how Teradata technology, architecture, and industry expertise can efficiently and effectively harvest this centrality of Data and Analytics.


[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).

[4] This specific industry example is illustrative and subject to the limitations of assumptions and quality of the sample data mappings used for this study.




Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.



High Level Data Analytics Graph
(Healthcare Example)


Michael Porter, in an excellent article in the November 2014 issue of the Harvard Business Review[1], points out that smart connected products are broadening competitive boundaries to encompass related products that meet a broader underlying need. Porter elaborates that the boundary shift is not only from the functionality of discrete products to cross-functionality of product systems, but in many cases expanding to a system of systems such as a smart home or smart city.

So what does all this mean from a data perspective? In that same article, Porter mentions that companies seeking leadership need to invest in capturing, coordinating, and analyzing more extensive data across multiple products and systems (including external information). The key takeaway is that the move toward gaining competitive advantage by searching for cross-functional or cross-system insights from data is only going to accelerate, not slow down. Exploiting cross-functional or cross-system centrality of data better than anyone else will remain critical to achieving a sustainable competitive advantage.

Understandably, as technology changes, the mechanisms and architecture used to exploit this cross-system centrality of data will evolve. Current technology trends point to a need for a data & analytic-centric approach that leverages the right tool for the right job and orchestrates these technologies to mask complexity for the end users, while also managing complexity for IT in a hybrid environment. (See this article published in Teradata Magazine.)

As businesses embrace the data & analytic-centric approach, the following types of questions will need to be addressed: How can business and IT decide on when to combine which data and to what degree? What should be the degree of data integration (tight, loose, non-coupled)? Where should the data reside and what is the best data modeling approach (full, partial, need based)? What type of analytics should be applied on what data?

Of course, to properly address these questions, an architecture assessment is called for. But for the sake of going beyond the obvious, one exploratory data point in addressing such questions could be to measure and analyze the cross-functional/cross-system centrality of data.

By treating data and analytics as a network of interconnected nodes in Gephi[2], the connectedness between data and analytics can be measured and visualized for such exploration. We can examine a statistical metric called Degree Centrality[3] which is calculated based on how well an analytic node is connected.

The high level sample data analytics graph demonstrates the cross-functional Degree Centrality of analytics from an Industry specific perspective (Healthcare). It also amplifies, from an industry perspective, the need for organizations to build an analytical ecosystem that can easily harness this cross-functional Degree Centrality of data analytics. (Learn more about Teradata’s Unified Data Architecture.)

In the second part of this blog post series we will walk through a zoomed-in view of the graph, analyze the Degree Centrality measurements for sample analytics, and draw some high-level data architecture implications.


[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).


Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.