
Pluralism and Secularity In a Big Data Ecosystem

Posted on: August 25th, 2015 by Guest Blogger


Solutions in today's analytic ecosystem are too often technically driven, without a focus on business value. The buzzwords tend to overshadow the realities of implementation and cost of ownership. I challenge you to view your analytic architecture through the lenses of pluralism and secularity. Without such a view of this world, your resume will fill out nicely but your business value will suffer.

In my previous role, prior to joining Teradata, I was given the task of trying to move "all" of our organization’s BI data to Hadoop. I will share my approach and show how best-in-class solutions come naturally when pluralism and secularity are used to support a business-first environment.

Big data has exposed some great insights into what we can, should, and need to do with our data. However, this space is filled with radical opinions and the pressure to "draw a line in the sand" between time-proven methodologies and what we know as "big data." Some may view these spaces moving in opposite directions; however, these spaces will collide. The question is not "if" but "when." What are we doing now to prepare for this inevitability? Hadapt seems to be moving in the right direction in terms of leadership between the two spaces.

Relational Databases
I found many of the data sets in relational databases to be lacking in structure, highly transient, and loosely coupled. Data scientists needed to have quick access to data sets to perform their hypothesis testing.

Continuously requesting IT to rerun their ETL processes was highly inefficient. A data scientist once asked me "Why can't we just dump the data in a Linux mount for exploration?" Schema-on-write was too restrictive as the data scientists could not predefine the attributes for the data set for ingestion. As the data sets became more complex and unstructured, the ETL processes became exponentially more complicated and performance was hindered.
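To make the schema-on-read idea concrete, here is a minimal Python sketch (the directory and field names are hypothetical, not from the original environment): raw JSON events are simply dumped to a landing directory, and the attributes are discovered at read time instead of being predefined for ingestion.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical landing directory where raw events are dumped with no predefined schema.
LANDING_DIR = Path("/mnt/discovery/raw_events")

def profile_attributes(landing_dir: Path) -> Counter:
    """Discover which attributes actually occur in the raw data (schema-on-read)."""
    seen = Counter()
    for path in landing_dir.glob("*.json"):
        with path.open() as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # malformed rows stay in the lake; they are just skipped here
                seen.update(record.keys())
    return seen

if __name__ == "__main__":
    for attribute, count in profile_attributes(LANDING_DIR).most_common(20):
        print(f"{attribute}: {count}")
```

Nothing is rejected on the way in; the structure is inferred after the fact, which is exactly what the data scientists were asking for when ETL and schema-on-write became the bottleneck.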

I also found during this exercise that my traditional BI analysts were perplexed when it came to formulating questions about the data. One reason was that the business did not know what questions to ask. This is a common challenge in the big data ecosystem. We are used to knowing our data and being able to come up with incredible questions about it. The BI analyst's world has been disrupted: as Ilya Katsov notes in one of his blogs, they now need to ask "What insights/answers do I have about my data?"

The product owner of Hadoop was convinced that the entire dataset should be hosted on Amazon Web Services (S3), which would allow our analytics (via Elastic MapReduce) to perform at incredible speeds. However, due to various ISO guidelines, the data sets had to be encrypted at rest and in transit, which degraded performance by approximately 30 percent.

Without an access path model, logical model, or unified model, business users and data scientists were left with little appetite for unified analytics. Data scientists were left to their own guidelines for integrated/federated/governed/liberated post-discovery analytical sets.

Communication with the rest of the organization became an unattainable goal. The models which came out of discovery were not federated across the organization, as there was a disconnect between the data scientists, data architects, Hadoop engineers, and data stewards, who spoke different languages. Data scientists were creating amazing predictive models while, at the same time, data stewards were looking for tools to help them provide predictive insight for the SAME DATA.

Using NoSQL to answer a specific question on a dataset required a new collection set. Maintaining and governing the numerous collections became a burden. There had to be a better way to answer many questions without the number of instantiated collections growing linearly with them. The answer may lie in access path modeling.

Another challenge I faced was when users wanted a graphical representation of the data and the embedded relationships or lack thereof. Are they asking for a data model? The users would immediately say no, since they read in a blog somewhere that data modeling is not required using NoSQL technology.

At the end of this entire implementation I found myself needing to integrate these various platforms for the sake of providing a business-first solution. Maybe the line in the sand isn't a business-first approach? Those who drive Pluralism (a condition or system in which two or more states, groups, principles, sources of authority, etc., coexist) and Secularity (not being devoted to a specific technology or data 'religion') within their analytic ecosystem can truly deliver a business-first solution approach while avoiding the proverbial "silver bullet" architecture solutions.

In my next post, I will share some of the practices for access path modeling within Big Data and how it supports pluralism and secularity within a business-first analytic ecosystem.

Sunile Manjee

Sunile Manjee is a Product Manager in Teradata’s Architecture and Modeling Solutions team. Big Data solutions are his specialty, along with the architecture to support a unified data vision. He has over 12 years of IT experience as a Big Data architect, DW architect, application architect, IT team lead, and 3gl/4gl programmer.

Optimization in Data Modeling 1 – Primary Index Selection

Posted on: July 14th, 2015 by Guest Blogger


In my last blog I spoke about the decisions that must be made when transforming an Industry Data Model (iDM) from Logical Data Model (LDM) to an implementable Physical Data Model (PDM). However, being able to generate DDL (Data Definition Language) that will run on a Teradata platform is not enough – you also want it to perform well. While it is possible to generate DDL almost immediately from a Teradata iDM, each customer’s needs mandate that existing structures be reviewed against data and access demographics, so that optimal performance can be achieved.

Having detailed data and access path demographics during PDM design is critical to achieving great performance immediately, otherwise it’s simply guesswork. Alas, these are almost never available at the beginning of an installation, but that doesn’t mean you can’t make “excellent guesses.”

The single most influential factor in achieving PDM performance is proper Primary Index (PI) selection for warehouse tables. Data modelers focus on entity/table Primary Keys (PKs), since they define uniqueness at the row level. Because of this, many physical modelers tend to implement the PK as a Unique Primary Index (UPI) on each table by default. But one of the keys to Teradata's great performance is that it uses the PI to physically distribute a table's data across the entire platform to optimize parallelism. Each processor gets a piece of the table based on the PI, so rows from different tables with the same PI value are co-resident and do not need to be moved when two tables are joined.
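As a toy illustration of why this matters (this is not Teradata's actual hashing algorithm, just a sketch of the principle), the following Python snippet hashes rows to AMPs by their PI value. Two tables that share Party Id as their PI land matching rows on the same AMP, so a join on Party Id needs no row movement.

```python
import hashlib

NUM_AMPS = 4  # toy system; a real platform has many more

def amp_for(pi_value) -> int:
    """Map a PI value to an AMP (illustrative stand-in for the platform's hash map)."""
    digest = hashlib.md5(str(pi_value).encode()).hexdigest()
    return int(digest, 16) % NUM_AMPS

# Two tables that share Party_Id as their (non-unique) PI.
customer_rows = [(1001, "Acme"), (1002, "Globex"), (1003, "Initech")]
order_rows = [(1001, "ORD-1"), (1001, "ORD-2"), (1003, "ORD-3")]

for party_id, name in customer_rows:
    print(f"CUSTOMER {party_id} -> AMP {amp_for(party_id)}")
for party_id, order_num in order_rows:
    print(f"ORDER    {party_id} -> AMP {amp_for(party_id)}")

# Rows with the same Party_Id always land on the same AMP, so a join on Party_Id
# is AMP-local. If ORDER instead used its own PK (the order number) as the PI,
# its rows would have to be redistributed by Party_Id before the join.
```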

In a Third Normal Form (3NF) model no two entities (outside of super/subtypes and rare exceptions) will have the same PK, so if PKs are chosen as PIs, it stands to reason that no two tables will share a PI and every table join will require data from at least one table to be moved before the join can be completed – not a solid performance decision, to say the least.

The iDMs have preselected PIs largely based on identifiers common across subject areas (e.g., Party Id), so that all information regarding that ID will be co-resident and joins will be AMP-local. These non-unique PIs (NUPIs) are a great starting point for your PDM, but again they need to be evaluated against customer data and access plans to ensure that both good performance and reasonably even data distribution are achieved.

Even data distribution across the Teradata platform is important, since skewed data can contribute both to poor performance and to space allocation problems (run out of space on one AMP, run out of space on all). However, it can be overemphasized to the detriment of performance.

Say, for example, a table has a PI of PRODUCT_ID, and there are a disproportionate number of rows for several Products, causing skewed distribution. Altering the PI to the table PK instead will provide perfectly even distribution, but remember: when joining to that table, if all elements of the PK are not available, then the rows of the table will need to be redistributed, most likely by PRODUCT_ID.

This puts them back on the same AMPs where they were in the skewed scenario. This time, instead of a "rest state" skew, the rows will skew during redistribution, and this will happen every time the table is joined to – not a solid performance decision. Optimum performance can therefore be achieved with sub-optimum distribution.
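Here is a small sketch of the trade-off being described, with made-up numbers: measuring rows per AMP for a PI on a skewed PRODUCT_ID. The PK-as-PI alternative would flatten this "rest state" skew, but the same hot spot would simply reappear at redistribution time on every join by PRODUCT_ID.

```python
import hashlib
from collections import Counter

NUM_AMPS = 4  # toy system

def amp_for(value) -> int:
    # Illustrative hash only, not the real hashing algorithm.
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16) % NUM_AMPS

# 1,000 order lines, but 70% of them are for one hot product.
product_ids = [7] * 700 + list(range(100, 400))

rows_per_amp = Counter(amp_for(p) for p in product_ids)
average = len(product_ids) / NUM_AMPS
worst = max(rows_per_amp.values())
print(rows_per_amp)
print(f"skew factor: {worst / average:.2f}  (1.0 would be perfectly even)")

# A PI on the unique PK would measure ~1.0 here, but every join on PRODUCT_ID
# would then redistribute these rows and recreate the same hot spot at query time.
```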

iDM tables relating two common identifiers will usually have one of the IDs pre-selected as a NUPI. In some installations the access demographics will show that the other ID may be the better choice. If so, change it! Or it may leave you with no clear choice, in which case picking one is almost assuredly better than changing the PI to a composite index consisting of both IDs, as this will only result in a table that is no longer co-resident with any table indexed by either of the IDs alone.

There are many other factors that contribute to achieving optimal performance of your physical model, but they all pale in comparison to a well-chosen PI. In my next blog we’ll look at some more of these and discuss when and how best to implement them.

Jake Kurdsjuk

Jake Kurdsjuk is Product Manager for the Teradata Communications Industry Data Model, purchased by more than one hundred Communications Service Providers worldwide. Jake has been with Teradata since 2001 and has 25 years of experience working with Teradata within the Communications Industry, as a programmer, DBA, Data Architect and Modeler.

Why We Love Presto

Posted on: June 24th, 2015 by Daniel Abadi


Concurrent with acquiring Hadoop companies Hadapt and Revelytix last year, Teradata opened the Teradata Center for Hadoop in Boston. Teradata recently announced that a major new initiative of this Hadoop development center will include open-source contributions to a distributed SQL query engine called Presto. Presto was originally developed at Facebook, and is designed to run high performance, interactive queries against Big Data wherever it may live --- Hadoop, Cassandra, or traditional relational database systems.

Those who will be part of this initiative and contribute code to Presto include a subset of the Hadapt team that joined Teradata last year. In the following, we will dive deeper into the thinking behind this new initiative from the perspective of the Hadapt team. It is important to note upfront that Teradata’s interest in Presto, and the people contributing to the Presto codebase, extend beyond the Hadapt team that joined Teradata last year. Nonetheless, it is worthwhile to understand the technical reasoning behind the embrace of Presto from Teradata, even if it presents a localized view of the overall initiative.

Around seven years ago, Ashish Thusoo and his team at Facebook built the first SQL layer over Hadoop as part of a project called Hive. At its essence, Hive was a query translation layer over Hadoop: it received queries in a SQL-like language called Hive-QL, and transformed them into a set of MapReduce jobs over data stored in HDFS on a Hadoop cluster. Hive was truly the first project of its kind. However, since its focus was on query translation into the existing MapReduce query execution engine of Hadoop, it achieved tremendous scalability, but poor efficiency and performance, and ultimately led to a series of subsequent SQL-on-Hadoop solutions that claimed 100X speed-ups over Hive.

Hadapt was the first such SQL-on-Hadoop solution that claimed a 100X speed-up over Hive on certain types of queries. Hadapt was spun out of the HadoopDB research project from my team at Yale and was founded by a group of Yale graduates. The basic idea was to develop a hybrid system that is able to achieve the fault-tolerant scalability of the Hive MapReduce query execution engine while leveraging techniques from the parallel database system community to achieve high performance query processing.

The intention of HadoopDB/Hadapt was never to build its own query execution layer. The first version of Hadapt used a combination of PostgreSQL and MapReduce for distributed query execution. In particular, the query operators that could be run locally, without reliance on data located on other nodes in the cluster, were run using PostgreSQL’s query operator set (although Hadapt was written such that PostgreSQL could be replaced by any performant single-node database system). Meanwhile, query operators that required data exchange between multiple nodes in the cluster were run using Hadoop’s MapReduce engine.

Although Hadapt was 100X faster than Hive for long, complicated queries that involved hundreds of nodes, its reliance on Hadoop MapReduce for parts of query execution precluded sub-second response time for small, simple queries. Therefore, in 2012, Hadapt started to build a secondary query execution engine called “IQ” which was intended to be used for smaller queries. The idea was that all queries would be fed through a query-analyzer layer before execution. If a query was predicted to be long and complex, it would be fed through Hadapt’s original fault-tolerant MapReduce-based engine. However, if it was predicted to complete in a few seconds or less, it would be fed to the IQ execution engine.

In 2013 Hadapt integrated IQ with Apache Tez in order to avoid redundant programming efforts, since the primary goals of IQ and Tez were aligned. In particular, Tez was designed as an alternative to MapReduce that can achieve interactive performance for general data processing applications. Indeed, Hadapt was able to achieve interactive performance on a much wider range of queries when leveraging Tez than it had previously.

Figure 1: Intertwined Histories of SQL-on-Hadoop Technology

Unfortunately Tez was not quite a perfect fit as a query execution engine for Hadapt’s needs. The largest issue was that before shipping data over the network during distributed operators, Tez first writes this data to local disk. The overhead of writing this data to disk (especially when the size of the intermediate result set was large) precluded interactivity for a non-trivial subset of Hadapt’s query workload. A second problem was that the Hive query operators implemented over Tez use (by default) traditional Volcano-style row-by-row iteration. In other words, a single function invocation for a query operator processes just a single database record. This results in a large number of function calls to process a large dataset, and poor instruction cache locality, as the instructions associated with a particular operator are repeatedly reloaded into the instruction cache for each function invocation. Although Hive and Tez have started to alleviate this issue with the recent introduction of vectorized operators, Hadapt still found that query plans involving joins or SQL functions would fall back to row-by-row iteration.
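To get a feel for the row-by-row overhead described above, here is a small Python analogy (Python and NumPy standing in for the Java operators in Hive, Tez, and Presto, so the numbers are only illustrative): summing a column with one function invocation per record versus one vectorized call over the whole batch.

```python
import time
import numpy as np

values = np.random.rand(5_000_000)

# Volcano-style: effectively one "next()" invocation per record.
def row_at_a_time_sum(column):
    total = 0.0
    for v in column:  # each iteration plays the role of a per-row operator call
        total += v
    return total

start = time.perf_counter()
row_at_a_time_sum(values)
row_seconds = time.perf_counter() - start

# Vectorized: one invocation processes the whole batch in a tight inner loop,
# which is far friendlier to the instruction cache.
start = time.perf_counter()
values.sum()
vector_seconds = time.perf_counter() - start

print(f"row-at-a-time: {row_seconds:.3f}s  vectorized: {vector_seconds:.3f}s")
```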

The Hadapt team therefore decided to refocus its query execution strategy (for the interactive query part of Hadapt’s engine) on Presto, which presented several advantages over Tez. First, Presto pipelines data between distributed query operators directly, without writing to local disk, significantly improving performance for network-intensive queries. Second, Presto query operators are vectorized by default, thereby improving CPU efficiency and instruction cache locality. Third, Presto dynamically compiles selective query operators to byte code, which lets the JVM optimize and generate native machine code. Fourth, it uses direct memory management, thereby avoiding Java object allocations, heap memory overhead, and garbage collection pauses. Overall, Presto is a very advanced piece of software, and very much in line with Hadapt’s goal of leveraging as many techniques from modern parallel database system architecture as possible.

The Teradata Center for Hadoop has thus fully embraced Presto as the core part of its technology strategy for the execution of interactive queries over Hadoop. Consequently, it made logical sense for Teradata to take its involvement in Presto to the next level. Furthermore, Hadoop is fundamentally an open source project, and in order to become a significant player in the Hadoop ecosystem, Teradata needs to contribute meaningful and important code to the open source community. Teradata’s recent acquisition of Think Big serves as further motivation for such contributions.

Therefore Teradata has announced that it is committed to making open source contributions to Presto, and has allocated substantial resources to doing so. Presto is already used by Silicon Valley stalwarts Facebook, Airbnb, Netflix, Dropbox, and Groupon. However, Presto’s enterprise adoption outside of Silicon Valley remains small. Part of the reason for this is that the ease-of-use and enterprise features typically associated with modern commercial database systems are not fully available with Presto. Missing are an out-of-the-box, simple-to-use installer, database monitoring and administration tools, and third-party integrations. Therefore, Teradata’s initial contributions will focus on these areas, with the goal of bridging the gap to getting Presto widely deployed in traditional enterprise applications. This will hopefully lead to more contributors and momentum for Presto.

For now, Teradata’s new commitments to open source contributions in the Hadoop ecosystem are focused on Presto. Teradata’s commitment to Presto, and to making meaningful contributions to an open source project, is an exciting development. It will likely have a significant impact on enterprise adoption of Presto. Hopefully, Presto will become a widely used open source parallel query execution engine --- not just within the Hadoop community, but, due to the generality of its design and its storage layer agnosticism, for relational data stored anywhere.


Learn more or download Presto now.


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). Follow Daniel on Twitter @Daniel_Abadi

Your Big Data Initiative may not Require Logical Modeling

Posted on: May 12th, 2015 by Guest Blogger


By: Don Tonner

Logical Modeling may not be required on your next big data initiative.  From experience, I know when building things from scratch that a model reduces development costs, improves quality, and gets me to market quicker.  So why would I say you may not require logical modeling?

Most data modelers are employed in forward engineering activities in which the ultimate goal is to create a database or an application used by companies to manage their businesses.  The process is generally:

  • Obtain an understanding of the business concepts that the database will serve.
  • Organize the business information into structured data components and constraints—a logical model.
  • Create data stores based on the logical model and let the data population and manipulation begin.

Forward engineering is the act of going from requirements to a finished product. For databases that means starting with a detailed understanding of the information of the business, which is found largely in the minds and practices of the employees of the enterprise. This detailed understanding may be thought of as a conceptual model. Various methods have evolved to document this conceptual richness; one example is the Object Role Model.

The conceptual model (detailed understanding of the enterprise; not to be confused with a conceptual high level E/R diagram) is transformed into a logical data model, which organizes data into structures upon which relational algebra may be performed. The thinking here is very mathematical. Data can be manipulated mathematically the same way we can manipulate anything else mathematically. Just like you may write an equation that expresses how much material it might take for a 3D printer to create a lamp, you may write an equation to show the difference between the employee populations of two different corporate regions.
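As a small worked illustration of that last sentence (the notation is mine, not from the original post), the difference between the employee populations of two regions can be written as a relational algebra expression over an Employee relation with a region attribute, where σ is the selection operator and |·| counts the resulting tuples:

```latex
\Delta = \bigl|\,\sigma_{\mathrm{region}=\text{'East'}}(\mathrm{Employee})\,\bigr|
       - \bigl|\,\sigma_{\mathrm{region}=\text{'West'}}(\mathrm{Employee})\,\bigr|
```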

The image that most of us have of a data model is not equations, variables or valid operations, but is the visual representation of the structures that represent the variables. Below you can see structures as well as relationships which are a kind of constraint.

Ultimately these structures and constraints will be converted into data stores, such as tables, columns, indexes and data types, which will be populated with data that may be constrained by some business rules.

Massively parallel data storage architectures are becoming increasingly popular as they address the challenges of storing and manipulating almost unimaginable amounts of data. The ability to ingest data quickly is critical as volumes increase. One approach is receiving the data without prior verification of the structure. HDFS files or JSON datatypes are examples of storage that do not require knowledge of the structure prior to loading.

OK, imagine a project where millions of readings from hundreds of sensors from scores of machines are collected every shift, possibly into a data lake. Engineers discover that certain analytics performed on the machine data can potentially alert us to conditions that may warrant operator intervention. Data scientists will create several analytic metrics based on hourly aggregates of the sensor data. What’s the modeler’s role in all this?

The models you are going to use on your big data initiative likely already exist.  All you have to do is find them.

One approach would be to reverse engineer a model of the structures of the big data, which can provide visual clues to the meaning of the data. Keep in mind that big data sources may have rapidly changing schemas, so reverse engineering may have to occur periodically on the same source to gather potential new attributes. Also remember that a database of any kind is an imperfect representation of the logical model, which is itself an imperfect representation of the business. So there is much interpretation required to go from the reverse engineered model to a business understanding of the data.

You would also start reviewing an enterprise data model or the forward-engineered data warehouse model. After all, while the big data analytic can point out which engines are experiencing conditions that need attention, matching those engine analytics to that day's workload, the experience level of the operator, and the time since the last maintenance greatly expands the value of the analytic.

So how do you combine data from disparate platforms? A logical modeler in a forward engineering environment ensures that all the common things have the same identifiers and data types, and this is built into the system. That same skill set needs to be leveraged if there is going to be any success performing cross-platform analytics. The identifiers of the same things on the different platforms need to be cross-validated in order to make apples-to-apples comparisons. If analytics are going to be captured and stored in the existing Equipment Scores section of the warehouse, the data will need to be transformed to the appropriate identifiers and data types. If the data is going to be joined on the fly via Teradata QueryGrid™, knowledge of these IDs and data types is essential for success and performance.
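Here is a minimal, hypothetical sketch of that identifier and data type conformance step in Python/pandas (the column names and values are invented for illustration): the sensor aggregates carry a zero-padded string engine id while the warehouse dimension uses an integer id, so the modeler's cross-validation boils down to conforming both sides before the join.

```python
import pandas as pd

# Hypothetical extracts: hourly sensor aggregates from the data lake and the
# warehouse's equipment dimension. Names and values are illustrative only.
sensor_hourly = pd.DataFrame({
    "engine_id": ["E-0042", "E-0042", "E-0107"],  # string, zero-padded
    "hour": pd.to_datetime(["2015-05-12 08:00", "2015-05-12 09:00", "2015-05-12 08:00"]),
    "vibration_score": [0.81, 0.93, 0.12],
})
equipment_dim = pd.DataFrame({
    "EQUIPMENT_ID": [42, 107],  # integer identifier in the warehouse
    "LAST_MAINTENANCE_DT": pd.to_datetime(["2015-04-01", "2015-05-01"]),
    "OPERATOR_EXPERIENCE_YRS": [12, 2],
})

# Conform the identifier and its data type before joining -- the modeler's job
# described above: same identifier, same type, on both platforms.
sensor_hourly["EQUIPMENT_ID"] = (
    sensor_hourly["engine_id"].str.replace("E-", "", regex=False).astype(int)
)

enriched = sensor_hourly.merge(equipment_dim, on="EQUIPMENT_ID", how="left")
print(enriched)
```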

There are many other modern modeling challenges; let me know what has your attention.

Don Tonner is a member of the Architecture and Modeling Solutions team, and has worked on several cool projects such as Teradata Mapping Manager, the unification modules, and Solution Modeling Building Blocks.  He is currently creating an Industry Dimensions development kit and working out how models might be useful when combining information from disparate platforms.  You can also reach him on Twitter, @BigDataDon.

Making SAP data relevant in the world of big data

Posted on: May 4th, 2015 by Patrick Teunissen


Part one of a series about an old "SAP" dog who learns a new trick

Reflecting back on the key messages from Teradata Universe 2015 in April, it was impossible to escape the theme of deriving differentiated business value by leveraging the latest data sources and analytic techniques. I heard from several customers how they improved their business by combining their traditional warehouse data (or ‘SAP data’ for us old dogs) with other data from across the enterprise and applying advanced analytic techniques. A special interest group dedicated a whole morning to exploring the value of integrating ‘SAP data’ with ‘other data’. As I sat through these sessions, I found it funny that companies that run SAP ERP always speak about their data in terms of SAP data and other data. It made me wonder: what is ‘other data’ and what makes it so special?

In most cases, ‘other data’ is referred to as ‘Big Data’. The term is quite ubiquitous and was used to describe just about every data source. But it’s important to note that, throughout the sessions I attended, none of the companies referred to their SAP data as Big Data. Big Data was a term reserved for the (relatively) new sources of data like machine-generated data from the Internet of Things, call center details, POS-related data, and social media/web logs.

Although not “big”, customers told me they considered their SAP ERP applications to be complex fortresses of data. In comparison to traditional data warehouses or big data stores, SAP data is very difficult to extract and integrate with their ‘other data’. Even SAP acknowledges this shortcoming, as evidenced by their recent programs to ‘Simplify’ their complex applications. But I’d argue that while SAP ERPs may be complex to run, the data that is processed in these applications is quite simple. SAP experts would agree that if you know where to look, the data is both simple and reliable.

Unfortunately these experts live in a world of their own, focused entirely on data that flows through SAP. But as evidenced by the customers at Teradata Universe, the lion’s share of new IT projects and business initiatives are focused on leveraging ‘big data’. This means the folks who know SAP are rarely involved in the IT projects involving ‘big data’, and vice versa, which explains the chasm between SAP and ‘other data’. The ‘Big Data’ folks don’t understand the valuable context that SAP brings. And the ‘SAP data’ folks don’t understand the new insights that analytics on the ‘other data’ can deliver.

However, the tides are turning and a general consensus has emerged that there is value in bringing SAP data together with big data. SAP ERP is used primarily for managing the transactional processes in the financial, logistics, manufacturing, and administration functions. This means it houses high-quality master data, attribute data, and detailed facts about the business. Combining this structured and reliable data with multi-structured big data can add valuable confidence and context to the analytics that matter most to businesses today!

Here’s a recent example where a customer integrated the results of advanced text analytics with their SAP ERP data within their Teradata warehouse. The data science team was experimenting with a number of Aster machine learning and natural language processing techniques to find meaning and insight in field technician reports. Using one of Aster’s text analytic methods, Latent Dirichlet Allocation, they were able to identify common related word groups within their reports to identify quality events such as “broken again” or “running as expected”. However they discovered unexpected insight regarding equipment suppliers and 3rd party service providers also hidden in the field reports, such as “Supplier XYZ is causing problems” or “ABC is easy to work with”. They were then able to integrate all of these relatable word groups with context from the SAP ERP purchasing history data stored in the warehouse to provide additional insight and enrichment to their supplier scores.
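The customer used Aster's Latent Dirichlet Allocation; as a rough stand-in to show the shape of that workflow (this is scikit-learn, not the Aster implementation, and the reports and supplier references are invented), a few lines of Python can pull related word groups out of field-technician notes, after which the per-report topic scores can be keyed by supplier and joined to SAP purchasing history in the warehouse.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical field-technician notes; in the real case the supplier key for the
# join would come from SAP purchasing history in the warehouse.
reports = [
    "pump from supplier XYZ broken again, waiting on parts",
    "unit running as expected after ABC service visit",
    "supplier XYZ is causing problems with late deliveries",
    "ABC is easy to work with, running as expected",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(reports)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(doc_term)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {topic_idx}: {top_terms}")

# Per-report topic scores, ready to be landed in the warehouse and joined to
# SAP purchasing history to enrich supplier scores.
print(doc_topics.round(2))
```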



Zoomed in view of Data Analytics Graph
(Healthcare Example)


In the first part of this two-part blog series, I discussed the competitive importance of cross-functional analytics [1]. I also proposed that by treating Data and Analytics as a network of interconnected nodes in Gephi [2], we can examine a statistical metric for analytics called Degree Centrality [3]. In this second part of the series I will examine parts of the sample Healthcare industry graph animation in detail and draw some high-level conclusions from the Degree Centrality measurement for analytics.

In this sample graph [4], link analysis was performed on a network of 3,428 nodes and 8,313 directed edges. The majority of the nodes represent either Analytics or Source Data Elements. Many Analytics in this graph require data from multiple source systems, resulting in cross-functional Degree Centrality (connectedness). Some of the Analytics in this study display more Degree Centrality than others.
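For readers who want to reproduce the idea on their own data, here is a minimal sketch using networkx instead of Gephi, on a tiny invented analytics-to-data graph (the node names only echo the examples in this post): the out-degree of an Analytic counts how many source data elements it touches, and the in-degree of a data element counts how many Analytics reuse it.

```python
import networkx as nx

# Directed edges point from an Analytic to the Source Data Elements it consumes.
G = nx.DiGraph()
G.add_edges_from([
    ("STAR rating analysis", "Clinical*APPLICATION PRODUCT-Product Id"),
    ("STAR rating analysis", "Membership*MEMBER-Agreement Id"),
    ("Cost of care PMPM", "Medical Claims*CLAIM-Claim Num"),
    ("Cost of care PMPM", "Membership*MEMBER-Agreement Id"),
    ("Cost of care PMPM", "Accounts Receivable*PROVIDER BILL-Bill Payer Party Id"),
    ("Surgical error ratio", "Medical Claims*CLAIM-Claim Num"),
])

# Degree of each Analytic: how many source elements it touches (cross-functional reach).
for analytic in ("STAR rating analysis", "Cost of care PMPM", "Surgical error ratio"):
    print(analytic, "->", G.out_degree(analytic))

# In-degree of each data element: how many Analytics reuse it (a candidate for tighter coupling).
for node, in_deg in sorted(G.in_degree(), key=lambda pair: -pair[1]):
    if in_deg:
        print(in_deg, node)
```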

The zoomed-in visualization starts with a single source system (green) with its data elements (cyan). Basic function specific analytics (red) can be performed with this single Clinical source system data. Even advanced analytics (Text Analysis) can be applied to this single source of data to yield function specific insights.

But data and business never exist separately in isolation. Usually cross-functional analytics emerge with users looking to gain additional value from combining data from various source systems. Notice how these new analytics are using data from source systems in multiple functional areas such as Claims and Membership. Such cross functional data combination or data coupling can now be supported at various levels of sophistication. For instance, data can be loosely coupled for analysis with data virtualization, or if requirements dictate, it can be tightly coupled within a relational Integrated Data Warehouse.

As shown in the graph, even advanced analytics such as Time Series and Naïve Bayes can utilize data from multiple source systems. A data platform that can loosely couple or combine data for such cross-functional advanced analytics can be critical for efficiently discovering insights from new sources of data (see discovery platform). More importantly, as specific advanced analytics are eventually selected for operationalization, a data platform needs to easily integrate results and support easy access regardless of where the advanced analytics are being performed.

Degree Ranking for sample Analytics from the Healthcare Industry Graph

  • Degree 3: How can we reduce the manual effort required to evaluate physician notes and medical records in conjunction with billing procedure codes?
  • Degree 10: How can the number of complaints to Medicare be reduced in an effort to improve the overall STAR rating?
  • Degree 22: What is the ratio of surgical errors to hospital patients? And total medical or surgical errors? (Provider, Payer)
  • Degree 47: What providers are active in what networks and products? What is the utilization? In total, by network, by product?
  • Degree 83: What are the trends over time for utilization for patients who use certain channels?
  • Degree 104: What is the cost of care PMPM? For medical, for pharmacy, combined. How have clinical interventions impacted this cost over time?

The sample analytics listed above demonstrate varying degrees of cross-functional Degree Centrality and should be supported with varying levels of data coupling, ranging from non-coupled to loosely coupled to tightly coupled data. As Analytics with cross-functional Degree Centrality cluster together, it may indicate a need to employ tighter data coupling or data integration to drive consistency in the results being obtained. The clustering of Analytics may also indicate an emerging need for a data mart or an extension of the Integrated Data Warehouse that can be utilized by a broader audience.

In-Degree Ranking for sample Data Elements from the Healthcare Industry Graph

  • In-Degree 46: Accounts Receivable*PROVIDER BILL-Bill Payer Party Id
  • In-Degree 31: Clinical*APPLICATION PRODUCT-Product Id
  • In-Degree 25: Medical Claims*CLAIM-Claim Num
  • In-Degree 25: Membership*MEMBER-Agreement Id

Similarly, if Data starts to show high Degree Centrality, it may be an indication to re-assess whether tighter coupling is needed to drive consistency and enable broader data reuse. When the In-Degree metric is applied, Data being used by more Analytics appears larger on the graph and is a likely candidate for tighter coupling. To support data design for tighter coupling from a cross-functional and even a cross-industry perspective, Teradata offers reference data model blueprints by industry. (See Teradata Data Models)

This calls for a data management ecosystem with data analytics platforms that can easily harvest this cross-functional Degree Centrality of Analytics and Data. Such a data management ecosystem would support varying degrees of data coupling, varying types of analytics, and varying types of data access based on data users. (Learn more about Teradata’s Unified Data Architecture.)

The analysis described above is exploratory and by no means a replacement for a thorough architectural assessment. Eventually the decision to employ the right degree of data coupling should rest on the full architecture requirements including but not limited to data integrity, security, or business value.

In conclusion, what our experiences have taught us in the past will still hold true for the future:
• Data sources are exponentially more valuable when combined or integrated with other data sets
• To maintain sustained competitive advantage business has to continue to search for insights building on the cross-functional centrality of data
• Unified data management ecosystems can now harvest this cross-functional centrality of data at a lower cost with efficient support for varying levels of data integration, analytic types, and users

Contact Teradata to learn more about how Teradata technology, architecture, and industry expertise can efficiently and effectively harvest this centrality of Data and Analytics.


[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).

[4] This specific industry example is illustrative and subject to the limitations of assumptions and quality of the sample data mappings used for this study.




Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.



High Level Data Analytics Graph
(Healthcare Example)


Michael Porter, in an excellent article in the November 2014 issue of the Harvard Business Review[1], points out that smart connected products are broadening competitive boundaries to encompass related products that meet a broader underlying need. Porter elaborates that the boundary shift is not only from the functionality of discrete products to cross-functionality of product systems, but in many cases expanding to a system of systems such as a smart home or smart city.

So what does all this mean from a data perspective? In that same article, Porter mentions that companies seeking leadership need to invest in capturing, coordinating, and analyzing more extensive data across multiple products and systems (including external information). The key take-away is that the movement of gaining competitive advantage by searching for cross-functional or cross-system insights from data is only going to accelerate and not slow down. Exploiting cross-functional or cross-system centrality of data better than anyone else will continue to remain critical to achieving a sustainable competitive advantage.

Understandably, as technology changes, the mechanisms and architecture used to exploit this cross-system centrality of data will evolve. Current technology trends point to a need for a data- and analytic-centric approach that leverages the right tool for the right job and orchestrates these technologies to mask complexity for end users, while also managing complexity for IT in a hybrid environment. (See this article published in Teradata Magazine.)

As businesses embrace the data & analytic-centric approach, the following types of questions will need to be addressed: How can business and IT decide on when to combine which data and to what degree? What should be the degree of data integration (tight, loose, non-coupled)? Where should the data reside and what is the best data modeling approach (full, partial, need based)? What type of analytics should be applied on what data?

Of course, to properly address these questions, an architecture assessment is called for. But for the sake of going beyond the obvious, one exploratory data point in addressing such questions could be to measure and analyze the cross-functional/cross-system centrality of data.

By treating data and analytics as a network of interconnected nodes in Gephi[2], the connectedness between data and analytics can be measured and visualized for such exploration. We can examine a statistical metric called Degree Centrality[3] which is calculated based on how well an analytic node is connected.

The high level sample data analytics graph demonstrates the cross-functional Degree Centrality of analytics from an Industry specific perspective (Healthcare). It also amplifies, from an industry perspective, the need for organizations to build an analytical ecosystem that can easily harness this cross-functional Degree Centrality of data analytics. (Learn more about Teradata’s Unified Data Architecture.)

In the second part of this blog post series we will walk through a zoomed-in view of the graph, analyze the Degree Centrality measurements for sample analytics, and draw some high-level data architecture implications.


[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).


Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.


It is well-known that there are two extreme alternatives for storing database tables on any storage media: storing it row-by-row (as done by traditional “row-store” technology) or storing it column-by-column (as done by recently popular “column-store” implementations). Row-stores store the entire first row of the table, followed by the entire second row of the table, etc. Column-stores store the entire first column of the table, followed by the entire second column of the table, etc. There have been huge amounts of research literature and commercial whitepapers that discuss the various advantages of these alternative approaches, along with various proposals for hybrid solutions (which I discussed in more detail in my previous post).

Despite the many conflicting arguments in favor of these different approaches, there is little question that column-stores compress data much better than row-stores. The reason is fairly intuitive: in a column-store, entire columns are stored contiguously --- in other words, a series of values from the same attribute domain are stored consecutively. In a row-store, values from different attribute domains are interspersed, thereby reducing the self-similarity of the data. In general the more self-similarity (lower entropy) you have in a dataset, the more compressible it is. Hence, column-stores are more compressible than row-stores.

In general, compression rates are very sensitive to the particular dataset that is being compressed. Therefore it is impossible to make any kind of guarantees about how much a particular database system/compression algorithm will compress an arbitrary dataset. However, as a general rule of thumb, it is reasonable to expect around 8X compression if a column-store is used on many kinds of datasets. 8X compression means that the compressed dataset is 1/8th the original size, and scan-based queries over the dataset can thus proceed approximately 8 times as fast. This stellar compression and resulting performance improvements are a major contributor to the recent popularity of column-stores.

It is precisely this renowned compression of column-stores which makes the compression rate of RainStor (a recent Teradata acquisition) so impressive in comparison. RainStor claims a factor of 5 times more compression than what column-stores are able to achieve on the same datasets, and 40X compression overall.

Although the reason why column-stores compress data better than row-stores is fairly intuitive, the reason why RainStor can compress data better than column-stores is less intuitive. Therefore, we will now explain this in more detail.

Take for example the following table, which is a subset of a table describing orders from a particular retail enterprise that sells bicycles and related parts. (A real table would have many more rows and columns, but we keep this example simple so that it is easier to understand what is going on).

Record Order date Ship date Product Price
1 03/22/2015 03/23/2015 “bicycle” 300
2 03/22/2015 03/24/2015 “lock” 18
3 03/22/2015 03/24/2015 “tire” 70
4 03/22/2015 03/23/2015 “lock” 18
5 03/22/2015 03/24/2015 “bicycle” 250
6 03/22/2015 03/23/2015 “bicycle” 280
7 03/22/2015 03/23/2015 “tire” 70
8 03/22/2015 03/23/2015 “lock” 18
9 03/22/2015 03/24/2015 “bicycle” 280
10 03/23/2015 03/24/2015 “lock” 18
11 03/23/2015 03/25/2015 “bicycle” 300
12 03/23/2015 03/24/2015 “bicycle” 280
13 03/23/2015 03/24/2015 “tire” 70
14 03/23/2015 03/25/2015 “bicycle” 250
15 03/23/2015 03/25/2015 “bicycle” 280


The table contains 15 records and shows four attributes --- the order and ship dates of a particular product, the product that was purchased, and the purchase price. Note that there are relationships between some of these columns --- in particular, the ship date is usually 1 or 2 days after the order date, and the price of a given product is usually consistent across orders, though there may be slight variations in price depending on what coupons the customer used to make the purchase.

A column-store would likely use “run-length encoding” to compress the order date column. Since records are sorted by order date, this would compress the column to near its minimum --- it can be compressed as (03/22/2015, 9); (03/23/2015, 6) --- which indicates that 03/22/2015 is repeated 9 straight times, followed by 03/23/2015, which is repeated 6 times. The ship date column, although not sorted, is still very compressible, as each value can be expressed using a small number of bits in terms of how much larger (or smaller) it is than the previous value in the column. However, the other two columns --- product and price --- would likely be compressed using a variant of dictionary compression, where each value is mapped to the minimal number of bits needed to represent it. For large datasets, where there are many unique values for price (or even for product), the number of bits needed to represent a dictionary entry is non-trivial, and the same dictionary entry is repeated in the compressed dataset for every repeated value in the original dataset.
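Here is a minimal Python sketch of those two column-store techniques on the example table's columns (illustrative only): run-length encoding collapses the sorted order-date column to two (value, run-length) pairs, while dictionary encoding still has to repeat a code for every row of the product column, which is the redundancy RainStor's approach goes after.

```python
from itertools import groupby

order_dates = ["03/22/2015"] * 9 + ["03/23/2015"] * 6
products = ["bicycle", "lock", "tire", "lock", "bicycle", "bicycle", "tire",
            "lock", "bicycle", "lock", "bicycle", "bicycle", "tire", "bicycle", "bicycle"]

# Run-length encoding: ideal for the sorted order-date column.
rle = [(value, len(list(run))) for value, run in groupby(order_dates)]
print(rle)  # [('03/22/2015', 9), ('03/23/2015', 6)]

# Dictionary encoding: each distinct product maps to a small code, but a code is
# still stored once per row of the original column.
dictionary = {value: code for code, value in enumerate(sorted(set(products)))}
encoded = [dictionary[p] for p in products]
print(dictionary)
print(encoded)
```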

In contrast, in RainStor, every unique value in the dataset is stored once (and only once), and every record is represented as a binary tree, where a breadth-first traversal of the tree enables the reconstruction of the original record. For example, the table shown above is compressed in RainStor using the forest of binary trees shown below. There are 15 binary trees (their 15 roots are shown as green circles at the top of the figure), corresponding to the 15 records in the original dataset.

Forest of Binary Trees Compression

For example, the binary tree corresponding to record 1 is shown on the left side of the figure. The root points to two children --- the internal nodes “A” and “E”. In turn, node “A” points to 03/22/2015 (corresponding to the order date of record 1) and to 03/23/2015 (corresponding to the ship date of record 1). Node “E” points to “bicycle” (corresponding to the product of record 1) and “300” (corresponding to the price of record 1).

Note that records 4, 6, and 7 also have an order date of 03/22/2015 and a ship date of 03/23/2015. Therefore, the roots of the binary trees corresponding to those records also point to internal node “A”. Similarly, note that record 11 also is associated with the purchase of a bicycle for $300. Therefore, the root for record 11 also points to internal node “E”.

These shared internal nodes are what makes RainStor’s compression algorithm fundamentally different from any algorithm that a column-store is capable of performing. Column-stores are forced to create dictionaries and search for patterns only within individual columns. In contrast, RainStor’s compression algorithm finds patterns across different columns --- identifying the relationship between ship date and order date and the relationship between product and price, and leveraging these relationships to share branches in the trees that are formed, thereby eliminating redundant information. RainStor thus has fundamentally more room to search for patterns in the dataset and compress data by referencing these patterns via the (compressed) location of the root of the shared branch.
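Here is a deliberately simplified Python sketch of that forest-of-trees idea (a two-level approximation of the figure, not RainStor's actual on-disk format): each distinct (order date, ship date) pair and each distinct (product, price) pair is stored exactly once as a shared internal node, and each record is reduced to a pair of references to those nodes. The last few lines hint at why queries can run on the compressed form directly, a point this post returns to when discussing RainStor as an "active" archive.

```python
records = [
    ("03/22/2015", "03/23/2015", "bicycle", 300),
    ("03/22/2015", "03/24/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "tire", 70),
    ("03/22/2015", "03/23/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "bicycle", 250),
    ("03/22/2015", "03/23/2015", "bicycle", 280),
    ("03/22/2015", "03/23/2015", "tire", 70),
    ("03/22/2015", "03/23/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "bicycle", 280),
    ("03/23/2015", "03/24/2015", "lock", 18),
    ("03/23/2015", "03/25/2015", "bicycle", 300),
    ("03/23/2015", "03/24/2015", "bicycle", 280),
    ("03/23/2015", "03/24/2015", "tire", 70),
    ("03/23/2015", "03/25/2015", "bicycle", 250),
    ("03/23/2015", "03/25/2015", "bicycle", 280),
]

date_nodes, item_nodes, roots = {}, {}, []  # shared internal nodes, plus one root per record
for order_date, ship_date, product, price in records:
    a = date_nodes.setdefault((order_date, ship_date), len(date_nodes))  # node "A", "B", ...
    e = item_nodes.setdefault((product, price), len(item_nodes))         # node "E", "F", ...
    roots.append((a, e))

print(len(records), "records ->", len(date_nodes), "date nodes +", len(item_nodes), "item nodes")

# Query the compressed form directly: a distinct-product count only scans the
# shared item nodes ...
distinct_products = sorted({product for product, _ in item_nodes})
# ... and a selection (price = 18) finds the matching item nodes first, then
# walks back to the roots (records) that reference them.
lock_node_ids = {node_id for (product, price), node_id in item_nodes.items() if price == 18}
matching_records = [i + 1 for i, (_, e) in enumerate(roots) if e in lock_node_ids]
print(distinct_products, matching_records)
```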

For a traditional archiving solution, compression rate is arguably the most important feature (right up there with immutability). Indeed, RainStor’s compression algorithm enables it to be used for archival use-cases, and RainStor provides all of the additional features you would expect from an archiving solution: encryption, LDAP/AD/PAM/Kerberos/PCI authentication and security, audit trails and logging, retention rules, expiry policies, and integrated implementation of existing compliance standards (e.g. SEC 17a-4).

However, what brings RainStor to the next level in the archival solutions market is that it is an “active” archive, meaning that the data managed by RainStor can be queried at high performance. RainStor provides a mature SQL stack for native querying of compressed RainStor data, including ANSI SQL 1992 and 2003 parsers and a full MPP query execution engine. For enterprises with Hadoop clusters, RainStor is fully integrated with the Cloudera and Hortonworks distributions of Hadoop --- RainStor compressed data files can be partitioned over an HDFS cluster and queried in parallel with HiveQL (or MapReduce or Pig). Furthermore, RainStor integrates with YARN for resource management, with HCatalog for metadata management, and with Ambari for system monitoring and management.

The reason why most archival solutions are not “active” is that the compression algorithms used to reduce the data size before archival are so heavy-weight, that significant processing resources must be invested in decompressing the data before it can be queried. Therefore, it is preferable to leave the data archived in compressed form, and only decompress it at times of significant need. In general, a user should expect significant query performance reductions relative to querying uncompressed data, in order to account for the additional decompression time.

The beauty of RainStor’s compression algorithm is that even though it gets compression ratios comparable to other archival products, it is not so heavy-weight that the data must be decompressed prior to querying it. In particular, the binary tree structures shown above are actually fairly straightforward to perform query operations on directly, without requiring decompression prior to access. For example, a count distinct or a group-by operation can be performed via a scan of the leaves of the binary trees. Furthermore, selections can be performed via a reverse traversal of the binary trees from the leaves that match the selection predicate. In general, since there is a one-to-one mapping of records in the uncompressed dataset to the binary trees in RainStor’s compressed files, all query operations can be expressed in terms of operations on these binary trees. Therefore, RainStor queries can benefit from the I/O improvement of scanning in less data (due to the smaller size of the compressed files on disk/in memory) without paying the cost of fully decompressing these files after they are read from storage. This leads to RainStor’s claims of 2X-100X performance improvement on most queries --- an industry-leading claim in the archival market.

In short, RainStor’s strong claims around compression and performance are backed up by the technology that is used under the covers. Its compression algorithm is able to identify and remove redundancy both within and across columns. Furthermore, the resulting data structures produced by the algorithm are amenable to direct operation on the compressed data. This allows the compressed files to be queried at high performance, and positions RainStor as a leading active-archive solution.



Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). Follow Daniel on Twitter @Daniel_Abadi.

Modeling for Intelligence: Conceptual, Logical, Physical

Posted on: March 24th, 2015 by Guest Blogger


It’s a fact: intelligence in big business, big data and big analytics begins with data modelers. And understand this: we data modelers are a special breed: we LIKE translating business requirements into conceptual, logical, and physical data models. We enjoy analytical work with tools like Entity Relationship Diagrams (ERD), which are valuable formats that we use to describe data in a business domain. Really? Yes, an ERD is cool; it uses “entities” to define business terminology (or objects) and business relationships, as well as all of the data points as attributes. And there are many ways a data modeler can work with the various data types to create models for actionable intelligence. Let’s talk about ERD’s.

First, there is the quintessential Conceptual Data Model; it shows the business terminology that business leaders use every day. This model is usually displayed on a single page and can be understood and appreciated very quickly. Because of this, it usually shows only the highest-level terms, omitting lower-level detail. This model does not have any attributes associated with the entities (or nouns) but it does show the relationship between those entities. Such objects as "Sales Order," "Customer," and "Payment" may be seen here. This model can be used to outline the business quickly and to vet the overall correctness of the model. This is also a good model to show to people who are not technically oriented but know the business well. Changes in the business here should be reflected in the Logical Data Model. Key Point: This model does not reveal any business rules, rather, just connected high-level business concepts.

Once the conceptual model is validated, the Logical Data Model is the next step. A logical data model is an explosion of the conceptual data model into all of the entities (many, many more than in the conceptual model), attributes, and relationships that completely define a business. This really gets into the detail of the business, where the "Sales Order" entity may be attributed with "Sales Order Identifier," "Customer Identifier," "Total Sales Amount," and many more. This model is the base or "master" model from which all other models are derived. It is important to use this model in a continuous mode: as the business changes, the logical model should be made to reflect those changes. These changes are then reflected, as needed, into the physical data model. Key point: This model shows all the detailed business rules via relationships, which is one reason it is so valuable to maintain.

The Physical Data Model is the logical model made ready for use on a relational database management system (RDBMS) through alterations that may improve access speed, data update times, user access, data storage efficiency, and many other possibilities. This stage requires a very good knowledge of the intended use by the business users across the entire enterprise, and also requires an extensive knowledge of the RDBMS and its capabilities. The physical data model should also change when the logical model has to be changed (when the business terminology changes, or when other important changes need to be made in the business). Key point: This model often loses many business rules, which must then be transferred to external processes surrounding the maintenance of the RDBMS. As the maintenance of the RDBMS changes, the Logical Data Model must be consulted to review and validate that the business rules are being properly applied to the maintenance and utilization of the RDBMS.

One of the advantages to using these models is that, because they are validated in sequence, changes are fairly rare and specific to the business needs. This means the models are relatively stable over time.

Stable models are a good thing.

The really good ones could be called supermodels. The most attractive models solve the most complex problems with stability and elegance. I won't digress on the differences between a supermodel on a runway at Sak's versus a supermodel in a Sak's Teradata Data Warehouse.

The good ones begin, however, with this three-tiered approach, which provides the “best practice” pathway for creating and managing data models.

Mark Crosby is a Senior Product Manager with Architecture and Modeling Solutions and is responsible for development of the Teradata Transportation and Logistics Data Model. Mark has been with Teradata since 1989 and has over 45 years’ experience in software development, data center management, database administration, and data modeling, having also worked for the U.S. Department of Defense, The American Banker’s Association, and consulting firms.


PART FIVE: This is the last blog in my series about Near Real Time data acquisition from SAP. This final blog is co-written with Arno Luijten, who is one of Teradata’s lead engineers. He is instrumental in demystifying the secrets of the elusive SAP clustered and pooled tables.

There is a common misconception that the Pool and Cluster tables in SAP R/3 can only be deciphered by the SAP R/3 application server, giving them an almost mythical status. The phrase that is used all over the ‘help sites’ and forums is “A clustered and a pooled table cannot be read from outside SAP because certain data are clustered and pooled in one field”… which makes replicating these tables pretty pointless – right?

But what exactly are Pooled and Cluster tables in SAP R/3 anyway? We thought we would let SAP give us the answer and searched their public help pages (SAP Help Portal), but that yielded limited results, so we looked further (Googled "cluster tables") and found the following explanation (Technopedia link):

“Cluster tables are special types of tables present in the SAP data dictionary. They are logical tables maintained as records of the normal SAP tables, which are commonly known as transparent tables. A key advantage of using cluster tables is that data is stored in a compressed format, reducing memory space and the landscape network load for retrieving information from these tables.”

Reading further on the same page, there are six major bullet points describing the features, five of which basically tell you that what we did cannot be done. Luckily, we didn’t let this faze us!

We agree: the purpose of SAP cluster tables is to save space because of the huge volume of the data contained in these tables and the potential negative impact that this may have on the SAP R/3 application. We know this because the two most (in)famously large Cluster tables are RFBLG and KOCLU, which contain the financial transactions and price conditions. SAP’s ABAP programmers often refer to BSEG (for financial transactions) and KONV (for the price conditions).

From the database point of view, these tables do not exist but are contained in the physical tables named RFBLG and KOCLU. Typically these (ABAP) tables contain a lot of data. There are more tables set up in this way, but from a data warehousing point of view these two tables are probably the most relevant. Simply skipping these tables would not be a viable option for most clients.

Knowing the importance of the Pool and Cluster table, the value of data replication, and the value of operational analytics, we forged ahead with a solution. The encoded data from the ABAP table is stored as a BLOB “Binary Large Object” in the actual cluster table. To decode the data in the BLOB we wrote a C++ program as a Teradata User Defined Function (UDF) which we call the “Decoder” and it is installed directly within the Teradata database.

There can be a huge volume of data present in the cluster tables (hence the usage of cluster logic) and as a result decoding can be a lot of work and can have an impact on the performance of the SAP system. Here we have an extra advantage over SAP R/3 because the Decoder effectively allows us to bypass the ABAP layer and use the power of the Teradata Database. Our MPP capabilities allow decoding to be done massively faster than the SAP application, so decoding the RFBLG/KOCLU tables in Teradata can save a lot of time.

Over the last few months I have written about data replication, starting with a brief SAP history; I questioned real-time systems; and I covered the benefits of data replication and how it is disruptive to analytics for SAP R/3.

In my last blog I looked at the technical complexities we have had to overcome to build a complete data replication solution into Teradata Analytics for SAP® Solutions. It has not been a trivial exercise - but the benefits are huge!

Our Data Replication capability enables operational reporting and managerial analytics from the same source; it increases flexibility, significantly reduces the burden on the SAP R/3 system(s), and of course, delivers SAP data in near-real time for analytics.