
The Joy of Data Viz: The Data You Weren’t Looking For

Posted on: December 8th, 2015 by Guest Blogger

 

“A good sketch is better than a long speech” - Napoléon Bonaparte

I recently came across this quote on the opening page of Phil Simon’s book, “The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions,” which is available online.

Data visualization, or Data Viz as it is often referred to, is expanding -- especially in the area of analytics. Numbers are numbers -- but if you can represent them graphically, they enter the realm of meaning. If we represent the data visually at a detailed level of grain, it can bring about surprising discoveries.

Data visualization can bring what was once hidden into plain sight. In Zen terms, it can help one achieve what is called “beginner's eyes.”

I talked to Yasmeen Ahmad, a talented and creative data scientist at Teradata, about her visualization project titled “Trapping Anomalies.” In this visualization, the two unexpected blue clusters helped to identify fraudulent behavior.

To see the full visualization and details about the project, visit the Teradata Art Gallery.

For this visualization, Yasmeen used an open source tool called Gephi, which bills itself as “The Open Graph Viz Platform.” It is a great interactive tool for data analysis and can use just about any data source you provide. To download the tool, click here.

There is so much to explore -- and so I thought I'd help you find even more to check out for yourself. Here are some other open source data visualization tools that are widely used.

D3 – a library
D3 stands for data-driven documents. It is a JavaScript library for manipulating documents based on data. You can download the latest version from this online page. And here is an example of a tree structure created with D3.

R – a language
R is a language and environment for statistical computing and graphics. It provides a variety of statistical and graphical techniques, and is highly extensible. For an example of an R visualization, click here.

Processing
Processing is a programming language used within the context of the visual arts. Even with some very basic code and your data source you can create stunning visuals. Here is an example of a visualization I did a few years ago using just the color data from an image of the Golden Gate Bridge.

The image on the left is the source for color data and the image on the right is the visualization at a point in time -- as the code continuously scans the colors.

To visit the site and download the Processing code, click here.

Also, here is a link to an interesting video by Jer Thorp, my former instructor. The guy is brilliant. His work brings to light some interesting “data made human” ideas, which I find very intriguing and inspiring.

There are also several “data viz” products on the market today. While Tableau is one widely known tool, the Teradata Aster Discovery Platform provides a suite of ready-to-use SQL for data discovery and visualization. Here is a link to the Aster visualization presentation - Teradata Aster VIZ.

These are exciting times for data visualization techniques and related tools. I hope this quick overview gives Teradata readers some ideas for further exploration into the world of Data Visualization. There are so many positive surprises to be discovered in the data you weren't looking for!

Lorie Nelson is the Senior Product Manager for Teradata’s Travel and Hospitality Data Model. She describes herself as an artist-in-residence with a passion for data.

 

 

Data Modeling Requires Detailed Mapping — Learn Why

Posted on: November 30th, 2015 by Guest Blogger

 

Mapping is an important step to understanding your data and where the data resides in your ecosystem. Mapping takes us from the known to the unknown and is effectively accomplished by using mapping tools, adopting best practices, and having a common understanding of how the mappings will be used. But mapping does take a considerable amount of time and requires a person with extensive knowledge in the source or target system or both.

Our team maps from an industry logical data model (core model) to access path building blocks -- then to semantic structures such as dimensions, or lower- to higher-level facts.

Access path building blocks (APBBs) are designed to help the semantic modeler develop dimensions and facts (which are specific and denormalized) for the semantic data model from the core data model (which is normalized and generalized). APBBs bridge the gap between the dimensional and normalized logical data models. To help the modeler in using APBBs, construction maps are included in the SMBBs to illustrate how a dimension can be built from the core data model (the appropriate Teradata industry data model) using the access path building blocks. The following picture is an example of an APBB construction map. The green boxes represent tables in the core data model (the Teradata Financial Services Data Model, in this example), the orange boxes are the APBBs, and the white boxes are the resulting dimensions.

In this example the construction map visually lays out the data needed to identify a person who is insured by a policy (the join path), and the diagram can also be understood by business analysts who may not model or write SQL.
Performing mappings helps our team identify gaps in both models. In one case, core data may need to be added to the semantic data model; in the other, the core data model may need to be expanded with data required by BI reports. The gap analysis identifies the missing pieces of data needed to address the business requirements.

Initially, our group performed detailed attribute-to-attribute mapping, but then switched to higher-level, entity-to-entity mappings -- to save time. The time saved on detailed attribute-to-attribute mappings, many of which are obvious, was instead focused on describing the purpose of the mappings through filter and join notes. In the picture above, Gender Type is an entity while Gender Type Description (in Gender Type) is an attribute.

We found that we could save a considerable amount of time by reusing the mappings from areas such as address or product in the same industry or across industries. And using a tool to perform the mappings makes them easily reusable and adaptable, and assists in standardizing around best practices.
Mappings help with many things, including:
• Focusing on specific areas in an ecosystem
• Finding and resolving information gaps as well as design gaps
• Creating the core layer views for the semantic layer
• Establishing a reusable base for common content such as location or product
• Supporting a more precise way of communicating and refining details during design and implementation

Our team finds mapping to be useful, reusable, and educational -- and a worthwhile investment of our time.
For mapping we use Teradata Mapping Manager (TMM), found on Teradata Developer Exchange. Information on our Industry Data Models (iDMs) and Solution Modeling Building Blocks is on www.teradata.com.

Karen Papierniak is a Product Manager responsible for development of Teradata’s Industry Solution Modeling Building Blocks and Data Integration Roadmap portfolio -- which spans eight major industries and is used by customers worldwide. Karen’s roles at Teradata have been in software development, systems architecture, and visual modeling while working in a variety of industries including retail and communications.

Pluralism and Secularity in a Big Data Ecosystem

Posted on: August 25th, 2015 by Guest Blogger

 

Solutions around today's analytic ecosystem are too technically driven, without focusing on business value. The buzzwords seem to overcompensate for the realities of implementation and cost of ownership. I challenge you to view your analytic architecture using pluralism and secularity. Without such a view of this world, your resume will fill out nicely but your business value will suffer.

In my previous role, prior to joining Teradata, I was given the task of trying to move "all" of our organization’s BI data to Hadoop. I will share my approach - how best-in-class solutions come naturally when pluralism and secularity are used to support a business-first environment.

Big data has exposed some great insights into what we can, should, and need to do with our data. However, this space is filled with radical opinions and the pressure to "draw a line in the sand" between time-proven methodologies and what we know as "big data." Some may view these spaces moving in opposite directions; however, these spaces will collide. The question is not "if" but "when." What are we doing now to prepare for this inevitability? Hadapt seems to be moving in the right direction in terms of leadership between the two spaces.

Relational Databases
I found many of the data sets in relational databases to be lacking in structure, highly transient, and loosely coupled. Data scientists needed to have quick access to data sets to perform their hypothesis testing.

Continuously requesting IT to rerun their ETL processes was highly inefficient. A data scientist once asked me "Why can't we just dump the data in a Linux mount for exploration?" Schema-on-write was too restrictive as the data scientists could not predefine the attributes for the data set for ingestion. As the data sets became more complex and unstructured, the ETL processes became exponentially more complicated and performance was hindered.
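To make the schema-on-write versus schema-on-read contrast concrete, here is a minimal Python sketch of the difference the data scientists were reacting to. The record contents and column names are invented for illustration: a fixed, predeclared attribute list rejects records that carry new attributes, while landing the raw records and deriving the structure at read time does not.

```python
import json

# Hypothetical loosely structured records whose attributes drift over time.
raw_records = [
    '{"id": 1, "device": "pump-7", "status": "ok"}',
    '{"id": 2, "device": "pump-9", "status": "fail", "error_code": 503}',
    '{"id": 3, "device": "valve-2", "temperature_c": 81.4}',
]

# Schema-on-write: the attribute list is fixed before ingestion; any record
# that does not fit is rejected (or the ETL job must be reworked).
DECLARED_COLUMNS = ("id", "device", "status")

def load_schema_on_write(lines):
    rows, rejects = [], []
    for line in lines:
        record = json.loads(line)
        if set(record) <= set(DECLARED_COLUMNS):
            rows.append(tuple(record.get(c) for c in DECLARED_COLUMNS))
        else:
            rejects.append(record)  # new attributes force a schema change
    return rows, rejects

# Schema-on-read: land the raw records and derive structure while exploring.
def explore_schema_on_read(lines):
    records = [json.loads(line) for line in lines]
    observed = sorted({key for r in records for key in r})
    return records, observed

rows, rejects = load_schema_on_write(raw_records)
records, observed_columns = explore_schema_on_read(raw_records)
print("loaded:", len(rows), "rejected:", len(rejects))
print("columns observed at read time:", observed_columns)
```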

I also found during this exercise that my traditional BI analysts were perplexed when formulating questions about the data. One of the reasons was that businesses did not know what questions to ask. This is a common challenge in the big data ecosystem. We are used to knowing our data and being able to come up with incredible questions about it. The BI analyst's world has been disrupted, as they now need to ask "What insights/answers do I have about my data?" – (according to Ilya Katsov in one of his blogs).

Hadoop/NoSQL
The product owner of Hadoop was convinced that the entire dataset should be hosted on Amazon Web Services (S3), which would allow our analytics (via Elastic MapReduce) to perform at incredible speeds. However, due to various ISO guidelines, the data sets had to be encrypted at rest and in transit, which degraded performance by approximately 30 percent.

Without an access path model, logical model, or unified model, business users and data scientists were left with little appetite for unified analytics. Data scientists were left to their own guidelines for integrated/federated/governed/liberated post-discovery analytical sets.

Communication with the rest of the organization became an unattainable goal. The models which came out of discovery were not federated across the organization as there was a disconnect between the data scientists, data architects, Hadoop engineers, and data stewards -- who spoke different languages. Data scientists were creating amazing predictive models and at the same time data stewards were looking for tools to help them provide insight in prediction for the SAME DATA.

Using NoSQL for a specific question on a dataset required a new collection set. To maintain and govern the numerous collections became a burden. There had to be a better way to answer many questions without having a linear relationship to the number of collections instantiated. The answer may be within access path modeling.

Another challenge I faced was when users wanted a graphical representation of the data and the embedded relationships or lack thereof. Are they asking for a data model? The users would immediately say no, since they read in a blog somewhere that data modeling is not required using NoSQL technology.

At the end of this entire implementation I found myself needing to integrate these various platforms for the sake of providing a business-first solution. Maybe the line in the sand isn't a business-first approach? Those that drive Pluralism (a condition or system in which two or more states, groups, principles, sources of authority, etc., coexist) and Secularity (not being devoted to a specific technology or data 'religion') within their analytic ecosystem can truly deliver a business-first solution approach while avoiding the proverbial "silver bullet" architecture solutions.

In my coming post, I will share some of the practices for access path modeling within Big Data and how it supports pluralism and secularity within a business-first analytic ecosystem.

Sunile Manjee

Sunile Manjee is a Product Manager in Teradata’s Architecture and Modeling Solutions team. Big Data solutions are his specialty, along with the architecture to support a unified data vision. He has over 12 years of IT experience as a Big Data architect, DW architect, application architect, IT team lead, and 3gl/4gl programmer.

Optimization in Data Modeling 1 – Primary Index Selection

Posted on: July 14th, 2015 by Guest Blogger

 

In my last blog I spoke about the decisions that must be made when transforming an Industry Data Model (iDM) from Logical Data Model (LDM) to an implementable Physical Data Model (PDM). However, being able to generate DDL (Data Definition Language) that will run on a Teradata platform is not enough – you also want it to perform well. While it is possible to generate DDL almost immediately from a Teradata iDM, each customer’s needs mandate that existing structures be reviewed against data and access demographics, so that optimal performance can be achieved.

Having detailed data and access path demographics during PDM design is critical to achieving great performance immediately, otherwise it’s simply guesswork. Alas, these are almost never available at the beginning of an installation, but that doesn’t mean you can’t make “excellent guesses.”

The single most influential factor in achieving PDM performance is proper Primary Index (PI) selection for warehouse tables. Data modelers are focused on entity/table Primary Keys (PKs), since they define uniqueness at the row level. Because of this, a lot of physical modelers tend to implement the PK as a Unique Primary Index (UPI) on each table as a default. But one of the keys to Teradata’s great performance is that it utilizes the PI to physically distribute data within a table across the entire platform to optimize parallelism. Each processor gets a piece of the table based on the PI, so rows from different tables with the same PI value are co-resident and do not need to be moved when two tables are joined.
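As a toy illustration of why the PI choice matters -- this is not Teradata internals; the AMP count, hash function, and table contents below are invented -- here is a small Python sketch showing that when two tables share a PI, rows with matching PI values land on the same AMP and can be joined without redistributing data:

```python
# Toy illustration: rows are assigned to AMPs by hashing the Primary Index
# value, so two tables that share a PI value are co-resident on one AMP.
import hashlib

N_AMPS = 4  # hypothetical small system

def amp_for(pi_value):
    digest = hashlib.md5(str(pi_value).encode()).hexdigest()
    return int(digest, 16) % N_AMPS

orders  = [{"order_id": 100 + i, "party_id": i % 6} for i in range(12)]
parties = [{"party_id": i, "name": f"party-{i}"} for i in range(6)]

# PI chosen as party_id on both tables: matching rows hash to the same AMP.
orders_by_amp  = {a: [o for o in orders  if amp_for(o["party_id"]) == a] for a in range(N_AMPS)}
parties_by_amp = {a: [p for p in parties if amp_for(p["party_id"]) == a] for a in range(N_AMPS)}

for amp in range(N_AMPS):
    joined = [(o["order_id"], p["name"])
              for o in orders_by_amp[amp]
              for p in parties_by_amp[amp]
              if o["party_id"] == p["party_id"]]
    print(f"AMP {amp}: {len(orders_by_amp[amp])} order rows, "
          f"{len(parties_by_amp[amp])} party rows, {len(joined)} joins done locally")

# Had orders used order_id (its PK) as the PI instead, order rows and party
# rows for the same party_id would usually sit on different AMPs, and one side
# of the join would have to be redistributed before joining.
```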

In a Third Normal Form (3NF) model no two entities (outside of super/subtypes and rare exceptions) will have the same PK, so if chosen as a PI, it stands to reason that no two tables share a PI and every table join will require data from at least one table to be moved before a join can be completed – not a solid performance decision to say the least.

The iDMs have preselected PIs largely based on identifiers common across subject areas (e.g., Party Id) so that all information regarding that ID will be co-resident and joins will be AMP-local. These non-unique PIs (NUPIs) are a great starting point for your PDM, but again need to be evaluated against customer data and access plans to ensure that both performance and reasonably even data distribution are achieved.

Even data distribution across the Teradata platform is important, since skewed data can contribute both to poor performance and to space allocation problems (run out of space on one AMP, run out of space on all). However, it can be overemphasized to the detriment of performance.

Say, for example, a table has a PI of PRODUCT_ID, and there are a disproportionate number of rows for several Products, causing skewed distribution. Altering the PI to the table PK instead will provide perfectly even distribution, but remember: when joining to that table, if all elements of the PK are not available, then the rows of the table will need to be redistributed, most likely by PRODUCT_ID.

This puts them back under the AMP where they were in the skewed scenario. This time, instead of a “rest state” skew, the rows will skew during redistribution, and this will happen every time the table is joined to – not a solid performance decision. Optimum performance can therefore be achieved with sub-optimum distribution.

iDM tables relating two common identifiers will usually have one of the IDs pre-selected as a NUPI. In some installations the access demographics will show that the other ID may be the better choice. If so, change it! Or it may leave you with no clear choice, in which case picking one is almost assuredly better than changing the PI to a composite index consisting of both IDs, as this will only result in a table no longer co-resident with any table indexed by either of the IDs alone.

There are many other factors that contribute to achieving optimal performance of your physical model, but they all pale in comparison to a well-chosen PI. In my next blog we’ll look at some more of these and discuss when and how best to implement them.


Jake Kurdsjuk is Product Manager for the Teradata Communications Industry Data Model, purchased by more than one hundred Communications Service Providers worldwide. Jake has been with Teradata since 2001 and has 25 years of experience working with Teradata within the Communications Industry, as a programmer, DBA, Data Architect and Modeler.

Why We Love Presto

Posted on: June 24th, 2015 by Daniel Abadi

 

Concurrent with acquiring Hadoop companies Hadapt and Revelytix last year, Teradata opened the Teradata Center for Hadoop in Boston. Teradata recently announced that a major new initiative of this Hadoop development center will include open-source contributions to a distributed SQL query engine called Presto. Presto was originally developed at Facebook, and is designed to run high performance, interactive queries against Big Data wherever it may live --- Hadoop, Cassandra, or traditional relational database systems.

Among the people who will be part of this initiative and contributing code to Presto is a subset of the Hadapt team that joined Teradata last year. In the following, we will dive deeper into the thinking behind this new initiative from the perspective of the Hadapt team. It is important to note upfront that Teradata’s interest in Presto, and the people contributing to the Presto codebase, extends beyond the Hadapt team that joined Teradata last year. Nonetheless, it is worthwhile to understand the technical reasoning behind the embrace of Presto from Teradata, even if it presents a localized view of the overall initiative.

Around seven years ago, Ashish Thusoo and his team at Facebook built the first SQL layer over Hadoop as part of a project called Hive. At its essence, Hive was a query translation layer over Hadoop: it received queries in a SQL-like language called Hive-QL, and transformed them into a set of MapReduce jobs over data stored in HDFS on a Hadoop cluster. Hive was truly the first project of its kind. However, since its focus was on query translation into the existing MapReduce query execution engine of Hadoop, it achieved tremendous scalability but poor efficiency and performance, which ultimately led to a series of subsequent SQL-on-Hadoop solutions that claimed 100X speed-ups over Hive.

Hadapt was the first such SQL-on-Hadoop solution that claimed a 100X speed-up over Hive on certain types of queries. Hadapt was spun out of the HadoopDB research project from my team at Yale and was founded by a group of Yale graduates. The basic idea was to develop a hybrid system that is able to achieve the fault-tolerant scalability of the Hive MapReduce query execution engine while leveraging techniques from the parallel database system community to achieve high performance query processing.

The intention of HadoopDB/Hadapt was never to build its own query execution layer. The first version of Hadapt used a combination of PostgreSQL and MapReduce for distributed query execution. In particular, the query operators that could be run locally, without reliance on data located on other nodes in the cluster, were run using PostgreSQL’s query operator set (although Hadapt was written such that PostgreSQL could be replaced by any performant single-node database system). Meanwhile, query operators that required data exchange between multiple nodes in the cluster were run using Hadoop’s MapReduce engine.

Although Hadapt was 100X faster than Hive for long, complicated queries that involved hundreds of nodes, its reliance on Hadoop MapReduce for parts of query execution precluded sub-second response time for small, simple queries. Therefore, in 2012, Hadapt started to build a secondary query execution engine called “IQ” which was intended to be used for smaller queries. The idea was that all queries would be fed through a query-analyzer layer before execution. If the query was predicted to be long and complex, it would be fed through Hadapt’s original fault-tolerant MapReduce-based engine. However, if the query would complete in a few seconds or less, it would be fed to the IQ execution engine.
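Conceptually, that query-analyzer layer is just a router driven by an estimated runtime. The sketch below is a hypothetical Python illustration of the idea only -- the cost model, threshold, and engine names are invented, not Hadapt's actual implementation:

```python
# Conceptual sketch of routing a query to a fault-tolerant batch engine or an
# interactive engine based on an estimated runtime. Cost model is invented.
from dataclasses import dataclass

@dataclass
class QueryPlan:
    scanned_rows: int   # estimated rows read
    join_count: int     # number of joins in the plan
    nodes: int          # nodes the query will touch

INTERACTIVE_THRESHOLD_SECONDS = 3.0

def estimate_runtime_seconds(plan: QueryPlan) -> float:
    # Naive linear cost model standing in for real statistics-based estimation.
    return (plan.scanned_rows / 5_000_000) + 0.5 * plan.join_count + 0.01 * plan.nodes

def route(plan: QueryPlan) -> str:
    predicted = estimate_runtime_seconds(plan)
    return "interactive-engine" if predicted <= INTERACTIVE_THRESHOLD_SECONDS else "batch-mapreduce-engine"

print(route(QueryPlan(scanned_rows=2_000_000, join_count=1, nodes=10)))     # interactive-engine
print(route(QueryPlan(scanned_rows=900_000_000, join_count=4, nodes=200)))  # batch-mapreduce-engine
```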

In 2013 Hadapt integrated IQ with Apache Tez in order to avoid redundant programming efforts, since the primary goals of IQ and Tez were aligned. In particular, Tez was designed as an alternative to MapReduce that can achieve interactive performance for general data processing applications. Indeed, Hadapt was able to achieve interactive performance on a much wider range of queries when leveraging Tez than it was able to achieve previously.

Figure 1: Intertwined Histories of SQL-on-Hadoop Technology

Unfortunately Tez was not quite a perfect fit as a query execution engine for Hadapt’s needs. The largest issue was that before shipping data over the network during distributed operators, Tez first writes this data to local disk. The overhead of writing this data to disk (especially when the size of the intermediate result set was large) precluded interactivity for a non-trivial subset of Hadapt’s query workload. A second problem is that the Hive query operators that are implemented over Tez use (by default) traditional Volcano-style row-by-row iteration. In other words, a single function-invocation for a query operator would process just a single database record. This resulted in a larger number of function calls required to process a large dataset, and poor instruction cache locality as the instructions associated with a particular operator were repeatedly reloaded into the instruction cache for each function invocation. Although Hive and Tez have started to alleviate this issue with the recent introduction of vectorized operators, Hadapt still found that query plans involving joins or SQL functions would fall back to row-by-row iteration.
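The difference between Volcano-style iteration and vectorized execution can be sketched in a few lines of Python. This is a simplified illustration of the general technique -- not Hive, Tez, or Presto code: the row-at-a-time version invokes each operator once per record, while the vectorized version invokes each operator once per batch, amortizing call overhead and keeping the operator's code hot in the cache.

```python
# Simplified contrast between row-at-a-time (Volcano-style) iteration and
# vectorized, batch-at-a-time processing of a filter + projection.
values = list(range(1_000_000))

# Row-at-a-time: one function invocation per record, per operator.
def filter_row(v):
    return v % 7 == 0

def project_row(v):
    return v * 2

def volcano_style(rows):
    out = []
    for v in rows:              # each row passes through each operator separately
        if filter_row(v):
            out.append(project_row(v))
    return out

# Vectorized: each operator invocation processes a whole batch of records.
def vectorized(rows, batch_size=4096):
    out = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        kept = [v for v in batch if v % 7 == 0]   # one invocation filters the batch
        out.extend(v * 2 for v in kept)           # one invocation projects the batch
    return out

assert volcano_style(values) == vectorized(values)
```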

The Hadapt team therefore decided to refocus its query execution strategy (for the interactive query part of Hadapt’s engine) to Presto, which presented several advantages over Tez. First, Presto pipelines data between distributed query operators directly, without writing to local disk, significantly improving performance for network-intensive queries. Second, Presto query operators are vectorized by default, thereby improving CPU efficiency and instruction cache locality. Third, Presto dynamically compiles selective query operators to byte code, which lets the JVM optimize and generate native machine code. Fourth, it uses direct memory management, thereby avoiding Java object allocations, heap memory overhead, and garbage collection pauses. Overall, Presto is a very advanced piece of software, and very much in line with Hadapt’s goal of leveraging as many techniques from modern parallel database system architecture as possible.

The Teradata Center for Hadoop has thus fully embraced Presto as the core part of its technology strategy for the execution of interactive queries over Hadoop. Consequently, it made logical sense for Teradata to take its involvement in Presto to the next level. Furthermore, Hadoop is fundamentally an open source project, and in order to become a significant player in the Hadoop ecosystem, Teradata needs to contribute meaningful and important code to the open source community. Teradata’s recent acquisition of Think Big serves as further motivation for such contributions.

Therefore Teradata has announced that it is committed to making open source contributions to Presto, and has allocated substantial resources to doing so. Presto is already used by Silicon Valley stalwarts Facebook, Airbnb, Netflix, Dropbox, and Groupon. However, Presto’s enterprise adoption outside of Silicon Valley remains small. Part of the reason for this is that the ease-of-use and enterprise features typically associated with modern commercial database systems are not fully available with Presto. Missing are an out-of-the-box, simple-to-use installer, database monitoring and administration tools, and third-party integrations. Therefore, Teradata’s initial contributions will focus in these areas, with the goal of bridging the gap to getting Presto widely deployed in traditional enterprise applications. This will hopefully lead to more contributors and momentum for Presto.

For now, Teradata’s new commitments to open source contributions in the Hadoop ecosystem are focused on Presto. Teradata’s commitment to Presto and its commitment to making meaningful contributions to an open source project is an exciting development. It will likely have a significant impact on enterprise adoption of Presto. Hopefully, Presto will become a widely used open source parallel query execution engine --- not just within the Hadoop community, but due to the generality of its design and its storage layer agnosticism, for relational data stored anywhere.

 

Learn more or download Presto now.

==================================================================================

Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). Follow Daniel on Twitter @Daniel_Abadi

 

I recently participated in a business analytics project for non-profits that, as the planning progressed, seemed like a perfect opportunity to implement an agile approach, except that the work was to be completed in two days! But all the developers would be co-located. We had three objectives that fit the profile of user stories. We would cleanse, analyze, and report on the data and, hopefully, discover some insights. We would have the business stakeholders in the room with us the whole time. But doing all this in two days seemed like agile on steroids to me. And it reminded me of an old Steven Wright joke, “I put instant coffee in the microwave and almost went back in time!”

So, if you put agile on steroids, can you go back in time? Well, maybe not, but we did accomplish a lot in those two days! The project was a DataDive, a collaboration between the non-profit DataKind and Teradata, held during the two days before the Teradata Partners 2014 conference.

I was a Data Ambassador, paired with another Data Ambassador, to work with a non-governmental organization (NGO) to prepare for the DataDive and make sure we reached our goals. The NGO that DataKind assigned us to was iCouldBe, an organization that provides online mentoring to at-risk kids at over 250 schools in the U.S. Since I am not a data scientist or analyst, I found my role to be gathering requirements from the business stakeholders at iCouldBe. I worked with them to prioritize the requirements and identify the expected business value. Sounds like the product owner role in “Scrum” -- right? My partner Data Ambassador worked with the head of IT at iCouldBe to identify the data we needed and worked to prepare it for the DataDive. This is similar to a Scrum project, where preparatory work must be completed to be ready for the first sprint.

DataKind wanted us to identify the tasks to accomplish each user story, so I immediately thought about using a task board for the actual DataDive. I created one ahead of time in Excel that identified the tasks for each user story as well as the development and handoff phases for each story. I didn’t realize it at the time, but I was creating a Kanban board (a portion of the board is shown in the picture) that allowed us to track workflow.

Once I got to the DataDive, I recreated the Kanban board using flip chart paper and used sticky notes for the tasks, much the way it might be done for a real project. The user stories were listed in priority order from top to bottom. The tasks represented the metrics, dimensions, text and other analysis required to address the user stories. Some tasks supported multiple user stories, so we noted those and used that “re-use” to help prioritize. We placed these reusable tasks at the top of the board in the swimlane with the highest priority user story. (DataDive Kanban Board - Partial Workflow)

 

For example, the number of posts and words per post that mentors and mentees made in the online mentoring program was an important metric that iCouldBe wanted to calculate to help identify successful mentee completion of the program. Are mentees who write more posts and more words per post more likely to complete the program? This question addresses the first user story. But the number of posts and words per post can also be used to analyze the amount of engagement between mentors and their mentees and what areas of the curriculum need to be improved.
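For readers who want to see what that kind of metric looks like, here is a minimal sketch using pandas. The column names and sample posts are invented; the real DataDive analysis was of course richer than this:

```python
# Minimal sketch of the engagement metric described above: posts and average
# words per post for each mentee. Column names are hypothetical.
import pandas as pd

posts = pd.DataFrame({
    "mentee_id": [1, 1, 2, 2, 2, 3],
    "post_text": [
        "Thanks for the advice!",
        "I finished the essay section today.",
        "ok",
        "I have a question about the career module.",
        "See you next week.",
        "Done.",
    ],
    "completed_program": [True, True, True, True, True, False],
})

posts["word_count"] = posts["post_text"].str.split().str.len()

engagement = (
    posts.groupby("mentee_id")
         .agg(num_posts=("post_text", "size"),
              avg_words_per_post=("word_count", "mean"),
              completed=("completed_program", "first"))
)
print(engagement)
```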

As the volunteers arrived, they chose tasks, focusing on the high priority tasks first, wrote their name on the sticky notes, and moved the note to the first development column, which was to review the available data.

At different times during the day, DataKind asked each team to review what they had done so far, and what they planned on doing next, similar to the daily standup in Scrum (and we actually did stand).

As the DataDive progressed to day two, only tasks for user stories 1 and 2 progressed across the board, but I reminded the team that some of the tasks we completed for the first two user stories also helped address the third user story. At the end of the DataDive, to better visually show this, I moved some of the sticky notes from user story 1 into the user story 3 swimlane. This way, we could show the business stakeholders from iCouldBe that, although we focused on the higher priority user stories 1 and 2, we had also partially addressed user story 3.

Although this project did not check all the boxes of a standard agile implementation, it served as a great opportunity for me to put some agile practices in motion in a real project and learn from it. One of the most important aspects was the close collaboration between the developers and stakeholders. It was great to see how thrilled the stakeholders were with the work we had accomplished in just two days!

While I wish I could go back in time and do the DataDive all over again, as it was a great personal experience for me, instead I’ll look to the future and apply what I’ve learned from this project to my next agile project.

Elisia Getts is a Sr. Product Manager, Certified Scrum Master (CSM), and member of the Teradata Agile COE. She has been with Teradata for 15 years and has over 25 years of experience in IT as a product manager, business/IT consultant, programmer/analyst, and technical writer supporting industries such as travel and hospitality, transportation and logistics, and defense. She is the team’s expert on Scrum.

Making SAP data relevant in the world of big data

Posted on: May 4th, 2015 by Patrick Teunissen

 

Part one of a series about an old “SAP” dog who learns a new trick

Reflecting back on the key messages from Teradata Universe 2015 in April, it was impossible to escape the theme of deriving differentiated business value by leveraging the latest data sources and analytic techniques. I heard from several customers how they improved their business by combining their traditional warehouse data (or ‘SAP data’ for us old dogs) with other data from across the enterprise and applying advanced analytic techniques. A special interest group dedicated a whole morning to exploring the value of integrating ‘SAP data’ with ‘other data’. As I sat through these sessions, I found it funny that companies that run SAP ERP always speak about their data in terms of SAP data and other data. It made me wonder: what is ‘other data’ and what makes it so special?

In most cases, ‘other data’ is referred to as ‘Big Data’. The term is quite ubiquitous and was used to describe just about every data source. But it’s important to note that, throughout the sessions I attended, none of the companies referred to their SAP data as Big Data. Big Data was a term reserved for the (relatively) new sources of data like machine-generated data from the Internet of Things, call center details, POS-related data, and social media/web logs.

Although not “big”, customers told me they considered their SAP ERP applications to be complex fortresses of data. In comparison to traditional data warehouses or big data stores, SAP data is very difficult to extract and integrate with their ‘other data’. Even SAP acknowledges this shortcoming, as evidenced by their recent programs to ‘Simplify’ their complex applications. But I’d argue that while SAP ERPs may be complex to run, the data that is processed in these applications is quite simple. SAP experts would agree that if you know where to look, the data is both simple and reliable.

Unfortunately these experts live in a world of their own, which is focused entirely on data that flows through SAP. But as evidenced by the customers at Teradata Universe, the lion’s share of new IT projects/business initiatives are focused on leveraging ‘big data’. This means the folks who know SAP are rarely involved in the IT projects involving ‘big data’, and vice versa, which explains the chasm between SAP and ‘other data’. The ‘Big Data’ folks don’t understand the valuable context that SAP brings. And the ‘SAP data’ folks don’t understand the new insights that analytics on the ‘other data’ can deliver.

However, the tides are turning and the general consensus has evolved to agree that there is value in bringing SAP data together with big data. SAP ERP is used primarily for managing the transactional processes in the financial, logistics, manufacturing, and administration functions. This means it houses high quality master data, attribute data, and detailed facts about the business. Combining this structured and reliable data with multi-structured big data can add valuable confidence and context to the analytics that matter most to businesses today!

Here’s a recent example where a customer integrated the results of advanced text analytics with their SAP ERP data within their Teradata warehouse. The data science team was experimenting with a number of Aster machine learning and natural language processing techniques to find meaning and insight in field technician reports. Using one of Aster’s text analytic methods, Latent Dirichlet Allocation, they were able to identify common related word groups within the reports that signaled quality events such as “broken again” or “running as expected”. However, they also discovered unexpected insight regarding equipment suppliers and 3rd party service providers hidden in the field reports, such as “Supplier XYZ is causing problems” or “ABC is easy to work with”. They were then able to integrate all of these relatable word groups with context from the SAP ERP purchasing history data stored in the warehouse to provide additional insight and enrichment to their supplier scores.
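As a rough sketch of the technique described above -- using scikit-learn rather than the Aster function, and with invented report text -- Latent Dirichlet Allocation over a document-term matrix surfaces related word groups like these:

```python
# Small sketch of finding related word groups in field-technician reports with
# Latent Dirichlet Allocation. Illustration only; not the Aster implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reports = [
    "pump broken again, supplier xyz is causing problems",
    "unit running as expected after routine service",
    "valve broken again, replacement part from supplier xyz delayed",
    "system running as expected, abc is easy to work with",
    "compressor failure, supplier xyz is causing problems on site",
    "inspection complete, running as expected",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(reports)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"word group {topic_idx}: {top_terms}")

# The resulting word groups could then be joined back to purchasing history
# (for example on a supplier identifier) to enrich supplier scores.
```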

 

 

High Level Data Analytics Graph
(Healthcare Example)


Michael Porter, in an excellent article in the November 2014 issue of the Harvard Business Review[1], points out that smart connected products are broadening competitive boundaries to encompass related products that meet a broader underlying need. Porter elaborates that the boundary shift is not only from the functionality of discrete products to cross-functionality of product systems, but in many cases expanding to a system of systems such as a smart home or smart city.

So what does all this mean from a data perspective? In that same article, Porter mentions that companies seeking leadership need to invest in capturing, coordinating, and analyzing more extensive data across multiple products and systems (including external information). The key take-away is that the movement toward gaining competitive advantage by searching for cross-functional or cross-system insights from data is only going to accelerate, not slow down. Exploiting the cross-functional or cross-system centrality of data better than anyone else will remain critical to achieving a sustainable competitive advantage.

Understandably, as technology changes, the mechanisms and architecture used to exploit this cross-system centrality of data will evolve. Current technology trends point to a need for a data- and analytic-centric approach that leverages the right tool for the right job and orchestrates these technologies to mask complexity for end users, while also managing complexity for IT in a hybrid environment. (See this article published in Teradata Magazine.)

As businesses embrace the data & analytic-centric approach, the following types of questions will need to be addressed: How can business and IT decide on when to combine which data and to what degree? What should be the degree of data integration (tight, loose, non-coupled)? Where should the data reside and what is the best data modeling approach (full, partial, need based)? What type of analytics should be applied on what data?

Of course, to properly address these questions, an architecture assessment is called for. But for the sake of going beyond the obvious, one exploratory data point in addressing such questions could be to measure and analyze the cross-functional/cross-system centrality of data.

By treating data and analytics as a network of interconnected nodes in Gephi[2], the connectedness between data and analytics can be measured and visualized for such exploration. We can examine a statistical metric called Degree Centrality[3] which is calculated based on how well an analytic node is connected.
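Degree centrality can also be computed outside Gephi. The following sketch uses the NetworkX library on a small, invented healthcare-flavored graph of data sources and analytics to show what the measurement looks like:

```python
# Build a graph whose nodes are data subject areas and analytics, then compute
# degree centrality. The node names are invented healthcare-style examples.
import networkx as nx

g = nx.Graph()
edges = [
    ("claims data", "readmission risk model"),
    ("claims data", "fraud detection"),
    ("claims data", "cost-of-care dashboard"),
    ("clinical notes", "readmission risk model"),
    ("clinical notes", "care gap analysis"),
    ("pharmacy data", "readmission risk model"),
    ("pharmacy data", "cost-of-care dashboard"),
]
g.add_edges_from(edges)

centrality = nx.degree_centrality(g)   # degree / (n - 1), per node
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{node:28s} {score:.2f}")

# Highly connected nodes (here, claims data and the readmission risk model)
# are the cross-functional ones that benefit most from an integrated ecosystem.
```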

The high level sample data analytics graph demonstrates the cross-functional Degree Centrality of analytics from an Industry specific perspective (Healthcare). It also amplifies, from an industry perspective, the need for organizations to build an analytical ecosystem that can easily harness this cross-functional Degree Centrality of data analytics. (Learn more about Teradata’s Unified Data Architecture.)

In the second part of this blog post series we will walk through a zoomed-in view of the graph, analyze the Degree Centrality measurements for sample analytics, and draw some high-level data architecture implications.

[1] https://hbr.org/2014/11/how-smart-connected-products-are-transforming-competition

[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).


Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.

Hybrid Row-Column Stores: A General and Flexible Approach

Posted on: March 10th, 2015 by Daniel Abadi

 

During a recent meeting with a post-doc in my lab at Yale, he reminded me that this summer will mark the 10-year anniversary of the publication of C-Store in VLDB 2005. C-Store was by no means the first ever column-store database system (the column-store idea has been around since the 70s --- nearly as long as relational database systems), but it was quite possibly the first proposed architecture of a column-store designed for petabyte-scale data analysis. The C-Store paper has been extremely influential, with close to every major database vendor developing column-oriented extensions to their core database product in the past 10 years, with most of them citing C-Store (along with other influential systems) in their corresponding research white-papers about their column-oriented features.

Given my history with the C-Store project, I surprised a lot of people when some of my subsequent projects such as HadoopDB/Hadapt did not start with a column-oriented storage system from the beginning. For example, industry analyst Curt Monash repeatedly made fun of me on this topic (see, e.g. http://www.dbms2.com/2012/10/16/hadapt-version-2/).

In truth, my love and passion for column-stores has not diminished since 2005. I still believe that every analytical database system should have a column-oriented storage option. However, it is naïve to think that column-oriented storage is always the right solution. For some workloads --- especially those that scan most rows of a table but only a small subset of the columns --- column-stores are clearly preferable. On the other hand, there are many workloads that contain very selective predicates and require access to the entire tuple for those rows which pass the predicate. For such workloads, row-stores are clearly preferable.

There is thus general consensus in the database industry that a hybrid approach is needed. A database system should have both column-oriented and row-oriented storage options, and the optimal storage can be utilized depending on the expected workload.

Despite this consensus around the general idea of the need for a hybrid approach, there is a glaring lack of consensus about how to implement the hybrid approach. There have been many different proposals for how to build hybrid row/column-oriented database systems in the research and commercial literature. A sample of such proposals include:

(1) A fractured mirrors approach where the same data is replicated twice --- once in a column-oriented storage layer and once in a row-oriented storage layer. For any particular query, data is extracted from the optimal storage layer for that query, and processed by the execution engine.
(2) A column-oriented simulation within a row-store. Let’s say table X contains n columns. X gets replaced by n new tables, where each new table contains two columns --- (1) a row-identifier column and (2) the column values for one of the n columns in the original table. Queries are processed by joining together on the fly the particular set of these two-column tables that correspond to the columns accessed by that query.
(3) A “PAX” approach where each page/block of data contains data for all columns of a table, but data is stored column-by-column within the page/block.
(4) A column-oriented index approach where the base data is stored in a row-store, but column-oriented storage and execution can be achieved through the use of indexes.
(5) A table-oriented hybrid approach where a database administrator (DBA) is given a choice to store each table row-by-row or column-by-column, and the DBA makes a decision based on how they expect the tables to be used.
In the rest of this post, I will overview Teradata’s elegant hybrid row/column-store design and attempt to explain why I believe it is more flexible than the above-mentioned approaches.

The flexibility of Teradata’s approach is characterized by three main contributions.

1: Teradata views the row-store vs. column-store debate as two extremes in a more general storage option space.

The row-store extreme stores each row contiguously on storage and the column-store extreme stores each column contiguously on storage. In other words, row-stores maintain locality of horizontal access of a table, and column-stores maintain locality of vertical access of a table. In general, however, the optimal access locality could be on a rectangular region of a table.


Figure 1: Row and Column Stores (uncompressed)

To understand this idea, take the following example. Many workloads have frequent predicates on date attributes. By partitioning the rows of a table according to date (e.g. one partition per day, week, month, quarter, or year), those queries that contain predicates on date can be accelerated by eliminating all partitions corresponding to dates outside the range of the query, thereby efficiently utilizing I/O to read in data from the table from only those partitions that have data matching the requested data range.

However, different queries may analyze different table attributes for a given date range. For example, one query may examine the total revenue brought in per store in the last quarter, while another query may examine the most popular pairs of widgets bought together in each product category in the last quarter. The optimal storage layout for such queries would be to have store and revenue columns stored together in the same partition, and to have product and product category columns stored together in the same partition. Therefore we want both column-partitions (store and revenue in one partition and product and product category in a different partition) and row-partitions (by date).

This arbitrary partitioning of a table by both rows and columns results in a set of rectangular partitions, each partition containing a subset of rows and columns from the original table. This is far more flexible than a “pure” column-store that enforces that each column be stored in a different physical or virtual partition.

Note that allowing arbitrary rectangular partitioning of a table is a more general approach than a pure column-store or a pure row-store. A column-store is simply a special type of rectangular partitioning where each partition is a long, narrow rectangle around a single column of data. Row-oriented storage can also be simulated with special types of rectangles. Therefore, by supporting arbitrary rectangular partitioning, Teradata is able to support “pure” column-oriented storage, “pure” row-oriented storage, and many other types of storage between these two extremes.
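A toy Python sketch may help make rectangular partitioning concrete. This is not Teradata's storage format; the table, column groups, and quarter-based row partitioning below are invented. Each physical partition holds one column group for one row partition, so a query with a date predicate that touches only store and revenue reads just the matching rectangles:

```python
# Toy illustration of rectangular partitioning: rows are partitioned by a date
# range (here, by quarter) and columns are grouped into column partitions, so
# each physical partition is a "rectangle" of the original table.
rows = [
    {"sale_date": "2015-02-10", "store": "S1", "revenue": 120.0, "product": "widget", "category": "tools"},
    {"sale_date": "2015-05-03", "store": "S2", "revenue": 75.5,  "product": "gizmo",  "category": "toys"},
    {"sale_date": "2015-05-21", "store": "S1", "revenue": 33.0,  "product": "widget", "category": "tools"},
]

COLUMN_GROUPS = {
    "store_revenue":    ("store", "revenue"),
    "product_category": ("product", "category"),
}

def quarter(date_str):
    month = int(date_str[5:7])
    return f"{date_str[:4]}-Q{(month - 1) // 3 + 1}"

# partitions[(row_partition, column_group)] -> list of narrow tuples
partitions = {}
for row in rows:
    for group_name, columns in COLUMN_GROUPS.items():
        key = (quarter(row["sale_date"]), group_name)
        partitions.setdefault(key, []).append(tuple(row[c] for c in columns))

for key, values in sorted(partitions.items()):
    print(key, values)
# A last-quarter revenue-by-store query reads only this rectangle:
print("Q2 store/revenue rectangle:", partitions[("2015-Q2", "store_revenue")])
```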

2: Teradata can physically store each rectangular partition in “row-format” or “column-format.”

One oft-cited advantage of column-stores is that for columns containing fixed-width values, the entire column can be represented as a single array of values. The row id for any particular element in the array can be determined directly by the index of the element within the array. Accessing a column in an array-format can lead to significant performance benefits, including reducing I/O and leveraging the SIMD instruction set on modern CPUs, since expression or predicate evaluation can occur in parallel on multiple array elements at once.

Another oft-cited advantage of column-stores (especially within my own research --- see e.g. http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf ) is that column-stores compress data much better than row-stores because there is more self-similarity (lower entropy) of data within a column than across columns, since each value within a column is drawn from the same attribute domain. Furthermore, it is not uncommon to see the same value repeat multiple times consecutively within a column, in which case the column can be compressed using run-length encoding --- a particularly useful type of compression since it can both result in high compression ratios and also is trivial to operate on directly, without requiring decompression of the data.

Both of these advantages of column-stores are supported in Teradata when the column-format is used for storage within a partition. In particular, multiple values of a column (or a small group of columns) are stored continuously in an array within a Teradata data structure called a “container”. Each container comes with a header indicating the row identifier of the first value within the container, and the row identifiers of every other value in the container can be deduced by adding their relative position within the container to the row identifier of the first value. Each container is automatically compressed using the optimal column-oriented compression format for that data, including run-length encoding the data when possible.


Figure 2: Column-format storage using containers.
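A simplified sketch of the container idea in Python may help. This is an illustration of the concept rather than the actual on-disk format: the container keeps the row identifier of its first value in a header, stores the values run-length encoded, and recovers any value from a row identifier by positional offsetting.

```python
# Simplified sketch of a column "container": header row id plus run-length-
# encoded values; any row id is resolved by its relative position.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

class Container:
    def __init__(self, first_row_id, values):
        self.first_row_id = first_row_id     # header: row id of the first value
        self.runs = rle_encode(values)       # run-length-encoded column values

    def value_at(self, row_id):
        offset = row_id - self.first_row_id  # relative position within container
        for value, length in self.runs:
            if offset < length:
                return value
            offset -= length
        raise KeyError(row_id)

# A column with long runs of repeated values compresses very well.
status_column = ["active"] * 500 + ["closed"] * 300 + ["active"] * 200
container = Container(first_row_id=1000, values=status_column)
print(len(container.runs), "runs for", len(status_column), "values")
print(container.value_at(1499), container.value_at(1700))
```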

However, one disadvantage of not physically storing the row identifier next to each value is that extraction of a value given a row identifier requires more work, since additional calculations must be performed to extract the correct value from the container. In some cases, these additional calculations involve just positional offsetting; however, in some cases, the compressed bits of the container have to be scanned in order to extract the correct value. Therefore Teradata also supports traditional row-format storage within each partition, where the row identifier is explicitly stored alongside any column values associated with that row. When partitions are stored using this “row format”, Teradata’s traditional mechanisms for quickly finding a row given a row identifier can be leveraged.

In general, when the rectangular partitioning scheme results in wide rectangles, row format storage is recommended, since the overhead of storing the row id with each row is amortized across the breadth of the row, and the benefits of array-oriented iteration through the data are minimal. But when the partitioning scheme results in narrow rectangles, column-format storage is recommended, in order to get the most out of column-oriented array iteration and compression. Either way --- having a choice between row format and column format for each partition further improves the flexibility of Teradata’s row/columnar hybrid scheme.

3: Teradata enables traditional primary indexing for quick row-access even when column-oriented partitioning is used.

Many column-stores do not support primary indexes due to the complexity involved in moving around records as a result of new inserts into the index. In contrast, Teradata Database 15.10 supports two types of primary indexing when a table has been partitioned to AMPs (logical servers) by the hash value of the primary index attribute. The first, called CPPI, maintains all row and column partitions on an AMP sorted by the hash value of the primary index attribute. These hash values are stored within the row identifier for the record, which enables each column partition to independently maintain the same sort order without explicitly communicating with each other. Since the data is sorted by the hash of the primary index attribute, finding particular records for a given value of the primary index attribute is extremely fast. The second, called CPPA, does not sort by the hash of the primary index attribute --- therefore the AMP that contains a particular record can be quickly identified given a value of the primary index attribute. However, further searching is necessary within the AMP to find the particular record. This searching is limited to the non-eliminated, nonempty column and row partitions. Finding a particular record given a row id for both CPPI and CPPA is extremely fast since, in either case, the records are in row id order.

Combined, these three features make Teradata’s hybrid solution to the row-store vs. column-store tradeoff extremely general and flexible. In fact, it’s possible to argue that there does not exist a more flexible hybrid solution from a major vendor on the market. Teradata has also developed significant flexibility inside its execution engine --- adapting to column-format vs. row-format input automatically, and using optimal query execution methods depending on the format-type that a particular query reads from.
====================================================================================


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

Data-Driven Design: Smart Modeling in the Fast Lane

Posted on: February 24th, 2015 by Guest Blogger

 

In this blog, I would like to discuss a different way of modeling data, regardless of the method -- Third Normal Form, Dimensional, or Analytical datasets. This new way of data modeling cuts down development cycles by avoiding rework, keeps the team agile, and produces higher quality solutions. It's a discipline that looks at both requirements and data as input into the design.

A lot of organizations have struggled to get the data model correct, especially for applications, and this has a big impact on different phases of the system development lifecycle. Generally, we elicit requirements first, where the IT team and business users together create a business requirements document (BRD).

Business users explain business rules and how source data should be transformed into something they can use and understand. We then create a data model using the BRD and produce technical requirements documentation, which is then used to develop the code. Sometimes it takes us over 9 months before we start looking at the source data. This delay in engaging the data almost always causes rework, since the design was based only on requirements. The other extreme is when a design is based only on data.

We have always based the design solely on either requirements or data, but hardly ever on both. We should give the business users what they want and yet be mindful of the realities of data.

It has been almost impossible to employ both methods, for reasons such as the traditional waterfall method, where BDUF (Big Design Up Front) is produced without ever looking at the data. Another reason is that we do work with data, but the data is created for proof of concept or testing, which is far from the realities of production data. To do this correctly, we need JIT (Just in Time), or good enough, requirements and then need to get into the data quickly and mold our design based on both the requirements and the data.

The idea is to get into the data quickly and validate the business rules and assumptions made by business users. Data-driven design is about engaging the data early. It is more than data profiling, as data-driven design inspects and adapts in the context of the target design. As we model our design, we immediately begin loading data into it, often by day one or two of the sprint. That is the key.

Early in the sprint, data-driven design marries the perspective of the source data to the perspective of the business requirements to identify gaps, transformation needs, quality issues, and opportunities to expand our design. End users generally know about the day to day business but are not aware of the data.

The data-driven design concept can be used whether an organization is practicing a waterfall or agile methodology. It obviously fits very nicely with agile methodologies and Scrum principles such as inspect and adapt. We inspect the data and adapt the design accordingly. Using DDD we can test the coverage and fit of the target schema from the analytical user perspective. By encouraging the design and testing of target schemas using real data in quick, iterative cycles, the development team can ensure that the target schemas designed for implementation have been thoroughly reviewed, tested, and approved by end users before the project build begins.

Case Study: While working with a mega-retailer on one of its projects, I was decomposing business questions. We were working with the promotions and discounts subject area and we had two metrics: Promotion Sales Amount and Commercial Sales Amount. Any item that was sold as part of a promotion is counted towards Promotion Sales and any item that is sold as regular is counted towards Commercial Sales. Please note that Discount Amount and Promotion Sales Amount are two very different metrics. While decomposing, the business user described that each line item within a transaction (header) would have the discount amount evenly proportioned.

For example, let's say there is a promotion where if you buy 3 bottles of wine, you get 2 bottles free. In this case, according to the business user, the discount amount would be evenly proportioned across the 5 line items, thus indicating that these 5 line items are on promotion and we can count the sales of these 5 line items toward Promotion Sales Amount.

This wasn't the case when the team validated this scenario against the data. We discovered that the discount amount was only present for the “get” items and not for the “buy” items. Using our example, the discount amount was provided for the 2 free bottles (get) and not for the 3 bottles (buy). This makes it hard to calculate Promotion Sales Amount for the 3 “buy” items, since it wasn't known whether the customer just bought 3 items or 5 items unless we looked at all the records, which numbered in the millions every day.

What if the customer bought 6 bottles of wine, so that ideally 5 lines are on promotion and the 6th line (diagram above) is commercial, or regular, sales? Looking at the source data, there was no way of knowing which transaction lines were part of a promotion and which weren't.
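A small invented example of the transaction-line data makes the ambiguity obvious: with the discount recorded only on the “get” lines, the “buy” lines are indistinguishable from a regular sale of the same item.

```python
# Invented transaction lines illustrating the data reality described above:
# the discount shows up only on the "get" lines, so the "buy" lines look
# exactly like regular (commercial) sales.
transaction_lines = [
    # buy-3-get-2 wine promotion: only the free bottles carry a discount
    {"txn": 1, "line": 1, "item": "wine", "qty": 1, "discount_amt": 0.00},
    {"txn": 1, "line": 2, "item": "wine", "qty": 1, "discount_amt": 0.00},
    {"txn": 1, "line": 3, "item": "wine", "qty": 1, "discount_amt": 0.00},
    {"txn": 1, "line": 4, "item": "wine", "qty": 1, "discount_amt": 9.99},
    {"txn": 1, "line": 5, "item": "wine", "qty": 1, "discount_amt": 9.99},
    # a genuinely regular sale of the same item
    {"txn": 1, "line": 6, "item": "wine", "qty": 1, "discount_amt": 0.00},
]

# The naive rule "discounted line => promotion sale" finds only the 2 get
# lines; lines 1-3 (the buy items) and line 6 (a regular sale) cannot be told
# apart without the separate buy/get lists requested from the source system.
promo_lines = [l for l in transaction_lines if l["discount_amt"] > 0]
ambiguous   = [l for l in transaction_lines if l["discount_amt"] == 0]
print("clearly promotional lines:", [l["line"] for l in promo_lines])
print("ambiguous lines:", [l["line"] for l in ambiguous])
```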

After this discovery, we had to let the business users know about the inaccuracy in calculating Promotion Sales Amount. Proactively, we designed a new fact to accommodate the reality of the data. The team also discovered more complicated scenarios that the business user hadn't thought of.

In the example above, we had the same item for “buy” and “get”, which was wine. We found a scenario where a customer bought a 6-pack of beer and got a glass free. This further adds to the complexity. After validating the business rules against the source data, we had to request additional data for the “buy” and “get” lists to properly calculate Promotion Sales Amount.

Imagine finding out that you need additional source data to satisfy business requirements nine months into the project. Think about the change requests for the data model, development, testing, etc. With DDD, we found this out within days and adapted to the “data realities” within the same week. The team also discovered that the person at the POS system could either pick up one wine bottle and multiply it by 7, or “beep” each bottle one by one. This inconsistency makes a lot of difference, such as one record versus 7 records in the source feed.

There were other discoveries we made along the way as we got into the data and designed the target schema while keeping the reality of the data in mind. We were also able to ensure that the source system had the right grain available that the business users required.


Sachin Grover leads the Teradata Agile group within Teradata. He has been with Teradata for 5 years and has worked on development of Solution Modeling Building Blocks and helped define best practices for semantic data models on Teradata. He has over 10 years of experience in IT industry as a BI / DW architect, modeler, designer, analyst, developer and tester.