Analytics

Making SAP data relevant in the world of big data

Posted on: May 4th, 2015 by Patrick Teunissen

 

Part one of a series about an old “SAP” dog who learns a new trick

Reflecting on the key messages from Teradata Universe 2015 in April, it was impossible to escape the theme of deriving differentiated business value by leveraging the latest data sources and analytic techniques. I heard from several customers how they improved their business by combining their traditional warehouse data (or ‘SAP data’ for us old dogs) with other data from across the enterprise and applying advanced analytic techniques. A special interest group dedicated a whole morning to exploring the value of integrating ‘SAP data’ with ‘other data’. As I sat through these sessions, I found it funny that companies that run SAP ERP always speak about their data in terms of SAP data and other data. It made me wonder: what is ‘other data’, and what makes it so special?

In most cases, ‘other data’ is referred to as ‘Big Data’. The term is quite ubiquitous and was used to describe just about every data source. But it’s important to note that, throughout the sessions I attended, none of the companies referred to their SAP data as Big Data. Big Data was a term reserved for the (relatively) new sources of data like machine-generated data from the Internet of Things, call center details, POS-related data, and social media/web logs.

Although not “big”, customers told me they considered their SAP ERP applications to be complex fortresses of data. Compared to traditional data warehouses or big data stores, SAP data is very difficult to extract and integrate with their ‘other data’. Even SAP acknowledges this shortcoming, as evidenced by their recent programs to ‘Simplify’ their complex applications. But I’d argue that while SAP ERPs may be complex to run, the data that is processed in these applications is quite simple. SAP experts would agree that if you know where to look, the data is both simple and reliable.

Unfortunately these experts live in a world of their own, focused entirely on data that flows through SAP. But as evidenced by the customers at Teradata Universe, the lion’s share of new IT projects and business initiatives are focused on leveraging ‘big data’. This means the folks who know SAP are rarely involved in the IT projects involving ‘big data’, and vice versa, which explains the chasm between SAP and ‘other data’. The ‘Big Data’ folks don’t understand the valuable context that SAP brings, and the ‘SAP data’ folks don’t understand the new insights that analytics on the ‘other data’ can deliver.

However, the tides are turning, and the general consensus has evolved to agree that there is value in bringing SAP data together with big data. SAP ERP is used primarily for managing the transactional processes in the financial, logistics, manufacturing, and administration functions. This means it houses high-quality master data, attribute data, and detailed facts about the business. Combining this structured and reliable data with multi-structured big data can add valuable confidence and context to the analytics that matter most to businesses today!

Here’s a recent example where a customer integrated the results of advanced text analytics with their SAP ERP data within their Teradata warehouse. The data science team was experimenting with a number of Aster machine learning and natural language processing techniques to find meaning and insight in field technician reports. Using one of Aster’s text analytic methods, Latent Dirichlet Allocation, they were able to identify common related word groups within their reports to identify quality events such as “broken again” or “running as expected”. However they discovered unexpected insight regarding equipment suppliers and 3rd party service providers also hidden in the field reports, such as “Supplier XYZ is causing problems” or “ABC is easy to work with”. They were then able to integrate all of these relatable word groups with context from the SAP ERP purchasing history data stored in the warehouse to provide additional insight and enrichment to their supplier scores.

 

 

Zoomed-in view of Data Analytics Graph
(Healthcare Example)


In the first part of this two-part blog series, I discussed the competitive importance of cross-functional analytics [1]. I also proposed that by treating Data and Analytics as a network of interconnected nodes in Gephi [2], we can examine a statistical metric for analytics called Degree Centrality [3]. In this second part of the series, I will examine parts of the sample Healthcare industry graph animation in detail and draw some high-level conclusions from the Degree Centrality measurement for analytics.

In this sample graph [4], link analysis was performed on a network of 3428 nodes and 8313 directed edges. The majority of the nodes represent either Analytics or Source Data Elements. Many Analytics in this graph tend to require data from multiple source systems, resulting in cross-functional Degree Centrality (connectedness). Some of the Analytics in this study display more Degree Centrality than others.
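As a tiny illustration of the metric itself (with invented nodes and edges, not the study’s actual data), in-degree can be computed directly from an edge list:

```python
# Tiny, invented stand-in for the Gephi study: analytics as nodes whose
# in-degree counts how many source data elements feed them. The real
# graph had 3428 nodes and 8313 directed edges.
from collections import Counter

# Each edge points from a source data element to the analytic consuming it.
edges = [
    ("Claims:Claim Num", "Cost of care PMPM"),
    ("Membership:Agreement Id", "Cost of care PMPM"),
    ("Clinical:Product Id", "Cost of care PMPM"),
    ("Clinical:Product Id", "Surgical error ratio"),
    ("Claims:Claim Num", "Surgical error ratio"),
]

# In-degree per analytic: a higher count signals more cross-functional
# Degree Centrality, since the inputs span multiple source systems.
in_degree = Counter(analytic for _, analytic in edges)
for analytic, degree in in_degree.most_common():
    print(analytic, degree)
```

Gephi computes the same statistic interactively; the point of the snippet is only that Degree Centrality is a simple count over the mapped data-to-analytic edges.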

The zoomed-in visualization starts with a single source system (green) with its data elements (cyan). Basic function specific analytics (red) can be performed with this single Clinical source system data. Even advanced analytics (Text Analysis) can be applied to this single source of data to yield function specific insights.

But data and business never exist in isolation. Usually cross-functional analytics emerge as users look to gain additional value from combining data from various source systems. Notice how these new analytics use data from source systems in multiple functional areas such as Claims and Membership. Such cross-functional data combination, or data coupling, can now be supported at various levels of sophistication. For instance, data can be loosely coupled for analysis with data virtualization, or, if requirements dictate, it can be tightly coupled within a relational Integrated Data Warehouse.

As shown in the graph, even advanced analytics such as Time Series and Naïve Bayes can utilize data from multiple source systems. A data platform that can loosely couple or combine data for such cross-functional advanced analytics can be critical for efficiently discovering insights from new sources of data (see discovery platform). More importantly, as specific advanced analytics are eventually selected for operationalization, a data platform needs to easily integrate results and support easy access regardless of where the advanced analytics are being performed.

Degree Ranking for sample Analytics from the Healthcare Industry Graph

Degree | Analytic Label
3 | How can we reduce manual effort required to evaluate physician notes and medical records in conjunction with billing procedure codes?
10 | How can the number of complaints to Medicare be reduced in an effort to improve the overall STAR rating?
22 | What is the ratio of surgical errors to hospital patients? And total medical or surgical errors? (Provider, Payer)
47 | What providers are active in what networks and products? What is the utilization? In total, by network, by product
83 | What are the trends over time for utilization for patients who use certain channels?
104 | What is the cost of care PMPM? For medical, for pharmacy, combined. How have clinical interventions impacted this cost over time?

The sample analytics listed above demonstrate varying degrees of cross-functional Degree Centrality and should be supported with varying levels of data coupling. This can range from non-coupled data to loosely coupled data to tightly coupled data. As Analytics with cross-functional Degree Centrality cluster together, it may indicate a need to employ tighter data coupling or data integration to drive consistency in the results being obtained. The clustering of Analytics may also indicate an emerging need for a data mart or an extension of the Integrated Data Warehouse that can be utilized by a broader audience.

In-Degree Ranking for sample Data Elements from the Healthcare Industry Graph

In-Degree | Source Element
46 | Accounts Receivable*PROVIDER BILL-Bill Payer Party Id
31 | Clinical*APPLICATION PRODUCT-Product Id
25 | Medical Claims*CLAIM-Claim Num
25 | Membership*MEMBER-Agreement Id

Similarly, if data starts to show high Degree Centrality, it may be an indication to re-assess whether there is a need for tighter coupling to drive consistency and enable broader data reuse. When the In-Degree metric is applied, data being used by more Analytics appears larger on the graph and is a likely candidate for tighter coupling. To support data design for tighter coupling from a cross-functional and even a cross-industry perspective, Teradata offers reference data model blueprints by industry. (See Teradata Data Models.)

This calls for a data management ecosystem with data analytics platforms that can easily harvest this cross-functional Degree Centrality of Analytics and Data. Such a data management ecosystem would support varying degrees of data coupling, varying types of analytics, and varying types of data access based on data users. (Learn more about Teradata’s Unified Data Architecture.)

The analysis described above is exploratory and by no means a replacement for a thorough architectural assessment. Eventually the decision to employ the right degree of data coupling should rest on the full architecture requirements including but not limited to data integrity, security, or business value.

In conclusion, what our experiences have taught us in the past will still hold true for the future:
• Data sources are exponentially more valuable when combined or integrated with other data sets
• To maintain sustained competitive advantage, businesses have to continue to search for insights building on the cross-functional centrality of data
• Unified data management ecosystems can now harvest this cross-functional centrality of data at a lower cost with efficient support for varying levels of data integration, analytic types, and users

Contact Teradata to learn more about how Teradata technology, architecture, and industry expertise can efficiently and effectively harvest this centrality of Data and Analytics.

[1] https://hbr.org/2014/11/how-smart-connected-products-are-transforming-competition

[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).

[4] This specific industry example is illustrative and subject to the limitations of assumptions and quality of the sample data mappings used for this study.


 

 

Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.

--

 

High Level Data Analytics Graph
(Healthcare Example)


Michael Porter, in an excellent article in the November 2014 issue of the Harvard Business Review[1], points out that smart connected products are broadening competitive boundaries to encompass related products that meet a broader underlying need. Porter elaborates that the boundary shift is not only from the functionality of discrete products to cross-functionality of product systems, but in many cases expanding to a system of systems such as a smart home or smart city.

So what does all this mean from a data perspective? In that same article, Porter notes that companies seeking leadership need to invest in capturing, coordinating, and analyzing more extensive data across multiple products and systems (including external information). The key takeaway is that the movement toward gaining competitive advantage by searching for cross-functional or cross-system insights from data is only going to accelerate. Exploiting cross-functional or cross-system centrality of data better than anyone else will remain critical to achieving a sustainable competitive advantage.

Understandably, as technology changes, the mechanisms and architecture used to exploit this cross-system centrality of data will evolve. Current technology trends point to a need for a data & analytic-centric approach that leverages the right tool for the right job and orchestrates these technologies to mask complexity for the end users; while also managing complexity for IT in a hybrid environment. (See this article published in Teradata Magazine.)

As businesses embrace the data & analytic-centric approach, the following types of questions will need to be addressed: How can business and IT decide on when to combine which data and to what degree? What should be the degree of data integration (tight, loose, non-coupled)? Where should the data reside and what is the best data modeling approach (full, partial, need based)? What type of analytics should be applied on what data?

Of course, to properly address these questions, an architecture assessment is called for. But for the sake of going beyond the obvious, one exploratory data point in addressing such questions could be to measure and analyze the cross-functional/cross-system centrality of data.

By treating data and analytics as a network of interconnected nodes in Gephi[2], the connectedness between data and analytics can be measured and visualized for such exploration. We can examine a statistical metric called Degree Centrality[3] which is calculated based on how well an analytic node is connected.

The high-level sample data analytics graph demonstrates the cross-functional Degree Centrality of analytics from an industry-specific perspective (Healthcare). It also underscores, from an industry perspective, the need for organizations to build an analytical ecosystem that can easily harness this cross-functional Degree Centrality of data analytics. (Learn more about Teradata’s Unified Data Architecture.)

In the second part of this blog post series we will walk through a zoomed-in view of the graph, analyze the Degree Centrality measurements for sample analytics, and draw some high-level data architecture implications.

[1] https://hbr.org/2014/11/how-smart-connected-products-are-transforming-competition

[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).


Funnel Analysis: an Approach from the Power Marketer Playbook

Posted on: March 31st, 2015 by Guest Blogger

 

Power marketers are always interested in the most effective ways to track, measure, and analyze customer experiences for more relevant engagement. I’d like to share an approach that is less known yet potentially quite powerful.

Businesses across global markets are re-thinking data, analytics, platforms, and research methods to better understand their customers. Event analytics offers a new view of the customer, leveraging the best technologies and diverse data sources to obtain actionable insights in real time. Traditional methods help us understand consumers in terms of who, what, when, and where. Yet two of the most important questions for understanding consumers, “why” and “how”, go unanswered. Answering them is key to obtaining business value because they explain the why and how of consumers’ interactions with a company.

Traditional approaches focus on how the customer looks to the business. For example, what do you buy? What segments are you in? When was your last visit? However, the more important question should be “how does the business look to the customer?” How do our customers experience our products and brands? How do customers feel at each touch point?

One major advantage of event analytics over traditional methods is that it can improve our understanding of the customer’s view of the business. Traditional systems are not designed to solicit, extract and stitch together customer experience data well. Event analytics obtains information about the entire customer experience in detail, threading together many sources of information from different applications that combine to deliver the full view of customer experience.

To conduct event analytics, businesses need to create a “customer experience universe” that stitches customers’ experiences together, allows for easy behavior pattern recognition and facilitates visualizations of customer behaviors. This universe includes social media, customer experience, marketing channels, mobile apps, and devices. Then, machine learning algorithms are used to run through all the data to identify patterns.

Event analytics is an ecosystem that includes, for example, streaming ingestion of events, an event repository, event metadata, a guided user interface for business analysts, and machine learning algorithms. One category of use cases, called funnel analytics, helps us understand customer behavioral patterns and what triggers their experiences.

Funnel analysis provides visibility across a series of customer experience events that lead towards a defined goal, say, from user engagement in a mobile app to a sale in an eCommerce platform. Funnel analyses are an effective way to calculate conversion rates on specific user behaviors, yet funnel analytics can be complex due to the difficulty in source categorization, visitor identification, pathing, attribution and conversion.
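As a rough sketch of the mechanics (with an invented event log, not any specific product’s API), a funnel computation counts how many visitors complete each stage in order, then derives step-by-step conversion rates:

```python
# Hedged sketch of a funnel computation: count how many visitors reach
# each ordered stage, then derive conversion rates per step. The stage
# names and per-visitor event streams are invented for illustration.
funnel = ["app_open", "product_view", "add_to_cart", "purchase"]

# One event stream per visitor, in time order.
sessions = {
    "v1": ["app_open", "product_view", "add_to_cart", "purchase"],
    "v2": ["app_open", "product_view"],
    "v3": ["app_open", "product_view", "add_to_cart"],
    "v4": ["app_open"],
}

def reached(events, stages):
    """Return how many funnel stages this visitor completed, in order."""
    depth, stream = 0, iter(events)
    for stage in stages:
        if stage not in stream:  # membership test consumes up to the match
            break
        depth += 1
    return depth

counts = [0] * len(funnel)
for events in sessions.values():
    for i in range(reached(events, funnel)):
        counts[i] += 1

for i, stage in enumerate(funnel):
    print(f"{stage}: {counts[i]} visitors ({counts[i] / counts[0]:.0%} of entrants)")
```

Real funnel analytics layers visitor identification, attribution, and pathing on top of this core counting step; the snippet shows only the ordered-stage conversion logic.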

Funnels can be built using a single guided user interface without needing to write code or move data. As a result, event analytics can scale at the speed of business. It is a smart analytic approach because it creates visibility into the path that users are most likely to follow to achieve their goals.

The value of having this insight is of great significance since it gives marketers a deep, data-driven line of sight into the customer experience universe.

James Semenak

James Semenak is a Principal Consultant with Teradata – known as an evangelist and architect for Event Analytics as well as Big Data Analytics and strategies.  James consults in all things related to data and analytics around the internet, and has worked with Shutterfly, Expedia, eBay Enterprise, Charles Schwab, Nokia, eBay, PayPal, Real Networks, Overstock.com, Electronic Arts, and Meredith Corp.

 

 

 

PART FIVE: This is the last blog in my series about Near Real Time data acquisition from SAP. This final blog is co-written with Arno Luijten, who is one of Teradata’s lead engineers. He is instrumental in demystifying the secrets of the elusive SAP clustered and pooled tables.

There is a common misconception that the Pool and Cluster tables in SAP R/3 can only be deciphered by the SAP R/3 application server, giving them an almost mythical status. The phrase that is used all over the ‘help sites’ and forums is “A clustered and a pooled table cannot be read from outside SAP because certain data are clustered and pooled in one field”… which makes replicating these tables pretty pointless – right?

But what exactly are Pooled and Cluster tables in SAP R/3 anyway? We thought we would let SAP give us the answer and searched their public help pages (SAP Help Portal). That yielded limited results, so we looked further (i.e., Googled ‘cluster tables’) and found the following explanation (Technopedia link):

“Cluster tables are special types of tables present in the SAP data dictionary. They are logical tables maintained as records of the normal SAP tables, which are commonly known as transparent tables. A key advantage of using cluster tables is that data is stored in a compressed format, reducing memory space and the landscape network load for retrieving information from these tables.”

Reading further on the same page, there are six major bullet points describing the features, five of which basically tell you that what we did cannot be done. Luckily, we didn’t let this faze us!

We agree: the purpose of SAP cluster tables is to save space, given the huge volume of data contained in these tables and the potential negative impact this may have on the SAP R/3 application. We know this because the two most (in)famously large cluster tables are RFBLG and KOCLU, which contain the financial transactions and price conditions. SAP’s ABAP programmers often refer to BSEG (for financial transactions) and KONV (for the price conditions).

From the database point of view, these tables do not exist but are contained in the physical tables named RFBLG and KOCLU. Typically these (ABAP) tables contain a lot of data. There are more tables set up in this way, but from a data warehousing point of view these two tables are probably the most relevant. Simply skipping these tables would not be a viable option for most clients.

Knowing the importance of the Pool and Cluster tables, the value of data replication, and the value of operational analytics, we forged ahead with a solution. The encoded data from the ABAP table is stored as a BLOB (Binary Large Object) in the actual cluster table. To decode the data in the BLOB, we wrote a C++ program as a Teradata User Defined Function (UDF), which we call the “Decoder”; it is installed directly within the Teradata database.

There can be a huge volume of data present in the cluster tables (hence the usage of cluster logic) and as a result decoding can be a lot of work and can have an impact on the performance of the SAP system. Here we have an extra advantage over SAP R/3 because the Decoder effectively allows us to bypass the ABAP layer and use the power of the Teradata Database. Our MPP capabilities allow decoding to be done massively faster than the SAP application, so decoding the RFBLG/KOCLU tables in Teradata can save a lot of time.

Over the last few months I have written about data replication, starting with a brief SAP history; I questioned real-time systems; and I covered the benefits of data replication and how it is disruptive to analytics for SAP R/3.

In my last blog I looked at the technical complexities we have had to overcome to build a complete data replication solution into Teradata Analytics for SAP® Solutions. It has not been a trivial exercise - but the benefits are huge!

Our Data Replication capability enables operational reporting and managerial analytics from the same source; it increases flexibility, significantly reduces the burden on the SAP R/3 system(s), and of course, delivers SAP data in near-real time for analytics.

Hybrid Row-Column Stores: A General and Flexible Approach

Posted on: March 10th, 2015 by Daniel Abadi

 

During a recent meeting with a post-doc in my lab at Yale, he reminded me that this summer will mark the 10-year anniversary of the publication of C-Store in VLDB 2005. C-Store was by no means the first ever column-store database system (the column-store idea has been around since the 70s --- nearly as long as relational database systems), but it was quite possibly the first proposed architecture of a column-store designed for petabyte-scale data analysis. The C-Store paper has been extremely influential, with close to every major database vendor developing column-oriented extensions to their core database product in the past 10 years, with most of them citing C-Store (along with other influential systems) in their corresponding research white-papers about their column-oriented features.

Given my history with the C-Store project, I surprised a lot of people when some of my subsequent projects such as HadoopDB/Hadapt did not start with a column-oriented storage system from the beginning. For example, industry analyst Curt Monash repeatedly made fun of me on this topic (see, e.g. http://www.dbms2.com/2012/10/16/hadapt-version-2/).

In truth, my love and passion for column-stores has not diminished since 2005. I still believe that every analytical database system should have a column-oriented storage option. However, it is naïve to think that column-oriented storage is always the right solution. For some workloads --- especially those that scan most rows of a table but only a small subset of the columns --- column-stores are clearly preferable. On the other hand, there are many workloads that contain very selective predicates and require access to the entire tuple for those rows that pass the predicate. For such workloads, row-stores are clearly preferable.

There is thus general consensus in the database industry that a hybrid approach is needed. A database system should have both column-oriented and row-oriented storage options, and the optimal storage can be utilized depending on the expected workload.

Despite this consensus around the general idea of the need for a hybrid approach, there is a glaring lack of consensus about how to implement the hybrid approach. There have been many different proposals for how to build hybrid row/column-oriented database systems in the research and commercial literature. A sample of such proposals include:

(1) A fractured mirrors approach where the same data is replicated twice --- once in a column-oriented storage layer and once in a row-oriented storage layer. For any particular query, data is extracted from the optimal storage layer for that query, and processed by the execution engine.
(2) A column-oriented simulation within a row-store. Let’s say table X contains n columns. X gets replaced by n new tables, where each new table contains two columns --- (1) a row-identifier column and (2) the column values for one of the n columns in the original table. Queries are processed by joining together on the fly the particular set of these two-column tables that correspond to the columns accessed by that query.
(3) A “PAX” approach where each page/block of data contains data for all columns of a table, but data is stored column-by-column within the page/block.
(4) A column-oriented index approach where the base data is stored in a row-store, but column-oriented storage and execution can be achieved through the use of indexes.
(5) A table-oriented hybrid approach where a database administrator (DBA) is given a choice to store each table row-by-row or column-by-column, and the DBA makes a decision based on how they expect the tables to be used.
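Approach (2) above can be sketched in a few lines; this is a toy illustration with invented data, not any vendor’s implementation:

```python
# Sketch of approach (2): simulating a column-store inside a row-store.
# Table X(id, color, qty) is replaced by one two-column "table" per
# attribute, each keyed by a row identifier; a query touching a subset of
# columns scans and joins only those tables. Data is invented.
rows = [(1, "red", 10), (2, "blue", 20), (3, "red", 30)]

# Vertical decomposition: one {row_id: value} mapping per column.
color_tab = {rid: color for rid, color, _ in rows}
qty_tab = {rid: qty for rid, _, qty in rows}

# "SELECT id, qty FROM X WHERE color = 'red'": scan only the color
# table, then join matching row ids against the qty table on the fly.
result = [(rid, qty_tab[rid]) for rid, color in color_tab.items() if color == "red"]
print(result)
```

The simulation gains column-scan locality for selective projections, but every multi-column query pays the cost of reassembling tuples via these row-id joins, which is one reason the later approaches were proposed.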
In the rest of this post, I will overview Teradata’s elegant hybrid row/column-store design and attempt to explain why I believe it is more flexible than the above-mentioned approaches.

The flexibility of Teradata’s approach is characterized by three main contributions.

1: Teradata views the row-store vs. column-store debate as two extremes in a more general storage option space.

The row-store extreme stores each row contiguously on storage and the column-store extreme stores each column contiguously on storage. In other words, row-stores maintain locality of horizontal access of a table, and column-stores maintain locality of vertical access of a table. In general, however, the optimal access locality could be on a rectangular region of a table.


Figure 1: Row and Column Stores (uncompressed)

To understand this idea, take the following example. Many workloads have frequent predicates on date attributes. By partitioning the rows of a table according to date (e.g. one partition per day, week, month, quarter, or year), queries that contain predicates on date can be accelerated by eliminating all partitions corresponding to dates outside the range of the query, thereby utilizing I/O efficiently: only those partitions that have data matching the requested date range are read.

However, different queries may analyze different table attributes for a given date range. For example, one query may examine the total revenue brought in per store in the last quarter, while another query may examine the most popular pairs of widgets bought together in each product category in the last quarter. The optimal storage layout for such queries would be to have store and revenue columns stored together in the same partition, and to have product and product category columns stored together in the same partition. Therefore we want both column-partitions (store and revenue in one partition and product and product category in a different partition) and row-partitions (by date).
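Under stated assumptions (an invented sales table and an invented column grouping), this sketch shows how row partitioning by date and column grouping combine into rectangular partitions:

```python
# Illustrative sketch of rectangular partitioning: rows are split by
# quarter and columns are split into two groups, yielding rectangular
# partitions. The table, column groups, and data are all invented.
sales = [
    # (date, store, revenue, product, category)
    ("2015-01-15", "S1", 100, "widget-a", "widgets"),
    ("2015-02-03", "S2", 250, "widget-b", "widgets"),
    ("2015-04-20", "S1", 175, "gadget-x", "gadgets"),
]

# Column grouping: store/revenue together, product/category together.
col_groups = {"sales_cols": (1, 2), "product_cols": (3, 4)}

def quarter(date):
    """Map an ISO date string to its calendar quarter (1..4)."""
    return (int(date[5:7]) - 1) // 3 + 1

# partitions[(quarter, column_group)] -> list of (row_id, projected values)
partitions = {}
for row_id, row in enumerate(sales):
    for group, idxs in col_groups.items():
        key = (quarter(row[0]), group)
        partitions.setdefault(key, []).append((row_id, tuple(row[i] for i in idxs)))

# A "revenue per store last quarter" query reads exactly one rectangle:
# the Q1 rows crossed with the store/revenue columns.
print(partitions[(1, "sales_cols")])
```

Each key identifies one rectangle of the original table; a pure column-store or pure row-store is just a degenerate choice of these rectangles, which is the generality argument made above.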

This arbitrary partitioning of a table by both rows and columns results in a set of rectangular partitions, each partition containing a subset of rows and columns from the original table. This is far more flexible than a “pure” column-store that enforces that each column be stored in a different physical or virtual partition.

Note that allowing arbitrary rectangular partitioning of a table is a more general approach than a pure column-store or a pure row-store. A column-store is simply a special type of rectangular partitioning where each partition is a long, narrow rectangle around a single column of data. Row-oriented storage can also be simulated with special types of rectangles. Therefore, by supporting arbitrary rectangular partitioning, Teradata is able to support “pure” column-oriented storage, “pure” row-oriented storage, and many other types of storage between these two extremes.

2: Teradata can physically store each rectangular partition in “row-format” or “column-format.”

One oft-cited advantage of column-stores is that for columns containing fixed-width values, the entire column can be represented as a single array of values. The row id for any particular element in the array can be determined directly by the index of the element within the array. Accessing a column in an array-format can lead to significant performance benefits, including reducing I/O and leveraging the SIMD instruction set on modern CPUs, since expression or predicate evaluation can occur in parallel on multiple array elements at once.
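A rough illustration of the array advantage, using NumPy as a stand-in for a column engine’s vectorized evaluation (the column data is invented):

```python
# Sketch of the array representation: a fixed-width column held as one
# array lets a predicate run element-wise over the whole column, and a
# matching row id is simply the element's index. Data is invented.
import numpy as np

revenue = np.array([100, 250, 175, 90, 310])  # column values; row id = index

# Evaluate "revenue > 150" across the whole column at once; NumPy (like a
# column engine exploiting SIMD) applies the predicate without a per-row loop.
matching_row_ids = np.nonzero(revenue > 150)[0]
print(matching_row_ids)
```

The analogy is loose (NumPy is an in-memory library, not a storage engine), but it captures why array-format columns enable both reduced I/O and data-parallel predicate evaluation.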

Another oft-cited advantage of column-stores (especially within my own research --- see e.g. http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf ) is that column-stores compress data much better than row-stores because there is more self-similarity (lower entropy) of data within a column than across columns, since each value within a column is drawn from the same attribute domain. Furthermore, it is not uncommon to see the same value repeat multiple times consecutively within a column, in which case the column can be compressed using run-length encoding --- a particularly useful type of compression since it can both result in high compression ratios and also is trivial to operate on directly, without requiring decompression of the data.

Both of these advantages of column-stores are supported in Teradata when the column-format is used for storage within a partition. In particular, multiple values of a column (or a small group of columns) are stored contiguously in an array within a Teradata data structure called a “container”. Each container comes with a header indicating the row identifier of the first value within the container, and the row identifiers of every other value in the container can be deduced by adding their relative position within the container to the row identifier of the first value. Each container is automatically compressed using the optimal column-oriented compression format for that data, including run-length encoding the data when possible.
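The container layout described above can be sketched as follows; this is an illustrative model with invented field names, not Teradata’s actual on-disk format:

```python
# Hedged sketch of a "container": column values stored run-length encoded,
# with a header carrying only the first row id; every other row id is
# derived from position rather than stored. Names and data are invented.
def rle_encode(values):
    """Compress a value list into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

column_slice = ["NL", "NL", "NL", "US", "US", "DE"]
container = {"first_row_id": 1000, "runs": rle_encode(column_slice)}

def value_at(container, row_id):
    """Scan the compressed runs to find the value for an absolute row id."""
    offset = row_id - container["first_row_id"]
    for value, length in container["runs"]:
        if offset < length:
            return value
        offset -= length
    raise KeyError(row_id)

print(container["runs"])          # compact: one entry per run, not per row
print(value_at(container, 1004))  # row id resolved by positional arithmetic
```

Note how the lookup must walk the compressed runs rather than jump directly to an offset; that is exactly the extraction cost discussed next, which motivates also offering row-format storage.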


Figure 2: Column-format storage using containers.

However, one disadvantage of not physically storing the row identifier next to each value is that extracting a value given a row identifier requires more work, since additional calculations must be performed to locate the correct value within the container. In some cases these additional calculations involve just positional offsetting; in others, the compressed bits of the container have to be scanned in order to extract the correct value. Therefore, Teradata also supports traditional row-format storage within each partition, where the row identifier is explicitly stored alongside any column values associated with that row. When partitions are stored using this “row format”, Teradata’s traditional mechanisms for quickly finding a row given a row identifier can be leveraged.
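The asymmetry in lookup cost can be sketched like this (illustrative Python; the dictionary and run list are stand-ins for the two storage formats, not Teradata structures):

```python
# Row format: the row identifier is stored with each value,
# so a lookup by row id is direct.
row_format = {1000: "a", 1001: "b", 1002: "b", 1003: "c"}
value_row = row_format[1002]

# Column format with run-length-encoded values: the compressed runs
# must be scanned to find which run covers the requested row id.
def rle_lookup(first_row_id, runs, row_id):
    offset = row_id - first_row_id
    for value, length in runs:
        if offset < length:
            return value
        offset -= length
    raise KeyError(row_id)

value_col = rle_lookup(1000, [("a", 1), ("b", 2), ("c", 1)], 1002)
```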

In general, when the rectangular partitioning scheme results in wide rectangles, row format storage is recommended, since the overhead of storing the row id with each row is amortized across the breadth of the row, and the benefits of array-oriented iteration through the data are minimal. But when the partitioning scheme results in narrow rectangles, column-format storage is recommended, in order to get the most out of column-oriented array iteration and compression. Either way --- having a choice between row format and column format for each partition further improves the flexibility of Teradata’s row/columnar hybrid scheme.

3: Teradata enables traditional primary indexing for quick row-access even when column-oriented partitioning is used.

Many column-stores do not support primary indexes due to the complexity involved in moving records around as a result of new inserts into the index. In contrast, Teradata Database 15.10 supports two types of primary indexing when a table has been partitioned to AMPs (logical servers) by the hash value of the primary index attribute. The first, called CPPI, maintains all row and column partitions on an AMP sorted by the hash value of the primary index attribute. These hash values are stored within the row identifier for the record, which enables each column partition to maintain the same sort order independently, without the partitions explicitly communicating with each other. Since the data is sorted by the hash of the primary index attribute, finding particular records for a given value of the primary index attribute is extremely fast. The second, called CPPA, does not sort by the hash of the primary index attribute. The AMP that contains a particular record can still be quickly identified given a value of the primary index attribute, but further searching is necessary within the AMP to find the particular record. This searching is limited to the non-eliminated, nonempty column and row partitions. Finding a particular record given a row id is extremely fast for both CPPI and CPPA since, in either case, the records are kept in row id order.
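A toy sketch of the two ideas (Python; `row_hash`, the AMP count, and the key values are made up, standing in for Teradata's actual hash function and configuration):

```python
import bisect
import zlib

def row_hash(key):
    # Stand-in for Teradata's hash function (illustrative only).
    return zlib.crc32(str(key).encode())

def amp_for(key, n_amps):
    # The hash of the primary index value determines the owning AMP.
    return row_hash(key) % n_amps

# CPPI: rows on an AMP are kept sorted by the hash of the primary
# index, so a lookup by primary index value is a binary search.
keys = [101, 57, 23, 88]
sorted_hashes = sorted(row_hash(k) for k in keys)

def contains(key):
    h = row_hash(key)
    i = bisect.bisect_left(sorted_hashes, h)
    return i < len(sorted_hashes) and sorted_hashes[i] == h
```

Under CPPA, by contrast, only the `amp_for` step applies; within the AMP the non-eliminated partitions must still be searched.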

Combined, these three features make Teradata’s hybrid solution to the row-store vs. column-store tradeoff extremely general and flexible. In fact, it’s possible to argue that there does not exist a more flexible hybrid solution from a major vendor on the market. Teradata has also developed significant flexibility inside its execution engine --- adapting to column-format vs. row-format input automatically, and using optimal query execution methods depending on the format-type that a particular query reads from.
====================================================================================


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

Foresight Tops Hindsight: Data-Driven Decisions Beat Gut

Posted on: March 9th, 2015 by Tho Nguyen 1 Comment

 

Many decisions we make every day are not data-driven but are made by our gut instinct. That’s fine when those decisions are about what to wear, what to eat, where to go on vacation, and how much to spend on an item. But what about when your choices can impact the company’s bottom line, or even its ability to survive? Companies that report they are data-driven also report they perform better than those that rely on gut instinct.

Today’s fast-paced and competitive business world demands that we all be data-driven and not rely on gut instinct. Yet some companies continue to rely on gut instinct instead of data when it comes to making critical decisions.

I recently presented an International Institute of Analytics webinar, “Making the Best Decisions Possible with Enterprise Decision Management,” with independent industry consultant James Taylor and SAS business analytics expert Fiona McNeill.

We asked participants whether their business decisions are driven by data or by gut instinct.

Our findings were:

  • 1 out of 10 make decisions driven by gut instinct
  • 2 out of 10 make decisions driven by facts from data
  • 7 out of 10 make decisions driven by gut and data

Are you among the one in ten still making decisions by gut instinct alone?

As businesses become more targeted, personalized and public, it is critical to make precise, data-driven decisions for regulatory compliance and risk management. As data volume, velocity and variety continue to grow, it becomes harder to capture, integrate and analyze the data and to capitalize on the opportunities the data uncovers. Gut instinct is proving insufficient in today’s high-volume, fast-moving, critical decision-making environment.

Increasingly, companies that are moving to become data-driven are adopting an in-database approach to data analysis. This approach moves the decision calculations inside the database, where the data already resides, so you can exploit the resources and power of the database while minimizing data movement. Minimizing data movement dramatically streamlines the decision process.
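The contrast can be sketched with SQLite standing in for the warehouse (illustrative only; a real deployment would push far heavier calculations, such as model scoring, into the database):

```python
import sqlite3

# A toy scores table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (customer_id INTEGER, score REAL)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [(1, 0.9), (2, 0.4), (3, 0.7)])

# Moving the data: fetch every row, then compute in the application.
rows = conn.execute("SELECT score FROM scores").fetchall()
avg_moved = sum(s for (s,) in rows) / len(rows)

# In-database: the calculation runs where the data lives, and only
# the single aggregated result leaves the database.
(avg_in_db,) = conn.execute("SELECT AVG(score) FROM scores").fetchone()
```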

In-database processing of data analysis is not new. Many companies across a broad range of industries have already implemented this technique. For example, one retail bank was able to run its scoring code in 16 seconds, compared with 79 minutes using the traditional method. Other companies and organizations are also using in-database processing and realizing the benefits of strategic discovery. It was not surprising to me that six in 10 of the webinar attendees reported that they consider in-database processing a way to get answers faster.

Are your gut-driven decisions not delivering the results they used to? Maybe it is time to give your gut instinct a rest and rely on data-driven facts to make sound decisions that affect your business outcomes.

Check out the replay of the webinar I presented with James Taylor and Fiona McNeill for the International Institute of Analytics. The webinar describes how you can use data to drive consistently right decisions and gain competitive advantage. If you want to improve performance, economics and governance within your organization, view the webcast.

- Tho Nguyen, Teradata

Data-Driven Design: Smart Modeling in the Fast Lane

Posted on: February 24th, 2015 by Guest Blogger 2 Comments

 

In this blog, I would like to discuss a different way of modeling data, regardless of the method, whether Third Normal Form, dimensional models, or analytical datasets. This way of data modeling shortens development cycles by avoiding rework, supports agility, and produces higher-quality solutions. It is a discipline that treats both requirements and data as input to the design.

A lot of organizations have struggled to get the data model right, especially for applications, and this has a big impact on different phases of the system development lifecycle. Generally, we elicit requirements first, where the IT team and business users together create a business requirements document (BRD).

Business users explain business rules and how source data should be transformed into something they can use and understand. We then create a data model from the BRD and produce a technical requirements document, which is then used to develop the code. Sometimes it takes us over nine months before we start looking at the source data. This delay in engaging the data almost always causes rework, since the design was based only on requirements. The other extreme is a design based only on data.

We have almost always based the design solely on requirements or solely on data, hardly ever on both. We should give the business users what they want while being mindful of the realities of the data.

It has been almost impossible to employ both methods, for reasons such as the traditional waterfall method, where BDUF (Big Design Up Front) is produced without ever looking at the data. In other cases we do work with data, but with data created for a proof of concept or for testing, which is far from the realities of production data. To do this correctly, we need JIT (Just in Time), or good enough, requirements, and then we need to get into the data quickly and mold our design based on both the requirements and the data.

The idea is to get into the data quickly and validate the business rules and assumptions made by business users. Data-driven design is about engaging the data early. It is more than data profiling, as data-driven design inspects and adapts in context of the target design. As we model our design, we immediately begin loading data into it, often by day one or two of the sprint. That is the key.

Early in the sprint, data-driven design marries the perspective of the source data to the perspective of the business requirements to identify gaps, transformation needs, quality issues, and opportunities to expand our design. End users generally know about the day to day business but are not aware of the data.

The data-driven design concept can be used whether an organization practices waterfall or agile methodology. It obviously fits very nicely with agile methodologies and Scrum principles such as inspect and adapt: we inspect the data and adapt the design accordingly. Using data-driven design (DDD) we can test the coverage and fit of the target schema from the analytical user perspective. By encouraging the design and testing of the target schema with real data in quick, iterative cycles, the development team can ensure that the target schema designed for implementation has been thoroughly reviewed, tested and approved by end users before the project build begins.

Case Study: While working with a mega-retailer, in one of the projects I was decomposing business questions. We were working with promotions and discounts subject area and we had two metrics: Promotion Sales Amount and Commercial Sales Amount. Any item that was sold as part of a promotion is counted towards Promotion Sales and any item that is sold as regular is counted towards Commercial Sales. Please note that Discount Amount and Promotion Sales Amount are two very different metrics. While decomposing, the business user described that each line item within a transaction (header) would have the discount amount evenly proportioned.

For example – Let’s say there is a promotion where if you buy 3 bottles of wine then you get 2 bottles free. In this case, according to the business user, the discount amount would be evenly proportioned across the 5 line items - thus indicating that these 5 line items are on promotion and we can count the sales of these 5 line items toward Promotion Sales Amount.

This wasn’t the case when the team validated this scenario against the data. We discovered that the discount amount was only present for the “get” items and not for the “buy” items. Using our example, discount amount was provided for the 2 free bottles (get) and not for 3 bottles (buy). This makes it hard to calculate Promotion Sales Amount for the 3 “buy” items since it wasn’t known if the customer just bought 3 items or 5 items unless we looked at all the records, which was in millions every day.

What if the customer bought 6 bottles of wine so ideally 5 lines are on promotion and the 6th line (diagram above) is commercial sales or regular sales? Looking at the source data there was no way of knowing which transaction lines are part of promotion and which aren’t.
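A minimal sketch of the problem, using made-up line items for one transaction (buy 3 bottles, get 2 free, plus one regular bottle):

```python
# Discount appears only on the "get" lines, as in the source feed the
# team inspected; the "buy" lines carry no discount at all.
line_items = [
    {"item": "wine", "discount": 0.0},  # buy
    {"item": "wine", "discount": 0.0},  # buy
    {"item": "wine", "discount": 0.0},  # buy
    {"item": "wine", "discount": 5.0},  # get (free)
    {"item": "wine", "discount": 5.0},  # get (free)
    {"item": "wine", "discount": 0.0},  # regular sale
]

# Flagging promotion lines by "discount > 0" finds only the two "get"
# items; the three "buy" lines look identical to the regular sale.
promo_lines = [li for li in line_items if li["discount"] > 0]
```

Five of the six lines are actually on promotion, but the data alone identifies only two, which is why the business rule could not be validated.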

After this discovery, we had to let the business users know about the inaccuracy for calculating Promotion Sales Amount. Proactively, we designed a new fact to accommodate for the reality of data. There were more complicated scenarios that the team discovered that the business user hadn’t thought of.

In the example above, we had the same item, wine, for both “buy” and “get”. We then found a scenario where a customer bought a six-pack of beer and got a glass free. This further adds to the complexity. After validating the business rules against the source data, we had to request additional data identifying the “buy” and “get” lists to properly calculate Promotion Sales Amount.

Imagine finding out nine months into a project that you need additional source data to satisfy the business requirements. Think about the change requests for the data model, development, testing, and so on. With DDD, we found this out within days and adapted to the “data realities” within the same week. The team also discovered that the person at the POS system could either scan one wine bottle and multiply the quantity by 7, or “beep” each bottle one by one. This inconsistency makes a big difference: one record versus seven records in the source feed.

There were other discoveries we made along the way as we got into the data and designed the target schema while keeping the reality of the data in mind. We were also able to ensure that the source system has the right available grain that the business users required.


Sachin Grover leads the Agile group within Teradata. He has been with Teradata for 5 years, has worked on the development of Solution Modeling Building Blocks, and has helped define best practices for semantic data models on Teradata. He has over 10 years of experience in the IT industry as a BI/DW architect, modeler, designer, analyst, developer and tester.

Lots of Big Data Talk, Little Big Data Action

Posted on: February 11th, 2015 by Manan Goel No Comments

 

Apps Are One Solution To Big Data Complexity

Offering big data apps is a great way for the analytics industry to put its muscle where its mouth is. Organizations face great hurdles in trying to benefit from the opportunities of big data, and extracting rapid value from it remains challenging.

Limited skill sets and complexity make it challenging for analytic professionals to rapidly and consistently derive actionable insights that can be easily operationalized. To ease companies into realizing bankable big data benefits, Teradata has developed a collection of big data apps – pre-built templates that act as time-saving shortcuts to value. Teradata is taking the lead in offering advanced analytic apps, powered by the Teradata Aster AppCenter, that deliver sophisticated results from big data analytics.

The big data apps from Teradata are industry-tailored analytical templates that address business challenges specific to each industry. Purpose-built apps for retail address path to purchase and shopping cart abandonment. Apps for healthcare map paths to surgery and drug prescription affinity. Financial apps tackle omni-channel customer experiences and fraud. The industries covered include consumer finance, entertainment and gaming, healthcare, manufacturing, retail, communications, and travel and hospitality.

Big data apps are pre-built templates that can be further configured with help from Teradata professional services to address specific customer needs or goals. Organizations have found that specialized big data skills like Python, R, Java and MapReduce take time to acquire and require highly specialized manpower. Conversely, apps deliver fast time to value with self-service analytics. The purpose-built apps can be quickly deployed and configured or customized with minimal effort to deliver swift analytic value.

For app distribution, consumption and custom app development, the AppCenter makes big data analytics secure, scalable and repeatable by providing common services to build, deploy and consume apps.

With the apps and related solutions like AppCenter from Teradata, analytic professionals spend less time preparing data and more time doing discovery and iteration to find new insights and value.

Get more big data insights now!

 

 

Teradata Aster AppCenter: Reduce the Chasm of Data Science

Posted on: February 11th, 2015 by John Thuma No Comments

 

Data scientists are doing amazing things with data and analytics. The data surface area is exploding, with new data sources being invented and exploited almost daily. The Internet of Things is no longer just theory; it is being realized in practice. Tools and technology are making it easier for data scientists to develop solutions that impact organizations. Rapid-fire methods for predicting churn, providing a personalized next best offer, or predicting part failures are just some of the new insights being developed across a variety of industries.

But challenges remain. Data science has a language and technique all its own. Strange terms like machine learning, Naïve Bayes, and support vector machines are creeping into our organizations. These topics can be very difficult to understand if you are not trained in them or have not spent time learning them.

There is a chasm between business and data science. Closing this gap and operationalizing big data analytics is paramount to the success of all data science efforts. We must democratize big data discovery and enable anyone to participate. The Teradata Aster AppCenter is a big step forward in bridging the gap between data science and the rest of us: it makes big data analytics consumable by the masses.

Over the past two years I have personally worked on projects with organizations spanning various vertical industries. I have engaged with hundreds of people across retail, insurance, government, pharmaceuticals, manufacturing, and others. The one question they all ask is: “John, I have people who can develop solutions with Aster; how do I integrate these solutions into my organization? How can other people use these insights?” Great questions!

I didn’t have an easy answer, but now I do. The Teradata Aster AppCenter provides a simple-to-use, point-and-click web interface for consuming big data insights. It wraps all the complexity of the great work that data scientists do behind a simple interface that anyone can use. It allows business people to have a conversation with their data like never before. Data scientists love it because it gives them a tool to showcase their solutions and their hard work.

Just the other day I deployed my first application in the Teradata Aster AppCenter. I had never built one before, nor did I have any training or a phone-a-friend option. I didn’t want training, either, because I am a technology skeptic: technology has to be easy to use. So I put it to the test, and here is what I found.

The interface is intuitive, and I had a simple application deployed in 20 minutes. Another 20 minutes went by, and I had three visualization options embedded in my app. I then constructed a custom user interface with drop-down menus to make the application more flexible and interactive. In that hour I built an application that anyone can use without writing a single line of code or being a technical unicorn. I was blown away by the simplicity and power. I am now able to deploy Teradata Aster solutions in minutes and publish them to the masses. The Teradata Aster AppCenter reduces the chasm between data science and the rest of us.

In conclusion, the Teradata Aster AppCenter passed my tests. Please don’t take my word for it; try it out. We also have an abundance of videos, training materials, and templates on the way to guide your experience. I am really looking forward to seeing new solutions developed and watching the evolution of the platform. The Teradata Aster AppCenter gives data science a voice and a platform for next-generation analytic consumption.