Data Management


One way to look at progress in technology is to recognize that each new generation provides a better version of what we’ve always wanted. If you look back at the claims for Hollerith punch card-based computing or the first generation of IBM mainframes, you find that the language is recognizable and can be found in marketing material for modern technology.

This year’s model of technology (and those from 50 or 100 years ago) will provide more efficiency, transparency, automation, and productivity. Yeehaw! I can’t wait. Oh, by the way, the current generation of big data technology will provide the same thing.

And, in fact, every generation of technology has fulfilled these enduring promises, improving on what was achieved in the past. What is important to understand is how. It is often the case that in emphasizing the “new newness” of what is coming down the pike, we forget about essential elements of value in the generation of technology that is being surpassed.

This pattern is alive and well in the current transformation taking place in the world of IT related to the arrival of big data technology, which is changing so many things for the better. The problem is that exaggerations about one aspect of what is new about big data processing, “schema on read” — the ability to add structure at the last minute — are obscuring the need for a process to design and communicate a standard structure for your data, which is called “schema on write.”

Here’s the problem in a nutshell:
• In the past, the entire structure of a database was designed at the beginning of a project. The questions that needed to be answered determined the data that needed to be provided, and well-understood methods were created to model that data, that is, to provide structure so that the questions could be answered. The idea of “schema on write” is that you couldn’t really store the data until you had determined its structure.
• The world of relational database technology and the SQL language was used to answer the questions, which was a huge improvement over having to write a custom program to process each query.
• But as time passed, more data arrived and more questions needed to be answered. It became challenging to manage and change the model in an orderly fashion. People wanted to use new data and answer new questions faster than they could by waiting to get the model changed.

Okay, let’s stop and look at the good and the bad so far. The good is that structure allowed data to be used more efficiently. The more people who used the structure, the more value it created. So, when you have thousands of users asking questions and getting answers from thousands of tables, everything is super great. Taking the time to manage the structure and get it right is worth it. Schema on write is, after all, what drives business fundamentals, such as finance.

But the world is changing fast and new data is arriving all the time, which is not the strength of schema on write. If a department wants to use a new dataset, staff can’t wait for a long process where the central model is changed and the new data arrives. It’s not even clear whether every new source of data should be added to the central model. Unless a large number of people are going to use it, why bother? For discovery, schema on read makes excellent sense.

Self-service technologies like spreadsheets and other great technology for data discovery are used to find answers from this new data. What is lost in this process is the fact that almost all of this data has structure that must be described in some way before the data is used. In a spreadsheet, you need to parse most data into columns. The end-user or analyst does this sort of modeling, not the central keeper of the database, the database administrator, or some other specialist. One thing to note about this sort of modeling is that it is done to support a particular purpose. It is not done to support thousands of users. In fact, adding this sort of structure to data is not generally thought of as modeling, but it is.
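
To make the contrast concrete, here is a minimal sketch of the two approaches in plain Python (using the standard sqlite3 and csv modules; the table, file contents, and values are invented for illustration): in the first case structure is declared before the data is stored, in the second it is imposed only at the moment the data is read.

```python
import csv
import io
import sqlite3

# --- Schema on write: declare the structure first, then load the data. ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, product TEXT, price REAL)")
conn.execute("INSERT INTO orders VALUES ('2015-03-22', 'bicycle', 300)")
conn.execute("INSERT INTO orders VALUES ('2015-03-22', 'lock', 18)")

# Thousands of users can now ask questions against the same agreed structure.
total = conn.execute("SELECT SUM(price) FROM orders").fetchone()[0]
print("schema on write, total revenue:", total)

# --- Schema on read: land the raw file as-is, impose structure at query time. ---
raw = "2015-03-22,bicycle,300\n2015-03-22,lock,18\n"  # no upfront model

# The analyst decides, at read time, which columns exist and what they mean.
rows = [
    {"order_date": d, "product": p, "price": float(v)}
    for d, p, v in csv.reader(io.StringIO(raw))
]
print("schema on read, total revenue:", sum(r["price"] for r in rows))
```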

Schema on write drives the business forward. So, for big data, for any data, structure must be captured and managed. The most profound evidence of this is the way that all of the “born digital” companies such as Facebook, Netflix, LinkedIn, and Twitter have added large scale SQL databases to their data platforms. These companies were forced to implement schema on write by the needs and scale of their businesses.

Schema on read leads to massive discoveries. Schema on write operationalizes them. They are not at odds; both contribute to the process of understanding data and making it useful. To make the most of all their data, businesses need both schema on read and schema on write.


Dan Woods is CTO and founder of CITO Research. He has written more than 20 books about the strategic intersection of business and technology. Dan writes about data science, cloud computing, mobility, and IT management in articles, books, and blogs, as well as in his popular column.


Bring your data management program out of the IT back office and into the enterprise spotlight


By Anita Filippi

The job of a data management professional is a tough one. How do you get ahead of the curve to build out the foundational components of an analytic ecosystem and still serve the immediate, and ever-changing, needs of the organization? The answer lies in the enterprise business initiatives. Align the work of the data management program to the enterprise business initiatives and become a part of what the company already has committed to and cares about.

At this point in the year you’re likely deep into 2016 planning. It’s a safe bet that on one side of your desk is a long list of projects you know you must do to be ready for the future, while on the other side is the list of projects from the enterprise portfolio that need your support. For leaders in data management, the pressure is on. Data is coming in through every channel and in every format. Analysts and data scientists want access to all of that data to help drive insights, competitive advantage and the customer experience to new levels. The tendency is to “go get it all” and get it fast. You know it’s important to get it right.

You may be tempted to name an IT project or suite of projects to mature the foundation of your data management program. After all, you’re going to need certain capabilities no matter what the business specifically requests. But if you create an IT foundation project, you’ll have to turn away other work, some of which is important. That can create stress in even the most stable IT/business relationships and may cause some of your business partners to go ahead without you. The right approach is to work from a single list, one that is aligned to your organization’s priorities – at the highest level.

“Move only if there is a real advantage to be gained.”

~ Sun Tzu, “The Art of War”

But aren’t business initiatives just projects?

Let’s define a business initiative. These are the top funded programs the business plans to do in the near future – like supply chain optimization or implementing digital marketing. They’re important because they are strategic to the business and the business is already committed to them. You usually find these listed on placards throughout the corporate campus. These are not the top 3 projects listed on the corporate portfolio or the IT prioritized project list. These initiatives are stated at the enterprise level and usually have numerous business and IT projects aligned to them.

Once you have a line of sight on those initiatives, you can identify data management requirements (not just the data – also data quality, metadata, master data, governance, etc.) within each and across all initiatives and consolidate those into a comprehensive, data-focused plan. Using the initiatives as guardrails, scope only what is needed, when it is needed. If you have one corporate initiative to implement a digital marketing capability and another one for supply chain optimization, the data management capabilities required to support those initiatives overlap. Develop a holistic delivery plan and use the initiatives to ensure the scope is contained to what is required to support the immediate corporate focus.

When you approach data management this way, no longer are the analytic ecosystem projects fighting for additional budget and struggling to keep up. They are now part of the strategic funding and execution schedule of the company.

A tale of two parties…

Here’s a compare and contrast.

Scenario 1: Your team has just delivered an integrated data store as a stand-alone IT foundation project. You’re at the celebration. The CIO is there saying “great work!” and everyone is touting the years of hard work, long hours and technical expertise your team contributed. BUT, there is no message about what the company is doing differently because of all that work – how customers are better served, operating margins are improved, or sales are increased. In fact, business representatives are only there so you can thank them for their UAT participation. You’re all proud, and rightly so, but the key message is: “Glad we got that done. Now everyone go forth and make use of it!”

Scenario 2: You’ve matured the enterprise data management program by directly supporting the company’s business initiatives. You have a roadmap that shows where you’ve focused your team’s efforts (information, applications, systems and enabling processes) and why (from a business perspective). Throughout the year, you delivered incremental value to your data management program by directly supporting the business. As the business horizon shifted, you shifted your capabilities and plan to meet the company’s needs. IT doesn’t get its own party for this one; rather, you’re invited to the big end-of-year gala, and your team’s contribution to corporate outcomes is evident. The CEO and business representatives are running the show, telling stories of a company that looks different today than it did a year ago, and you and your team played a key part in that.

Both celebrations are genuine. Both outcomes are good. But the cake tastes so much sweeter at the second party! You are working at the same pace as your company and, once this success is achieved with your business partners, you’re off – together – to achieve the next of the company’s top initiatives.

Back to those lists on your desk…

You have a great opportunity to become an accelerator for your company’s strategy! Get to know the enterprise initiatives and the executives that sponsor them. When you attach your work to the most important efforts in the company, your priority is clear and the value of your team’s work is undeniable.

Teradata Professional Services has a full suite of offerings to assess your Enterprise Data Management program, help you target the most valuable opportunities for maturity and build a roadmap that shows alignment to business initiatives and enabling processes (like data governance). Once we address the high level, we can focus where we’re most needed – analytic roadmap, data quality, data governance, etc. For more information, visit our website.


Anita Filippi is a Senior Professional Services Consultant in the Enterprise Strategy and Governance Center of Excellence. She consults with clients to mature enterprise data management capabilities using practical, step-by-step approaches that are aligned with business goals, aware of organizational readiness and focused on people, processes, and technology. Anita has spent the last 16 years in various leadership roles in Enterprise Information Management.


It is well-known that there are two extreme alternatives for storing database tables on any storage medium: storing them row by row (as done by traditional “row-store” technology) or storing them column by column (as done by recently popular “column-store” implementations). Row-stores store the entire first row of the table, followed by the entire second row of the table, and so on. Column-stores store the entire first column of the table, followed by the entire second column of the table, and so on. A huge amount of research literature and commercial whitepapers discuss the various advantages of these alternative approaches, along with various proposals for hybrid solutions (which I discussed in more detail in my previous post).

Despite the many conflicting arguments in favor of these different approaches, there is little question that column-stores compress data much better than row-stores. The reason is fairly intuitive: in a column-store, entire columns are stored contiguously --- in other words, a series of values from the same attribute domain are stored consecutively. In a row-store, values from different attribute domains are interspersed, thereby reducing the self-similarity of the data. In general, the more self-similarity (lower entropy) you have in a dataset, the more compressible it is. Hence, column-stores are more compressible than row-stores.

In general, compression rates are very sensitive to the particular dataset that is being compressed. Therefore it is impossible to make any kind of guarantees about how much a particular database system/compression algorithm will compress an arbitrary dataset. However, as a general rule of thumb, it is reasonable to expect around 8X compression if a column-store is used on many kinds of datasets. 8X compression means that the compressed dataset is 1/8th the original size, and scan-based queries over the dataset can thus proceed approximately 8 times as fast. This stellar compression and resulting performance improvements are a major contributor to the recent popularity of column-stores.

It is precisely this renowned compression of column-stores which makes the compression rate of RainStor (a recent Teradata acquisition) so impressive in comparison. RainStor claims a factor of 5 times more compression than what column-stores are able to achieve on the same datasets, and 40X compression overall.

Although the reason why column-stores compress data better than row-stores is fairly intuitive, the reason why RainStor can compress data better than column-stores is less intuitive. Therefore, we will now explain this in more detail.

Take for example the following table, which is a subset of a table describing orders from a particular retail enterprise that sells bicycles and related parts. (A real table would have many more rows and columns, but we keep this example simple so that it is easier to understand what is going on).

Record Order date Ship date Product Price
1 03/22/2015 03/23/2015 “bicycle” 300
2 03/22/2015 03/24/2015 “lock” 18
3 03/22/2015 03/24/2015 “tire” 70
4 03/22/2015 03/23/2015 “lock” 18
5 03/22/2015 03/24/2015 “bicycle” 250
6 03/22/2015 03/23/2015 “bicycle” 280
7 03/22/2015 03/23/2015 “tire” 70
8 03/22/2015 03/23/2015 “lock” 18
9 03/22/2015 03/24/2015 “bicycle” 280
10 03/23/2015 03/24/2015 “lock” 18
11 03/23/2015 03/25/2015 “bicycle” 300
12 03/23/2015 03/24/2015 “bicycle” 280
13 03/23/2015 03/24/2015 “tire” 70
14 03/23/2015 03/25/2015 “bicycle” 250
15 03/23/2015 03/25/2015 “bicycle” 280


The table contains 15 records and shows four attributes --- the order date, the ship date, the product that was purchased, and the purchase price. Note that there are relationships between some of these columns --- in particular, the ship date is usually 1 or 2 days after the order date, and the price of a given product is usually consistent across orders, though there may be slight variations in price depending on what coupons the customer used to make the purchase.

A column-store would likely use “run-length encoding” to compress the order date column. Since records are sorted by order date, this would compress the column to near its minimum --- it can be compressed as (03/22/2015, 9); (03/23/2015, 6) --- which indicates that 03/22/2015 is repeated 9 straight times, followed by 03/23/2015, which is repeated 6 times. The ship date column, although not sorted, is still very compressible, as each value can be expressed using a small number of bits in terms of how much larger (or smaller) it is than the previous value in the column. However, the other two columns --- product and price --- would likely be compressed using a variant of dictionary compression, where each value is mapped to the minimal number of bits needed to represent it. For large datasets, where there are many unique values for price (or even for product), the number of bits needed to represent a dictionary entry is non-trivial, and the same dictionary entry is repeated in the compressed dataset for every repeated value in the original dataset.
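
As a rough illustration of those two techniques (ordinary Python, not any particular column-store's implementation), here is how the order date and product columns from the example table might be run-length and dictionary encoded:

```python
from itertools import groupby

order_date = ["03/22/2015"] * 9 + ["03/23/2015"] * 6
product = ["bicycle", "lock", "tire", "lock", "bicycle", "bicycle",
           "tire", "lock", "bicycle", "lock", "bicycle", "bicycle",
           "tire", "bicycle", "bicycle"]

# Run-length encoding: ideal for a sorted column such as order date.
rle = [(value, len(list(run))) for value, run in groupby(order_date)]
print(rle)  # [('03/22/2015', 9), ('03/23/2015', 6)]

# Dictionary encoding: each distinct value is mapped to a small integer code,
# but the code is still repeated once per row in the compressed column.
dictionary = {v: code for code, v in enumerate(sorted(set(product)))}
encoded = [dictionary[v] for v in product]
print(dictionary)  # {'bicycle': 0, 'lock': 1, 'tire': 2}
print(encoded)     # one (small) code per original row
```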

In contrast, in RainStor, every unique value in the dataset is stored once (and only once), and every record is represented as a binary tree, where a breadth-first traversal of the tree enables the reconstruction of the original record. For example, the table shown above is compressed in RainStor using the forest of binary trees shown below. There are 15 binary trees (each of the 15 roots of these trees is shown as a green circle at the top of the figure), corresponding to the 15 records in the original dataset.

Forest of Binary Trees Compression

For example, the binary tree corresponding to record 1 is shown on the left side of the figure. The root points to two children --- the internal nodes “A” and “E”. In turn, node “A” points to 03/22/2015 (corresponding to the order date of record 1) and to 03/23/2015 (corresponding to the ship date of record 1). Node “E” points to “bicycle” (corresponding to the product of record 1) and to “300” (corresponding to the price of record 1).

Note that records 4, 6, and 7 also have an order date of 03/22/2015 and a ship date of 03/23/2015. Therefore, the roots of the binary trees corresponding to those records also point to internal node “A”. Similarly, note that record 11 also is associated with the purchase of a bicycle for $300. Therefore, the root for record 11 also points to internal node “E”.

These shared internal nodes are what makes RainStor’s compression algorithm fundamentally different from any algorithm that a column-store is capable of performing. Column-stores are forced to create dictionaries and search for patterns only within individual columns. In contrast, RainStor’s compression algorithm finds patterns across different columns --- identifying the relationship between ship date and order date and the relationship between product and price, and leveraging these relationships to share branches in the trees that are formed, thereby eliminating redundant information. RainStor thus has fundamentally more room to search for patterns in the dataset and compress data by referencing these patterns via the (compressed) location of the root of the shared branch.
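
A deliberately simplified sketch of the idea in Python (an illustration of the principle, not RainStor's actual format): each distinct (order date, ship date) pair and each distinct (product, price) pair becomes a shared internal node, and every record reduces to a pair of references to those nodes.

```python
# Sketch of the forest-of-binary-trees idea: values and internal nodes are
# stored once; each record's "tree" is a pair of references to shared nodes.
records = [
    ("03/22/2015", "03/23/2015", "bicycle", 300),
    ("03/22/2015", "03/24/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "tire", 70),
    ("03/22/2015", "03/23/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "bicycle", 250),
    # ... remaining records from the example table
]

date_nodes = {}     # internal nodes like "A": (order date, ship date)
detail_nodes = {}   # internal nodes like "E": (product, price)
roots = []          # one root per record: (date node id, detail node id)

for order_date, ship_date, product, price in records:
    d = date_nodes.setdefault((order_date, ship_date), len(date_nodes))
    e = detail_nodes.setdefault((product, price), len(detail_nodes))
    roots.append((d, e))

# Records 1 and 4 reference the same date node; the shared branch is stored once.
print(roots)
print(len(date_nodes), "distinct date branches for", len(records), "records")
```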

For a traditional archiving solution, compression rate is arguably the most important feature (right up there with immutability). Indeed, RainStor’s compression algorithm enables it to be used for archival use-cases, and RainStor provides all of the additional features you would expect from an archiving solution: encryption, LDAP/AD/PAM/Kerberos/PCI authentication and security, audit trails and logging, retention rules, expiry policies, and integrated implementation of existing compliance standards (e.g. SEC 17a-4).

However, what brings RainStor to the next level in the archival solutions market is that it is an “active” archive, meaning that the data that is managed by RainStor can be queried at high performance. RainStor provides a mature SQL stack for native querying of compressed RainStor data, including ANSI SQL 1992 and 2003 parsers, and a full MPP query execution engine. For enterprises with Hadoop clusters, RainStor is fully integrated with the Cloudera and Hortonworks distributions of Hadoop --- RainStor compressed data files can be partitioned over a HDFS cluster, and queried in parallel with HiveQL (or MapReduce or Pig). Furthermore, RainStor integrates with YARN for resource management, with HCatalog for metadata management, and with Ambari for system monitoring and management.

The reason why most archival solutions are not “active” is that the compression algorithms used to reduce the data size before archival are so heavy-weight that significant processing resources must be invested in decompressing the data before it can be queried. Therefore, it is preferable to leave the data archived in compressed form and only decompress it at times of significant need. In general, a user should expect significant query performance reductions relative to querying uncompressed data, to account for the additional decompression time.

The beauty of RainStor’s compression algorithm is that even though it achieves compression ratios comparable to other archival products, its compression algorithm is not so heavy-weight that the data must be decompressed prior to querying it. In particular, the binary tree structures shown above are actually fairly straightforward to perform query operations on directly, without requiring decompression prior to access. For example, a count distinct or a group-by operation can be performed via a scan of the leaves of the binary trees. Furthermore, selections can be performed via a reverse traversal of the binary trees from the leaves that match the selection predicate. In general, since there is a one-to-one mapping between records in the uncompressed dataset and the binary trees in RainStor’s compressed files, all query operations can be expressed in terms of operations on these binary trees. Therefore, RainStor queries can benefit from the I/O improvement of scanning in less data (due to the smaller size of the compressed files on disk/memory) without paying the cost to fully decompress these files after they are read from storage. This leads to RainStor’s claims of 2X-100X performance improvement on most queries --- an industry-leading claim in the archival market.
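
Continuing the simplified sketch from above (again, this illustrates the principle rather than RainStor's engine, and re-declares a small compressed form so it runs on its own), such queries can be answered without ever rebuilding the original rows:

```python
# Self-contained sketch: querying the compressed representation directly.
# detail_nodes maps a shared (product, price) branch to its node id;
# roots lists, per record, which date branch and which detail branch it uses.
detail_nodes = {("bicycle", 300): 0, ("lock", 18): 1, ("tire", 70): 2,
                ("bicycle", 250): 3}
date_nodes = {("03/22/2015", "03/23/2015"): 0, ("03/22/2015", "03/24/2015"): 1}
roots = [(0, 0), (1, 1), (1, 2), (0, 1), (1, 3)]  # first five example records

# COUNT(DISTINCT product): scan the leaves once, never touching the records.
distinct_products = {product for product, _ in detail_nodes}
print(len(distinct_products))  # 3

# SELECT ... WHERE product = 'lock': find the matching detail branches, then
# walk back from those branches to the records (roots) that reference them.
lock_branches = {node_id for (product, _), node_id in detail_nodes.items()
                 if product == "lock"}
matching_records = [i + 1 for i, (_, detail) in enumerate(roots)
                    if detail in lock_branches]
print(matching_records)  # records 2 and 4
```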

In short, RainStor’s strong claims around compression and performance are backed up by the technology that is used under the covers. Its compression algorithm is able to identify and remove redundancy both within and across columns. Furthermore, the resulting data structures produced by the algorithm are amenable to direct operation on the compressed data. This allows the compressed files to be queried at high performance, and positions RainStor as a leading active-archive solution.



Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). Twitter: @daniel_abadi.

Funnel Analysis: an Approach from the Power Marketer Playbook

Posted on: March 31st, 2015 by Guest Blogger


Power marketers are always interested in the most effective ways to track, measure, and analyze customer experiences for more relevant engagement. I’d like to share an approach that is less known yet potentially quite powerful.

Businesses across global markets are re-thinking data, analytics, platforms, and research methods to better understand their customers. Event analytics offers a new view of the customer, leveraging the best technologies and diverse data sources to obtain actionable insights in real time. Traditional methods help us understand consumers in terms of who, what, when, and where. Yet two of the most important questions for understanding consumers (“why” and “how”) remain unanswered. The answers are key to obtaining business value because they explain how and why consumers interact with a company.

Traditional approaches focus on how the customer looks to the business. For example, what do you buy? What segments are you in? When was your last visit? However, the more important question should be “how does the business look to the customer?” How do our customers experience our products and brands? How do customers feel at each touch point?

One major advantage of event analytics over traditional methods is that it can improve our understanding of the customer’s view of the business. Traditional systems are not designed to solicit, extract and stitch together customer experience data well. Event analytics obtains information about the entire customer experience in detail, threading together many sources of information from different applications that combine to deliver the full view of customer experience.

To conduct event analytics, businesses need to create a “customer experience universe” that stitches customers’ experiences together, allows for easy behavior pattern recognition and facilitates visualizations of customer behaviors. This universe includes social media, customer experience, marketing channels, mobile apps, and devices. Then, machine learning algorithms are used to run through all the data to identify patterns.

Event analytics is an ecosystem that includes, for example, streaming ingestion of events, an event repository, event metadata, a guided user interface for business analysts, and machine learning algorithms. One category of use cases is funnel analytics, which helps us understand customer behavioral patterns and what triggers their experiences.

Funnel analysis provides visibility across a series of customer experience events that lead towards a defined goal, say, from user engagement in a mobile app to a sale in an eCommerce platform. Funnel analyses are an effective way to calculate conversion rates on specific user behaviors, yet funnel analytics can be complex due to the difficulty in source categorization, visitor identification, pathing, attribution and conversion.

Funnels can be built using a single guided user interface without needing to write code or move data. As a result, event analytics can scale at the speed of business. It is a smart analytic approach because it creates visibility into the path that users are most likely to follow to achieve their goals.
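
For readers curious about what such a funnel computes under the hood, here is a minimal sketch in plain Python; the customers, events, and stage names are invented for illustration, not taken from any Teradata product.

```python
from collections import defaultdict

# Illustrative event log: (customer id, event), in time order per customer.
events = [
    (1, "app_open"), (1, "view_product"), (1, "add_to_cart"), (1, "purchase"),
    (2, "app_open"), (2, "view_product"),
    (3, "app_open"), (3, "view_product"), (3, "add_to_cart"),
    (4, "app_open"),
]

funnel = ["app_open", "view_product", "add_to_cart", "purchase"]

# Group events per customer, then measure how far each one progressed in order.
by_customer = defaultdict(list)
for customer, event in events:
    by_customer[customer].append(event)

progress = {}
for customer, seq in by_customer.items():
    stage = 0
    for event in seq:
        if stage < len(funnel) and event == funnel[stage]:
            stage += 1
    progress[customer] = stage

# Conversion rate at each stage: share of customers who reached it.
total = len(by_customer)
for i, step in enumerate(funnel, start=1):
    reached = sum(1 for s in progress.values() if s >= i)
    print(f"{step}: {reached}/{total} = {reached / total:.0%}")
```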

The value of having this insight is of great significance since it gives marketers a deep, data-driven line of sight into the customer experience universe.

James Semenak

James Semenak is a Principal Consultant with Teradata – known as an evangelist and architect for Event Analytics as well as Big Data Analytics and strategies. James consults on all things related to data and analytics around the internet, and has worked with Shutterfly, Expedia, eBay Enterprise, Charles Schwab, Nokia, eBay, PayPal, Real Networks, Electronic Arts, and Meredith Corp.



Lots of Big Data Talk, Little Big Data Action

Posted on: February 11th, 2015 by Data Analytics Staff


 Apps Are One Solution To Big Data Complexity

Offering big data apps is a great way for the analytics industry to put its money where its mouth is. Organizations face great hurdles in trying to benefit from the opportunities of big data, and extracting rapid value from it remains challenging.

Limited skill sets and complexity make it challenging for analytic professionals to rapidly and consistently derive actionable insights that can be easily operationalized. To ease companies into realizing bankable big data benefits, Teradata has developed a collection of big data apps – pre-built templates that act as time-saving shortcuts to value. Teradata is taking the lead by offering advanced analytic apps, powered by the Teradata Aster AppCenter, that deliver sophisticated results from big data analytics.

The big data apps from Teradata are industry-tailored analytical templates that address business challenges specific to each industry. Purpose-built apps for retail address path to purchase and shopping cart abandonment. Apps for healthcare map paths to surgery and drug prescription affinity. Financial apps tackle omni-channel customer experiences and fraud. The industries covered include consumer finance, entertainment and gaming, healthcare, manufacturing, retail, communications, and travel and hospitality.

Big data apps are pre-built templates that can be further configured with help from Teradata professional services to address specific customer needs or goals. Organizations have found that specialized big data skills such as Python, R, Java and MapReduce take time to build and require highly specialized staff. Conversely, apps deliver fast time to value with self-service analytics. The purpose-built apps can be quickly deployed and configured or customized with minimal effort to deliver swift analytic value.

For app distribution, consumption and custom app development, the AppCenter makes big data analytics secure, scalable and repeatable by providing common services to build, deploy and consume apps.

With the apps and related solutions like AppCenter from Teradata, analytic professionals spend less time preparing data and more time doing discovery and iteration to find new insights and value.

Get more big data insights now!



Change and “Ah-Ha Moments”

Posted on: March 31st, 2014 by Data Analytics Staff


This is the first in a series of articles discussing the inherent nature of change and some useful suggestions for helping operationalize those “ah-ha moments.”

Nobody has ever said that change is easy. It is a journey full of obstacles. But those obstacles are not insurmountable; with the right planning and communication, many of them can be cleared away, leaving a more defined path for change to follow.

So why do we so often see failures that could have been avoided if obvious changes had been made before the problem occurred? The data was analyzed, and yet nobody acted on the insights. Why does the organization fail to do what I call operationalizing the ah-ha moment? Was it a conscious decision?

From the outside looking in it is easy to criticize organizations for not implementing obvious changes. But from the inside, there are many issues that cripple the efforts of change, and it usually boils down to time, people, process, technology or financial challenges.  

Companies make significant investments in business intelligence capabilities because they realize that hidden within the vast amounts of information they generate on a daily basis there are jewels to be found that can provide valuable insights for the entire organization. For example, with today's analytic platforms, business analysts in the marketing department have access to sophisticated tools that can mine information and uncover reasons for the high rate of churn occurring in their customer base. They might do this by analyzing all interactions and conversations taking place across the enterprise and the channels where customers engage the company. Using this data, analysts then begin to see various paths and patterns emerging from these interactions that ultimately lead to customer churn.
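
As a toy illustration of that kind of path analysis (plain Python over invented interaction data, not any specific analytic platform), counting the event sequences that precede churn might look like this:

```python
from collections import Counter

# Invented interaction histories; the flag marks whether the customer churned.
histories = [
    (["call_support", "call_support", "billing_dispute"], True),
    (["web_login", "call_support", "billing_dispute"], True),
    (["web_login", "purchase"], False),
    (["call_support", "billing_dispute", "call_support"], True),
    (["web_login", "purchase", "web_login"], False),
]

# Count two-step paths that appear in churned customers' histories.
churn_paths = Counter()
for events, churned in histories:
    if churned:
        churn_paths.update(zip(events, events[1:]))

# The most frequent paths preceding churn are candidates for intervention.
for path, count in churn_paths.most_common(3):
    print(" -> ".join(path), count)
```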

These analysts have just discovered the leading causes of churn within their organization and are at the apex of the ah-ha moment. They now have the insights to stop the mass exodus of valuable customers and positively impact the bottom line. It seems obvious that these insights would be acted upon and operationalized immediately, but that may not be the case. Perhaps the recently discovered patterns leading to customer churn touch so many internal systems, processes and organizations that getting organizational buy-in to the necessary changes is mired down in an endless series of internal meetings.

So what can be done given these realities? Here’s a quick list of tips that will help you enable change in your organization:

  • Someone needs to own the change and then lead rather than letting change lead him or her.
  • Make sure the reasons for change are well documented including measurable impacts and benefits for the organization.
  • When building a change management plan, identify the obstacles in the organization and make sure to build a mitigation plan for each.
  • Communicate the needed changes through several channels.
  • Be clear when communicating change. Rumors can quickly derail or stall well thought out and planned change efforts.
  • When implementing changes make sure that the change is ready to be implemented and is fully tested.
  • Communicate the impact of the changes that have been deployed.  
  • Have enthusiastic people on the team and train them to be agents of change.
  • Establish credibility by building a proven track record that will give management the confidence that the team has the skills, creativity and discipline to implement these complex changes. 

Once implemented, monitor the changes closely and anticipate that some changes will require further refinement. Remember that operationalizing the ah-ha moment is a journey, one that can bring many valuable and rewarding benefits along the way.

So, what’s your experience with operationalizing your "ah-ha moment"?

The integration issue that dare not speak its name…

Posted on: March 25th, 2014 by Patrick Teunissen


Having worked with multinational companies running SAP ERP systems for many years, I know that they (nearly) always have more than one SAP system to record their transactional data. Yet it is never discussed -- and it seems to be the 'Macbeth' of the SAP world, a fact that should not be uttered out loud…

My first experience with SAP's software solutions dates back to 1989, whilst at Shell Chemicals in the Netherlands, exactly 25 years ago. What strikes me most after all these years is that people talk about SAP as if it is one system covering everything that is important to business.

Undoubtedly SAP has had a huge impact on enterprise computing. I remember that at Shell, prior to the implementation of SAP, we ran a vast number of transaction systems. The purchasing and stock management systems, for example, were stand-alone and not integrated with the general ledger system. The integration of these transaction systems had to be done via interfaces, some of which were manual (information had to be re-keyed). Only at month end, after all interfaces had run, would the ledger show the proper stock value and accounts payable. So thanks to SAP, the number of transaction systems has been dramatically reduced.

But of course the Shell Refining Company had its own SAP system, just like the businesses in the UK, Germany and so on. So in the late 80s Shell already ran numerous different SAP systems.

However, this contradicts one of SAP's key messages: the ability to integrate all sorts of transactional information and provide relevant data for analytical purposes in a single system (reference Dr. Plattner's 2011 Beijing speech).

I have always struggled with the definition of “relevant data”, as I believe that what is relevant depends on three things: the user, the context and time. For an operator of a chemical plant, for example, the current temperature of the unit and the product conversion yields are likely to be “relevant”, as this is the data needed to steer the current process. For the plant director, the volumes produced and the overall processing efficiency of the last month may be “relevant”, as this is what his peers in the management team will challenge him on. SAP systems are, as far as I know, not used to operate manufacturing plants, in which case the only conclusion can be that not all relevant data is in SAP. What you could say, though, is that it is very likely that the “accounting” data is in SAP, hence SAP could be the source for the plant's management team reports.


However, when businesses are running multiple SAP systems, as described earlier, the conclusion cannot be that there is a (as in one) SAP system in which all the relevant accounting data is processed. So a regional director responsible for numerous manufacturing sites may have to deal with data collected from multiple SAP systems when he or she needs to analyze the total costs of manufacturing for the last quarter. Probably because this does not really fit with SAP's key message - one system for both transaction processing and analytics - they have no solution. I googled “analytics for multiple SAP systems”; other than the Teradata link, nothing turns up that will help our regional director. Even when the irrelevant words “analytics for” are removed, only very technical and specific solutions are found.

Some people believe that this problem with analytics will be solved over time. Quite a few larger enterprises start with what I call re-implementations of the SAP software. Five years after my first exposure to SAP at Shell Chemicals in the Netherlands, I became a member of the team responsible for the “re-implementation” of the software for Shell's European Chemicals business. Of course there were cost benefits (fewer SAP systems = lower operational cost for the enterprise), and some supply chain related transactions could be processed more efficiently from the single system. But the region was still not really benefitting from it, as the (national/legal) company in SAP is the most important object around which a lot has been organized (or configured). Hence most multinational enterprises use another software product into which data is interfaced for the purpose of regional consolidation.
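
To give a feel for what that consolidation work amounts to, here is a minimal sketch in Python; the system names, company codes, and figures are invented, and a real implementation would also have to reconcile currencies, charts of accounts, and master data across the systems.

```python
from collections import defaultdict

# Invented cost extracts from three separate SAP systems. BUKRS is the SAP
# company code; each system has its own set of companies, so nothing lines
# up automatically across systems.
extracts = {
    "SAP_EU": [{"bukrs": "NL01", "period": "2014-Q1", "cost": 1_200_000},
               {"bukrs": "DE01", "period": "2014-Q1", "cost": 2_500_000}],
    "SAP_UK": [{"bukrs": "GB01", "period": "2014-Q1", "cost": 1_800_000}],
    "SAP_US": [{"bukrs": "US01", "period": "2014-Q1", "cost": 3_100_000}],
}

# The integration step the regional director actually needs: pull every row
# out of its source system and roll the figures up into one enterprise view.
regional_cost = defaultdict(float)
for system, rows in extracts.items():
    for row in rows:
        regional_cost[row["period"]] += row["cost"]

print(dict(regional_cost))  # one consolidated figure per period, across systems
```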

I was employed by Shell for almost 10 years. It is a special company and I am still in contact with a few people that I worked with. The other day I asked about the SAP landscape as it is today and was told that, 25 years after my first SAP experience they are still running multiple SAP systems and re-implementation projects. As I consider myself an expert in SAP I am sure I could have built a career on the re-implementation of the SAP systems.

The point that I want to make with this post is that many businesses need to take into account that they run multiple SAP systems, and more importantly that these systems are not automatically integrated. This fact has a huge impact on the analytics of the SAP data and the work required to provide an enterprise view of the business. So if you are involved in the delivery of analytical solutions to the organization, you should factor “the Scottish play” issue into the heart of your design, even if nobody else wants to talk about it.



2 This is why an appreciated colleague, a manufacturing consultant leader, always refers to SAP as the “Standard Accounting Package”.

3 In SAP the “Company” (T001-BUKRS) is probably the most important data object, around which a lot has been organized (configured). Within SAP, consolidation of these “companies” is not an obvious thing to do. Extensions of the financial module (FI) designed to consolidate are difficult to operate and hardly ever used. Add to this the fact that almost every larger enterprise has multiple SAP systems, and it becomes clear why consolidation takes place in “another” system.

4 In 2007 SAP acquired OutlookSoft now known as SAP BPC (Business Planning & Consolidation) for this very purpose.

Teradata’s UDA is to Data as Prius is to Engines

Posted on: November 12th, 2013 by Teradata Aster


I’ve been working in the analytics and database market for 12 years. One of the most interesting pieces of that journey has been seeing how the market is ever-shifting. Both the technology and business trends during these short 12 years have massively changed not only the tech landscape today, but also the future evolution of analytic technology. From a “buzz” perspective, I’ve seen “corporate initiatives” and “big ideas” come and go: everything from “e-business intelligence,” which was a popular term when I first started working at Business Objects in 2001, to corporate performance management (CPM) and “the balanced scorecard,” from business process management (BPM) to “big data,” and now the architectures and tools that everyone is talking about.

The one golden thread that ties each of these terms, ideas and innovations together is that each is aiming to solve the questions related to what we are today calling “big data.” At the core of it all, we are searching for the right way to enable the explosion of data and analytics that today’s organizations are faced with, to simply be harnessed and understood. People call this the “logical data warehouse”, “big data architecture”, “next-generation data architecture”, “modern data architecture”, “unified data architecture”, or (I just saw last week) “unified data platform”.  What is all the fuss about, and what is really new?  My goal in this post and the next few will be to explain how the customers I work with are attacking the “big data” problem. We call it the Teradata Unified Data Architecture, but whatever you call it, the goals and concepts remain the same.

Mark Beyer from Gartner is credited with coining the term “logical data warehouse,” and there is an interesting story and explanation. A nice summary of the term is:

“The logical data warehouse is the next significant evolution of information integration because it includes ALL of its progenitors and demands that each piece of previously proven engineering in the architecture should be used in its best and most appropriate place. … The logical data warehouse will finally provide the information services platform for the applications of the highly competitive companies and organizations in the early 21st Century.”

The idea of this next-generation architecture is simple: When organizations put ALL of their data to work, they can make smarter decisions.

It sounds easy, but as data volumes and data types explode, so does the need for more tools in your toolbox to help make sense of it all. Within your toolbox, data is NOT all nails and you definitely need to be armed with more than a hammer.

In my view, enterprise data architectures are evolving to let organizations capture more data. The data was previously untapped because the hardware costs required to store and process such enormous amounts of data were simply too high. However, the declining cost of hardware (thanks to Moore’s law) has opened the door for more data (types, volumes, etc.) and processing technologies to be successful. But no single technology can be engineered and optimized for every dimension of analytic processing, including scale, performance, and concurrent workloads.

Thus, organizations are creating best-of-breed architectures by taking advantage of new technologies and workload-specific platforms such as MapReduce, Hadoop, MPP data warehouses, discovery platforms and event processing, and putting them together into a seamless, transparent and powerful analytic environment. This modern enterprise architecture enables users to get deep business insights and allows ALL data to be available to an organization, creating competitive advantage while lowering the total system cost.

But why not just throw all your data into files and put a search engine like Google on top? Why not just build a data warehouse and extend it with support for “unstructured” data? Because, in the world of big data, the one-size-fits-all approach simply doesn’t work.

Different technologies are more efficient at solving different analytical or processing problems. To steal an analogy from Dave Schrader—a colleague of mine—it’s not unlike a hybrid car. The Toyota Prius can average 47 mpg with hybrid (gas and electric) vs. 24 mpg with a “typical” gas-only car – almost double! But you do not pay twice as much for the car.

How’d they do it? Toyota engineered a system that uses gas when I need to accelerate fast (and also to recharge the battery at the same time), electric mostly when driving around town, and braking to recharge the battery.

Three components integrated seamlessly – the driver doesn’t need to know how it works.  It is the same idea with the Teradata UDA, which is a hybrid architecture for extracting the most insights per unit of time – at least doubling your insight capabilities at reasonable cost. And, business users don’t need to know all of the gory details. Teradata builds analytic engines—much like the hybrid drive train Toyota builds— that are optimized and used in combinations with different ecosystem tools depending on customer preferences and requirements, within their overall data architecture.

In the case of the hybrid car, battery power and braking systems, which recharge the battery, are the “new innovations” combined with gas-powered engines. Similarly, there are several innovations in data management and analytics that are shaping the unified data architecture, such as discovery platforms and Hadoop. Each customer’s architecture is different depending on requirements and preferences, but the Teradata Unified Data Architecture recommends three core components of a comprehensive architecture – a data platform (often called a “data lake”), a discovery platform and an integrated data warehouse. There are other components, such as event processing, search, and streaming, which can be used in data architectures, but I’ll focus on the three core areas in this blog post.

Data Lakes

In many ways, this is not unlike the operational data store we’ve seen between transactional systems and the data warehouse, but the data lake is bigger and less structured. Any file can be “dumped” in the lake with no attention to data integration or transformation. New technologies like Hadoop provide a file-based approach to capturing large amounts of data without requiring ETL in advance. This enables large-scale processing for refining, structuring, and exploring data prior to downstream analysis in workload-specific systems, which are used to discover new insights and then move those insights into business operations for use by hundreds of end-users and applications.

Discovery Platforms

Discovery platforms are a new class of workload-specific system optimized to perform multiple analytic techniques in a single workflow, combining SQL with statistics, MapReduce, graph, or text analysis to look at data from multiple perspectives. The goal is to ultimately provide more granular and accurate insights to users about their business. Discovery platforms enable a faster investigative analytical process to find new patterns in data and identify different types of fraud or consumer behavior that traditional data mining approaches may have missed.

Integrated Data Warehouses

With all the excitement about what’s new, companies quickly forget the value of consistent, integrated data for reuse across the enterprise. The integrated data warehouse has become a mission-critical operational system that is the point of value realization, or “operationalization,” for information. The data within a massively parallel data warehouse has been cleansed and provides a consistent source of data for enterprise analytics. By integrating relevant data from across the entire organization, two key goals are achieved. First, companies can answer the kind of sophisticated, impactful questions that require cross-functional analyses. Second, they can answer questions more completely by making relevant data available across all levels of the organization. Data lakes (Hadoop) and discovery platforms complement the data warehouse by enriching it with new data and new insights that can now be delivered to thousands of users and applications with consistent performance (i.e., they get the information they need quickly).
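
As a deliberately simplified sketch of that division of labor (plain Python, with sqlite3 standing in for the integrated warehouse; the files, fields, and the "finding" are invented), data might move through the three components like this:

```python
import json
import sqlite3
from collections import Counter

# 1. Data lake: raw events are landed as-is, with no upfront modeling.
lake = [
    '{"user": "a", "action": "view", "sku": "bike-1"}',
    '{"user": "a", "action": "buy",  "sku": "bike-1"}',
    '{"user": "b", "action": "view", "sku": "lock-9"}',
]

# 2. Discovery: explore the raw data, impose just enough structure to find a
#    pattern worth operationalizing (here, simply the most-viewed products).
parsed = [json.loads(line) for line in lake]
views = Counter(e["sku"] for e in parsed if e["action"] == "view")
print("discovery finding:", views.most_common(1))

# 3. Integrated warehouse: the cleansed, agreed result is loaded into a modeled
#    table where many users and applications can reuse it consistently.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE product_views (sku TEXT, view_count INTEGER)")
warehouse.executemany("INSERT INTO product_views VALUES (?, ?)", views.items())
print(warehouse.execute("SELECT * FROM product_views").fetchall())
```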

A critical part of incorporating these novel approaches to data management and analytics is putting new insights and technologies into production in reliable, secure and manageable ways for organizations.  Fundamentals of master data management, metadata, security, data lineage, integrated data and reuse all still apply!

The excitement of experimenting with new technologies is fading. More and more, our customers are asking us about ways to put the power of new systems (and the insights they provide) into large-scale operation and production. This requires unified system management and monitoring, intelligent query routing, metadata about incoming data and the transformations applied throughout the data processing and analytical process, and role-based security that respects and applies data privacy, encryption and other required policies. This is where I will spend a good bit of time in my next blog post.