data management

 

It is well known that there are two extreme alternatives for storing database tables on any storage medium: storing them row-by-row (as done by traditional “row-store” technology) or storing them column-by-column (as done by recently popular “column-store” implementations). Row-stores store the entire first row of the table, followed by the entire second row of the table, etc. Column-stores store the entire first column of the table, followed by the entire second column of the table, etc. A large body of research literature and commercial whitepapers discusses the various advantages of these alternative approaches, along with various proposals for hybrid solutions (which I discussed in more detail in my previous post).
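To make the two layouts concrete, here is a minimal Python sketch. The three-row table and the function names are made up for illustration, and real systems add pages, headers, and encodings on top of this ordering.

```python
# Hypothetical three-row table: (order date, product, price).
rows = [
    ("03/22/2015", "bicycle", 300),
    ("03/22/2015", "lock", 18),
    ("03/23/2015", "tire", 70),
]

def row_store_layout(rows):
    """Flatten the table row by row: all of row 1, then all of row 2, etc."""
    return [value for row in rows for value in row]

def column_store_layout(rows):
    """Flatten the table column by column: all of column 1, then column 2, etc."""
    return [value for column in zip(*rows) for value in column]

row_major = row_store_layout(rows)
column_major = column_store_layout(rows)
```

In `column_major`, the three dates sit next to each other, then the three products, then the three prices, whereas `row_major` interleaves the three domains.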

Despite the many conflicting arguments in favor of these different approaches, there is little question that column-stores compress data much better than row-stores. The reason is fairly intuitive: in a column-store, entire columns are stored contiguously --- in other words, a series of values from the same attribute domain are stored consecutively. In a row-store, values from different attribute domains are interspersed, thereby reducing the self-similarity of the data. In general the more self-similarity (lower entropy) you have in a dataset, the more compressible it is. Hence, column-stores are more compressible than row-stores.

In general, compression rates are very sensitive to the particular dataset that is being compressed. Therefore it is impossible to make any kind of guarantees about how much a particular database system/compression algorithm will compress an arbitrary dataset. However, as a general rule of thumb, it is reasonable to expect around 8X compression if a column-store is used on many kinds of datasets. 8X compression means that the compressed dataset is 1/8th the original size, and scan-based queries over the dataset can thus proceed approximately 8 times as fast. This stellar compression and resulting performance improvements are a major contributor to the recent popularity of column-stores.

It is precisely this renowned compression of column-stores which makes the compression rate of RainStor (a recent Teradata acquisition) so impressive in comparison. RainStor claims 5 times more compression than what column-stores are able to achieve on the same datasets, and 40X compression overall.

Although the reason why column-stores compress data better than row-stores is fairly intuitive, the reason why RainStor can compress data better than column-stores is less intuitive. Therefore, we will now explain this in more detail.

Take for example the following table, which is a subset of a table describing orders from a particular retail enterprise that sells bicycles and related parts. (A real table would have many more rows and columns, but we keep this example simple so that it is easier to understand what is going on.)

Record  Order date  Ship date   Product    Price
1       03/22/2015  03/23/2015  “bicycle”  300
2       03/22/2015  03/24/2015  “lock”     18
3       03/22/2015  03/24/2015  “tire”     70
4       03/22/2015  03/23/2015  “lock”     18
5       03/22/2015  03/24/2015  “bicycle”  250
6       03/22/2015  03/23/2015  “bicycle”  280
7       03/22/2015  03/23/2015  “tire”     70
8       03/22/2015  03/23/2015  “lock”     18
9       03/22/2015  03/24/2015  “bicycle”  280
10      03/23/2015  03/24/2015  “lock”     18
11      03/23/2015  03/25/2015  “bicycle”  300
12      03/23/2015  03/24/2015  “bicycle”  280
13      03/23/2015  03/24/2015  “tire”     70
14      03/23/2015  03/25/2015  “bicycle”  250
15      03/23/2015  03/25/2015  “bicycle”  280

 

The table contains 15 records and shows four attributes --- the order date and ship date of a particular product, the product that was purchased, and the purchase price. Note that there is a relationship between some of these columns --- in particular, the ship date is usually 1 or 2 days after the order date, and the price of each product is usually consistent across orders, although there may be slight variations in price depending on what coupons the customer used to make the purchase.

A column-store would likely use “run-length encoding” to compress the order date column. Since records are sorted by order date, this would compress the column to its near-minimum --- it can be compressed as (03/22/2015, 9); (03/23/2015, 6) --- which indicates that 03/22/2015 is repeated 9 straight times, followed by 03/23/2015 which is repeated 6 times. The ship date column, although not sorted, is still very compressible, as each value can be expressed using a small number of bits in terms of how much larger (or smaller) it is than the previous value in the column. However, the other two columns --- product and price --- would likely be compressed using a variant of dictionary compression, where each value is mapped to the minimal number of bits needed to represent it. For large datasets, where there are many unique values for price (or even for product), the number of bits needed to represent a dictionary entry is non-trivial, and the same dictionary entry is repeated in the compressed dataset for every repeated value in the original dataset.
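The three per-column schemes described above can be sketched in a few lines of Python over the example table. This is an illustrative sketch only: the function names are invented here, and a real column-store would pack the results into bit-level formats rather than Python lists.

```python
from datetime import date
from itertools import groupby
from math import ceil, log2

# Columns copied from the 15-record example table.
order_dates = [date(2015, 3, 22)] * 9 + [date(2015, 3, 23)] * 6
ship_days = [23, 24, 24, 23, 24, 23, 23, 23, 24, 24, 25, 24, 24, 25, 25]
ship_dates = [date(2015, 3, day) for day in ship_days]
products = ["bicycle", "lock", "tire", "lock", "bicycle", "bicycle", "tire", "lock",
            "bicycle", "lock", "bicycle", "bicycle", "tire", "bicycle", "bicycle"]

def run_length_encode(values):
    """Collapse each run of equal consecutive values into (value, run length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def delta_encode(dates):
    """Keep the first date, then each date's difference in days from its predecessor."""
    deltas = [(current - previous).days for previous, current in zip(dates, dates[1:])]
    return dates[0], deltas

def dictionary_encode(values):
    """Map each distinct value to a small integer code; report bits needed per code."""
    dictionary, codes = {}, []
    for value in values:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    bits_per_code = max(1, ceil(log2(len(dictionary))))
    return dictionary, codes, bits_per_code

order_runs = run_length_encode(order_dates)   # two runs: 9 x 03/22, 6 x 03/23
first_ship, ship_deltas = delta_encode(ship_dates)
product_dict, product_codes, product_bits = dictionary_encode(products)
```

Note that `product_codes` still holds one code per record: the dictionary shrinks each entry, but a repeated value still costs one code every time it appears, which is the limitation the paragraph above points out.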

In contrast, in RainStor, every unique value in the dataset is stored once (and only once), and every record is represented as a binary tree, where a breadth-first traversal of the tree enables the reconstruction of the original record. For example, the table shown above is compressed in RainStor using the forest of binary trees shown below. There are 15 binary trees (the 15 roots of these trees are shown as green circles at the top of the figure), corresponding to the 15 records in the original dataset.

Forest of Binary Trees Compression

For example, the binary tree corresponding to record 1 is shown on the left side of the figure. The root points to two children --- the internal nodes “A” and “E”. In turn, node “A” points to 03/22/2015 (corresponding to the order date of record 1), and to 03/23/2015 (corresponding to the ship date of record 1). Node “E” points to “bicycle” (corresponding to the product of record 1) and “300” (corresponding to the price of record 1).

Note that records 4, 6, 7, and 8 also have an order date of 03/22/2015 and a ship date of 03/23/2015. Therefore, the roots of the binary trees corresponding to those records also point to internal node “A”. Similarly, note that record 11 also is associated with the purchase of a bicycle for $300. Therefore, the root for record 11 also points to internal node “E”.

These shared internal nodes are what make RainStor’s compression algorithm fundamentally different from any algorithm that a column-store is capable of performing. Column-stores are forced to create dictionaries and search for patterns only within individual columns. In contrast, RainStor’s compression algorithm finds patterns across different columns --- identifying the relationship between ship date and order date and the relationship between product and price, and leveraging these relationships to share branches in the trees that are formed, thereby eliminating redundant information. RainStor thus has fundamentally more room to search for patterns in the dataset and compress data by referencing these patterns via the (compressed) location of the root of the shared branch.
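The sharing of leaves and internal nodes can be sketched in Python as follows. This is not RainStor's actual file format, which the post does not describe: the dictionaries, the integer ids, and the names `build_forest` and `reconstruct` are assumptions made purely to illustrate the idea on the example table.

```python
# The 15 records of the example table: order date, ship date, product, price.
TABLE = """
03/22/2015 03/23/2015 bicycle 300
03/22/2015 03/24/2015 lock 18
03/22/2015 03/24/2015 tire 70
03/22/2015 03/23/2015 lock 18
03/22/2015 03/24/2015 bicycle 250
03/22/2015 03/23/2015 bicycle 280
03/22/2015 03/23/2015 tire 70
03/22/2015 03/23/2015 lock 18
03/22/2015 03/24/2015 bicycle 280
03/23/2015 03/24/2015 lock 18
03/23/2015 03/25/2015 bicycle 300
03/23/2015 03/24/2015 bicycle 280
03/23/2015 03/24/2015 tire 70
03/23/2015 03/25/2015 bicycle 250
03/23/2015 03/25/2015 bicycle 280
"""
records = []
for line in TABLE.strip().splitlines():
    order_date, ship_date, product, price = line.split()
    records.append((order_date, ship_date, product, int(price)))

def build_forest(records):
    """Store every unique value and every unique pair of children exactly once."""
    values = {}    # unique value -> leaf id
    branches = {}  # unique (left leaf id, right leaf id) -> internal node id
    roots = []     # one (date node id, item node id) root per record

    def leaf(value):
        return values.setdefault(value, len(values))

    def branch(left, right):
        return branches.setdefault((left, right), len(branches))

    for order_date, ship_date, product, price in records:
        date_node = branch(leaf(order_date), leaf(ship_date))
        item_node = branch(leaf(product), leaf(price))
        roots.append((date_node, item_node))
    return values, branches, roots

def reconstruct(values, branches, root):
    """Walk one tree from its root down to its leaves to rebuild the record."""
    leaf_value = {leaf_id: value for value, leaf_id in values.items()}
    children = {node_id: pair for pair, node_id in branches.items()}
    record = []
    for node_id in root:
        left, right = children[node_id]
        record += [leaf_value[left], leaf_value[right]]
    return tuple(record)

values, branches, roots = build_forest(records)
```

On this table, the 60 cells of the original data reduce to 12 leaves, 9 shared internal nodes, and 15 roots. Records 1, 4, 6, 7, and 8 share the same date node, and records 1 and 11 share the same item node.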

For a traditional archiving solution, compression rate is arguably the most important feature (right up there with immutability). Indeed, RainStor’s compression algorithm enables it to be used for archival use-cases, and RainStor provides all of the additional features you would expect from an archiving solution: encryption, LDAP/AD/PAM/Kerberos/PCI authentication and security, audit trails and logging, retention rules, expiry policies, and integrated implementation of existing compliance standards (e.g. SEC 17a-4).

However, what brings RainStor to the next level in the archival solutions market is that it is an “active” archive, meaning that the data that is managed by RainStor can be queried at high performance. RainStor provides a mature SQL stack for native querying of compressed RainStor data, including ANSI SQL 1992 and 2003 parsers, and a full MPP query execution engine. For enterprises with Hadoop clusters, RainStor is fully integrated with the Cloudera and Hortonworks distributions of Hadoop --- RainStor compressed data files can be partitioned over an HDFS cluster, and queried in parallel with HiveQL (or MapReduce or Pig). Furthermore, RainStor integrates with YARN for resource management, with HCatalog for metadata management, and with Ambari for system monitoring and management.

The reason why most archival solutions are not “active” is that the compression algorithms used to reduce the data size before archival are so heavy-weight that significant processing resources must be invested in decompressing the data before it can be queried. Therefore, it is preferable to leave the data archived in compressed form, and only decompress it at times of significant need. In general, a user should expect significant query performance reductions relative to querying uncompressed data, in order to account for the additional decompression time.

The beauty of RainStor’s compression algorithm is that even though it gets compression ratios comparable to other archival products, its compression algorithm is not so heavy-weight that the data must be decompressed prior to querying it. In particular, the binary tree structures shown above are actually fairly straightforward to perform query operations on directly, without requiring decompression prior to access. For example, a count distinct or a group-by operation can be performed via a scan of the leaves of the binary trees. Furthermore, selections can be performed via a reverse traversal of the binary trees from the leaves that match the selection predicate. In general, since there is a one-to-one mapping of records in the uncompressed dataset to the binary trees in RainStor’s compressed files, all query operations can be expressed in terms of operations on these binary trees. Therefore, RainStor queries can benefit from the I/O improvement of scanning in less data (due to the smaller size of the compressed files on disk/memory) without paying the decompression cost to fully decompress these compressed files after they are read from storage. This leads to RainStor’s claims of 2X-100X performance improvement on most queries --- an industry-leading claim in the archival market.
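As a rough illustration of querying without decompressing, the Python sketch below runs a count distinct, a selection, and a group-by count directly over a tiny hand-built forest. Everything here is hypothetical: the four-record forest, the column tag on each leaf, the parent indexes, and the function names are assumptions for the example, not RainStor's implementation.

```python
from collections import defaultdict

# Hypothetical compressed form of four records. Each leaf holds one unique value
# (tagged with its column, an assumption of this sketch), each internal node
# pairs two leaves, and each root pairs a date node with an item node.
leaves = [
    ("order_date", "03/22/2015"),  # leaf 0
    ("ship_date", "03/23/2015"),   # leaf 1
    ("ship_date", "03/24/2015"),   # leaf 2
    ("product", "bicycle"),        # leaf 3
    ("price", 300),                # leaf 4
    ("product", "lock"),           # leaf 5
    ("price", 18),                 # leaf 6
]
branches = [(0, 1), (3, 4), (0, 2), (5, 6)]  # internal nodes: pairs of leaf ids
roots = [(0, 1), (2, 3), (0, 3), (2, 1)]     # records: pairs of internal node ids

# Parent indexes so that a traversal can start at a leaf and walk up to the roots.
leaf_parents = defaultdict(list)    # leaf id -> internal node ids
for node_id, children in enumerate(branches):
    for leaf_id in children:
        leaf_parents[leaf_id].append(node_id)
node_parents = defaultdict(list)    # internal node id -> record ids
for record_id, children in enumerate(roots):
    for node_id in children:
        node_parents[node_id].append(record_id)

def count_distinct(column):
    """Each unique value is stored once, so counting leaves counts distinct values."""
    return sum(1 for leaf_column, _ in leaves if leaf_column == column)

def select(column, value):
    """Reverse traversal: matching leaf -> parent internal nodes -> parent roots."""
    record_ids = set()
    for leaf_id, leaf in enumerate(leaves):
        if leaf == (column, value):
            for node_id in leaf_parents[leaf_id]:
                record_ids.update(node_parents[node_id])
    return sorted(record_ids)

def group_count(column):
    """Group-by count: one group per leaf of the column, sized by its matching roots."""
    return {value: len(select(leaf_column, value))
            for leaf_column, value in leaves if leaf_column == column}
```

For instance, `select("product", "lock")` finds the single “lock” leaf, follows it up to the one internal node that references it, and returns the two records whose roots point at that node, without ever materializing a full record.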

In short, RainStor’s strong claims around compression and performance are backed up by the technology that is used under the covers. Its compression algorithm is able to identify and remove redundancy both within and across columns. Furthermore, the resulting data structures produced by the algorithm are amenable to direct operation on the compressed data. This allows the compressed files to be queried at high performance, and positions RainStor as a leading active-archive solution.

_________________________________________________________________________


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

Funnel Analysis: an Approach from the Power Marketer Playbook

Posted on: March 31st, 2015 by Guest Blogger

 

Power marketers are always interested in the most effective ways to track, measure, and analyze customer experiences for more relevant engagement. I’d like to share an approach that is less known yet potentially quite powerful.

Businesses across global markets are re-thinking data, analytics, platforms, and research methods to better understand their customers. Event analytics offers a new view of the customer, leveraging the best technologies and diverse data sources, to obtain actionable insights in real time. Traditional methods help us understand consumers in terms of the following aspects: who, what, when, and where. Yet two of the most important questions for understanding consumers (“why” and “how”) remain unanswered. The answers are key to obtaining business value because they can help us understand the why and how of consumers’ interactions with a company.

Traditional approaches focus on how the customer looks to the business. For example, what do you buy? What segments are you in? When was your last visit? However, the more important question should be “how does the business look to the customer?” How do our customers experience our products and brands? How do customers feel at each touch point?

One major advantage of event analytics over traditional methods is that it can improve our understanding of the customer’s view of the business. Traditional systems are not designed to solicit, extract and stitch together customer experience data well. Event analytics obtains information about the entire customer experience in detail, threading together many sources of information from different applications that combine to deliver the full view of customer experience.

To conduct event analytics, businesses need to create a “customer experience universe” that stitches customers’ experiences together, allows for easy behavior pattern recognition and facilitates visualizations of customer behaviors. This universe includes social media, customer experience, marketing channels, mobile apps, and devices. Then, machine learning algorithms are used to run through all the data to identify patterns.

Event analytics is an ecosystem that includes, for example, streaming ingestion of events, an event repository, event metadata, a guided user interface for business analysts, and machine learning algorithms. One category of use cases is called funnel analytics, which helps us to understand customer behavioral patterns and what triggers their experiences.

Funnel analysis provides visibility across a series of customer experience events that lead towards a defined goal, say, from user engagement in a mobile app to a sale in an eCommerce platform. Funnel analyses are an effective way to calculate conversion rates on specific user behaviors, yet funnel analytics can be complex due to the difficulty in source categorization, visitor identification, pathing, attribution and conversion.
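As a simple illustration of the conversion-rate calculation, here is a small Python sketch. The event names, the tuple layout of the event log, and both function names are hypothetical, and the sketch sidesteps the harder problems named above (visitor identification, attribution, and so on) by assuming each event already carries a clean user id.

```python
def funnel_counts(events, steps):
    """Count how many users reached each funnel step, in the required order.

    events: iterable of (user_id, timestamp, event_name) tuples.
    steps:  ordered list of event names that define the funnel.
    """
    progress = {}  # user_id -> number of funnel steps completed so far
    for user_id, _, event_name in sorted(events, key=lambda event: event[1]):
        completed = progress.get(user_id, 0)
        if completed < len(steps) and event_name == steps[completed]:
            progress[user_id] = completed + 1
    return [sum(1 for completed in progress.values() if completed > step)
            for step in range(len(steps))]

def conversion_rates(counts):
    """Step-to-step conversion: users at step i+1 divided by users at step i."""
    return [after / before if before else 0.0
            for before, after in zip(counts, counts[1:])]

# Hypothetical event log: from engagement in a mobile app to a sale.
events = [
    ("u1", 1, "open_app"), ("u1", 2, "view_product"), ("u1", 3, "purchase"),
    ("u2", 1, "open_app"), ("u2", 4, "view_product"),
    ("u3", 2, "open_app"),
    ("u4", 1, "view_product"), ("u4", 2, "purchase"),  # skipped step 1: not counted
]
steps = ["open_app", "view_product", "purchase"]

counts = funnel_counts(events, steps)   # [3, 2, 1]
rates = conversion_rates(counts)        # two of three viewed, one of two purchased
```

Requiring the steps to occur in order is one design choice among several; a looser funnel might count a user at a step regardless of how they got there.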

Funnels can be built using a single guided user interface without needing to write code or move data. As a result, event analytics can scale at the speed of business. It is a smart analytic approach because it helps create visibility to the path that users are most likely to follow to achieve their goals.

The value of having this insight is of great significance since it gives marketers a deep, data-driven line of sight into the customer experience universe.

James Semenak


James Semenak is a Principal Consultant with Teradata – known as an evangelist and architect for Event Analytics as well as Big Data Analytics and strategies.  James consults in all things related to data and analytics around the internet, and has worked with Shutterfly, Expedia, eBay Enterprise, Charles Schwab, Nokia, eBay, PayPal, Real Networks, Overstock.com, Electronic Arts, and Meredith Corp.

 

 

Lots of Big Data Talk, Little Big Data Action

Posted on: February 11th, 2015 by Manan Goel

 

 Apps Are One Solution To Big Data Complexity

Offering big data apps is a great way for the analytics industry to put its muscle where its mouth is. Organizations face great hurdles in trying to benefit from the opportunities of big data.  Extracting rapid value from big data remains challenging.

To ease companies into realizing bankable big data benefits, Teradata has developed a collection of big data apps – pre-built templates that act as time-saving short cuts to value. Limited skill sets and complexity make it challenging for analytic professionals to rapidly and consistently derive actionable insights that can be easily operationalized.  Teradata is taking the lead in offering advanced analytic apps powered by Teradata Aster AppCenter to give sophisticated results from big data analytics.

The big data apps from Teradata are industry tailored analytical templates that address business challenges specific to the individual category. Purpose-built apps for retail address path to purchase and shopping cart abandonment.  Apps for healthcare map the paths to surgery and drug prescription affinity. Financial apps tackle omni-channel customer experiences and fraud.  The industries covered include consumer financial, entertainment and gaming, healthcare, manufacturing, retail, communications, travel and hospitality.

Big data apps are pre-built templates that can be further configured with help from Teradata professional services to address specific customer needs or goals.  Organizations have found that specialized big data analytic skills like Python, R, Java and MapReduce take time and require highly specialized manpower. Conversely, apps deliver fast time to value with self-service analytics. The purpose-built apps can be quickly deployed and configured/customized with minimal effort to deliver swift analytic value.

For app distribution, consumption and custom app development, the AppCenter makes big data analytics secure, scalable and repeatable by providing common services to build, deploy and consume apps.

With the apps and related solutions like AppCenter from Teradata, analytic professionals spend less time preparing data and more time doing discovery and iteration to find new insights and value.


 

 

Change and “Ah-Ha Moments”

Posted on: March 31st, 2014 by Ray Wilson

 

This is the first in a series of articles discussing the inherent nature of change and some useful suggestions for helping operationalize those “ah-ha moments.”

Nobody has ever said that change is easy. It is a journey full of obstacles. But those obstacles are not impenetrable, and with the right planning and communication many of them can be cleared away, making a more defined path for change to follow.

So why is it that we often see failures that could have been avoided had obvious changes been made before the problem occurred? The data was analyzed and yet nobody acted on these insights. Why does the organization fail to do what I call “operationalizing the ah-ha moment”? Was it a conscious decision?

From the outside looking in it is easy to criticize organizations for not implementing obvious changes. But from the inside, there are many issues that cripple the efforts of change, and it usually boils down to time, people, process, technology or financial challenges.  

Companies make significant investments in business intelligence capabilities because they realize that hidden within the vast amounts of information they generate on a daily basis, there are jewels to be found that can provide valuable insights for the entire organization. For example, with today's analytic platforms, business analysts in the marketing department have access to sophisticated tools that can mine information and uncover reasons for the high rate of churn occurring in their customer base. They might do this by analyzing all interactions and conversations taking place across the enterprise and the channels where customers engage the company. Using this data, analysts then begin to see various paths and patterns emerging from these interactions that ultimately lead to customer churn.

These analysts have just discovered the leading causes of churn within their organization and are at the apex of the ah-ha moment. They now have the insights to stop the mass exodus of valuable customers and positively impact the bottom line. It’s obvious these insights would be acted upon and operationalized immediately, but that may not be the case. Perhaps the recently discovered patterns leading to customer churn touch so many internal systems, processes and organizations that getting organizational buy-in to the necessary changes is mired down in an endless series of internal meetings.

So what can be done given these realities? Here’s a quick list of tips that will help you enable change in your organization:

  • Someone needs to own the change and then lead rather than letting change lead him or her.
  • Make sure the reasons for change are well documented including measurable impacts and benefits for the organization.
  • When building a change management plan, identify the obstacles in the organization and make sure to build a mitigation plan for each.
  • Communicate the needed changes through several channels.
  • Be clear when communicating change. Rumors can quickly derail or stall well thought out and planned change efforts.
  • When implementing changes make sure that the change is ready to be implemented and is fully tested.
  • Communicate the impact of the changes that have been deployed.  
  • Have enthusiastic people on the team and train them to be agents of change.
  • Establish credibility by building a proven track record that will give management the confidence that the team has the skills, creativity and discipline to implement these complex changes. 

Once implemented, monitor the changes closely and anticipate that some changes will require further refinement. Remember that operationalizing the ah-ha moment is a journey, one that can bring many valuable and rewarding benefits along the way.

So, what’s your experience with operationalizing your "ah-ha moment"?

The integration issue that dare not speak its name ….

Posted on: March 25th, 2014 by Patrick Teunissen

 

Having worked with multinational companies running SAP ERP systems for many years, I know that they (nearly) always have more than one SAP system to record their transactional data. Yet it is never discussed -- and it seems to be the 'Macbeth' of the SAP world, a fact that should not be uttered out loud…

My first experience with SAP's software solutions dates back to 1989 whilst at Shell Chemicals in the Netherlands, exactly 25 years ago. What strikes me most after all these years is that people talk about SAP as if it is one system covering everything that is important to business.

Undoubtedly SAP has had a huge impact on enterprise computing. I remember that at Shell, prior to the implementation of SAP, we ran a vast quantity of transaction systems. The purchasing and stock management systems, for example, were stand-alone and not integrated with the general ledger system. The integration of these transaction systems had to be done via interfaces, some of which were manual (information had to be typed over). At the month end, only after all interfaces had run, would the ledger show the proper stock value and accounts payable. So thanks to SAP the number of transaction systems has been dramatically reduced.

But of course the Shell Refining Company had its own SAP system, just like the businesses in the UK, Germany, etc. So in the late 80’s Shell ran numerous different SAP systems.

However, this contradicts one of SAP’s key messages: their ability to integrate all sorts of transactional information to provide relevant data for analytical purposes in one hypothetical system (reference Dr. Plattner’s 2011 Beijing speech).

I have always struggled with the definition of “relevant data” as I believe that what is relevant is dependent on 3 things: the user, the context and time. For an operator of a chemical plant, for example, the current temperature of the unit and product conversion yields are likely to be “relevant” as this is the data needed to steer the current process. For the plant director, the volumes produced and the overall processing efficiency of the last month may be “relevant” as this is what his peers in the management team will challenge him on. SAP systems are, as far as I know, not used to operate manufacturing plants, in which case the only conclusion can be that not all relevant data is in SAP. What you could say, though, is that it is very likely that the “accounting” data is in SAP, hence SAP could be the source for the plant’s management team reports.

 

However, when businesses are running multiple SAP systems, as described earlier, the conclusion cannot be that there is a (as in 1) SAP system in which all the relevant accounting data is processed. So a regional director responsible for numerous manufacturing sites may have to deal with data collected from multiple SAP systems when he/she needs to analyze the total costs of manufacturing of the last quarter. Probably because this does not really fit with SAP’s key message - one system for both transaction processing and analytics - they have no solution. I googled “analytics for multiple SAP systems”, the results of which are shown above. As you can see, other than the Teradata link there is no solution that will help our regional director. Even when the irrelevant words “analytics for” are removed, only very technical and specific solutions are found.

Some people believe that this problem with analytics will be solved over time. Quite a few larger enterprises start with what I call re-implementations of the SAP software. Five years after my first exposure to SAP at Shell Chemicals in the Netherlands I became a member of the team responsible for the “re-implementation” of the software for Shell’s European Chemicals business. Of course there were cost benefits (fewer SAP systems = lower operational cost for the enterprise) and some supply chain related transactions could be processed more efficiently from the single system. But the region was still not really benefitting from it, as the (national / legal) company in SAP is the most important object around which a lot has been organized (or configured). Hence most multinational enterprises use another software product into which data is interfaced for the purpose of regional consolidation.

I was employed by Shell for almost 10 years. It is a special company and I am still in contact with a few people that I worked with. The other day I asked about the SAP landscape as it is today and was told that, 25 years after my first SAP experience they are still running multiple SAP systems and re-implementation projects. As I consider myself an expert in SAP I am sure I could have built a career on the re-implementation of the SAP systems.

The point that I want to make with this post is that many businesses need to take into account that they run multiple SAP systems, and more importantly that these systems are not automatically integrated. This fact has a huge impact on the analytics of the SAP data and the work required to provide an enterprise view of the business. So if you are involved in the delivery of analytical solutions to the organization then you should factor “the Scottish play” issue into the heart of your design even if nobody else wants to talk about it.

Notes:

1 http://events.sap.com/sapphirenow/en/session/871 

2 This is why an appreciated colleague, a manufacturing consultant leader, always refers to SAP as the “Standard Accounting Package”.

3 In SAP the “Company” (T001-BUKRS) is probably the most important data object around which a lot has been organized (configured). Within SAP, consolidation of these “companies” is not an obvious thing to do. Extensions of the financial module (FI) designed to consolidate are difficult to operate and hardly ever used. Add to this the fact that almost every larger enterprise has multiple SAP systems, and the fact that consolidation takes place in “another” system is explained.

4 In 2007 SAP acquired OutlookSoft now known as SAP BPC (Business Planning & Consolidation) for this very purpose.

Teradata’s UDA is to Data as Prius is to Engines

Posted on: November 12th, 2013 by Teradata Aster

 

I’ve been working in the analytics and database market for 12 years. One of the most interesting pieces of that journey has been seeing how the market is ever-shifting. Both the technology and business trends during these short 12 years have massively changed not only the tech landscape today, but also the future evolution of analytic technology. From a “buzz” perspective, I’ve seen “corporate initiatives” and “big ideas” come and go. Everything from “e-business intelligence,” which was a popular term when I first started working at Business Objects in 2001, to corporate performance management (CPM) and “the balanced scorecard.” From business process management (BPM) to “big data”, and now the architectures and tools that everyone is talking about.

The one golden thread that ties each of these terms, ideas and innovations together is that each is aiming to solve the questions related to what we are today calling “big data.” At the core of it all, we are searching for the right way to enable the explosion of data and analytics that today’s organizations are faced with, to simply be harnessed and understood. People call this the “logical data warehouse”, “big data architecture”, “next-generation data architecture”, “modern data architecture”, “unified data architecture”, or (I just saw last week) “unified data platform”.  What is all the fuss about, and what is really new?  My goal in this post and the next few will be to explain how the customers I work with are attacking the “big data” problem. We call it the Teradata Unified Data Architecture, but whatever you call it, the goals and concepts remain the same.

Mark Beyer from Gartner is credited with coining the term “logical data warehouse” and there is an interesting story and explanation. A nice summary of the term is,

“The logical data warehouse is the next significant evolution of information integration because it includes ALL of its progenitors and demands that each piece of previously proven engineering in the architecture should be used in its best and most appropriate place. …”

And

“… The logical data warehouse will finally provide the information services platform for the applications of the highly competitive companies and organizations in the early 21st Century.”

The idea of this next-generation architecture is simple: When organizations put ALL of their data to work, they can make smarter decisions.

It sounds easy, but as data volumes and data types explode, so does the need for more tools in your toolbox to help make sense of it all. Within your toolbox, data is NOT all nails and you definitely need to be armed with more than a hammer.

In my view, enterprise data architectures are evolving to let organizations capture more data. The data was previously untapped because the hardware costs required to store and process the enormous amount of data were simply too high. However, the declining costs of hardware (thanks to Moore’s law) have opened the door for more data (types, volumes, etc.) and processing technologies to be successful. But no singular technology can be engineered and optimized for every dimension of analytic processing including scale, performance or concurrent workloads.

Thus, organizations are creating best-of-breed architectures by taking advantage of new technologies and workload-specific platforms such as MapReduce, Hadoop, MPP data warehouses, discovery platforms and event processing, and putting them together into a seamless, transparent and powerful analytic environment. This modern enterprise architecture enables users to get deep business insights and allows ALL data to be available to an organization, creating competitive advantage while lowering the total system cost.

But why not just throw all your data into files and put a search engine like Google on top? Why not just build a data warehouse and extend it with support for “unstructured” data? Because, in the world of big data, the one-size-fits-all approach simply doesn’t work.

Different technologies are more efficient at solving different analytical or processing problems. To steal an analogy from Dave Schrader—a colleague of mine—it’s not unlike a hybrid car. The Toyota Prius can average 47 mpg with hybrid (gas and electric) vs. 24 mpg with a “typical” gas-only car – almost double! But you do not pay twice as much for the car.

How’d they do it? Toyota engineered a system that uses gas when I need to accelerate fast (and also to recharge the battery at the same time), electric mostly when driving around town, and braking to recharge the battery.

Three components integrated seamlessly – the driver doesn’t need to know how it works. It is the same idea with the Teradata UDA, which is a hybrid architecture for extracting the most insights per unit of time – at least doubling your insight capabilities at reasonable cost. And business users don’t need to know all of the gory details. Teradata builds analytic engines, much like the hybrid drive train Toyota builds, that are optimized and used in combination with different ecosystem tools depending on customer preferences and requirements, within their overall data architecture.

In the case of the hybrid car, battery power and braking systems, which recharge the battery, are the “new innovations” combined with gas-powered engines. Similarly, there are several innovations in data management and analytics that are shaping the unified data architecture, such as discovery platforms and Hadoop. Each customer’s architecture is different depending on requirements and preferences, but the Teradata Unified Data Architecture recommends three core components for a comprehensive architecture – a data platform (often called a “Data Lake”), a discovery platform and an integrated data warehouse. There are other components such as event processing, search, and streaming which can be used in data architectures, but I’ll focus on the three core areas in this blog post.

Data Lakes

In many ways, a data lake is not unlike the operational data store we’ve seen between transactional systems and the data warehouse, but it is bigger and less structured. Any file can be “dumped” in the lake with no attention to data integration or transformation. New technologies like Hadoop provide a file-based approach to capturing large amounts of data without requiring ETL in advance. This enables large-scale processing to refine, structure, and explore data prior to downstream analysis in workload-specific systems, which are used to discover new insights and then move those insights into business operations for use by hundreds of end-users and applications.
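To make the “no ETL in advance” idea concrete, here is a minimal sketch in Python. It is not Hadoop and not any Teradata product; the names `raw_lake` and `read_with_schema` are hypothetical, and the records are made up for illustration. The point is only that raw lines are stored exactly as they arrive, and structure is applied later, when someone reads them.

```python
import json

# Hypothetical "lake": raw lines are kept as-is, with no upfront schema or ETL.
raw_lake = [
    '{"user": "a1", "event": "click", "ms": 120}',
    '{"user": "b2", "event": "view"}',   # a field is missing, stored anyway
    'not even json',                      # malformed, stored anyway
]

def read_with_schema(lines, fields):
    """Apply structure at read time; skip lines that cannot be parsed."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable under this schema, but still kept in the lake
        if not isinstance(record, dict):
            continue
        # Missing fields become None instead of blocking the load.
        rows.append({name: record.get(name) for name in fields})
    return rows

# The "refining" step: the reader decides which fields matter.
refined = read_with_schema(raw_lake, ["user", "event", "ms"])
```

A different reader could pass a different field list over the same raw lines, which is what lets exploration happen before anyone commits to a downstream structure.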

Discovery Platforms

Discovery platforms are a new kind of workload-specific system, optimized to perform multiple analytic techniques in a single workflow, combining SQL with statistics, MapReduce, graph, or text analysis to look at data from multiple perspectives. The goal is to ultimately provide more granular and accurate insights to users about their business. Discovery platforms enable a faster investigative analytical process to find new patterns in data and to identify types of fraud or consumer behavior that traditional data mining approaches may have missed.
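As a toy illustration of combining SQL with statistics in one workflow, the sketch below uses Python’s built-in `sqlite3` and `statistics` modules. It does not represent how any discovery platform is implemented; the `payments` table, the `flag_outliers` helper, the sample amounts and the threshold of 1.5 are all assumptions chosen for the example.

```python
import sqlite3
import statistics

# Hypothetical data: a few payments per account, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("a", 20.0), ("a", 22.0), ("a", 19.0), ("a", 21.0), ("a", 500.0),
     ("b", 40.0), ("b", 41.0)],
)

def flag_outliers(conn, account, threshold=1.5):
    """Return payments whose z-score within the account exceeds the threshold."""
    # Step 1 (SQL): select the account's payments in insertion order.
    cursor = conn.execute(
        "SELECT amount FROM payments WHERE account = ? ORDER BY rowid",
        (account,),
    )
    amounts = [row[0] for row in cursor]
    if len(amounts) < 2:
        return []  # a sample standard deviation needs at least two values
    # Step 2 (statistics): score each payment against the account's own history.
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)
    if stdev == 0:
        return []
    return [x for x in amounts if abs(x - mean) / stdev > threshold]

suspicious_a = flag_outliers(conn, "a")
suspicious_b = flag_outliers(conn, "b")
```

With this sample data, only the 500.0 payment on account "a" is flagged. A real investigation would layer further techniques (graph, text, path analysis) on the same data rather than rely on a single score.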

Integrated Data Warehouses

With all the excitement about what’s new, companies quickly forget the value of consistent, integrated data for reuse across the enterprise. The integrated data warehouse has become a mission-critical operational system which is the point of value realization or “operationalization” for information. The data within a massively parallel data warehouse has been cleansed, and provides a consistent source of data for enterprise analytics. By integrating relevant data from across the entire organization, companies achieve two key goals. First, they can answer the kind of sophisticated, impactful questions that require cross-functional analyses. Second, they can answer questions more completely by making relevant data available across all levels of the organization. Data lakes (Hadoop) and discovery platforms complement the data warehouse by enriching it with new data and new insights that can now be delivered to thousands of users and applications with consistent performance (i.e., they get the information they need quickly).

A critical part of incorporating these novel approaches to data management and analytics is putting new insights and technologies into production in reliable, secure and manageable ways for organizations.  Fundamentals of master data management, metadata, security, data lineage, integrated data and reuse all still apply!

The excitement of experimenting with new technologies is fading. More and more, our customers are asking us about ways to put the power of new systems (and the insights they provide) into large-scale operation and production. This requires unified system management and monitoring, intelligent query routing, metadata about incoming data and the transformations applied throughout the data processing and analytical process, and role-based security that respects and applies data privacy, encryption and other required policies. This is where I will spend a good bit of time in my next blog post.