One way to look at progress in technology is to recognize that each new generation provides a better version of what we’ve always wanted. If you look back at the claims for Hollerith punch card-based computing or the first generation of IBM mainframes, you find that the language is recognizable and can be found in marketing material for modern technology.

This year’s model of technology (and those from 50 or 100 years ago) will provide more efficiency, transparency, automation, and productivity. Yeehaw! I can’t wait. Oh, by the way, the current generation of big data technology will provide the same thing.

And, in fact, every generation of technology has fulfilled these enduring promises, improving on what was achieved in the past. What is important to understand is how. It is often the case that in emphasizing the “new newness” of what is coming down the pike, we forget about essential elements of value in the generation of technology that is being surpassed.

This pattern is alive and well in the current transformation taking place in the world of IT related to the arrival of big data technology, which is changing so many things for the better. The problem is that exaggeration about one new aspect of big data processing, “schema on read” (the ability to add structure at the last minute), is obscuring the need for a process to design and communicate a standard structure for your data, which is called “schema on write.”

Here’s the problem in a nutshell:
• In the past, the entire structure of a database was designed at the beginning of a project. The questions that needed to be answered determined the data that needed to be provided, and well-understood methods were created to model that data, that is, to provide structure so that the questions could be answered. The idea of “schema on write” is that you couldn’t really store the data until you had determined its structure.
• Relational database technology and the SQL language were used to answer the questions, which was a huge improvement over having to write a custom program to process each query.
• But as time passed, more data arrived and more questions needed to be answered. It became challenging to manage and change the model in an orderly fashion. People wanted to use new data and answer new questions faster than they could by waiting to get the model changed.

Okay, let’s stop and look at the good and the bad so far. The good is that structure allowed data to be used more efficiently. The more people who used the structure, the more value it created. So, when you have thousands of users asking questions and getting answers from thousands of tables, everything is super great. Taking the time to manage the structure and get it right is worth it. Schema on write is, after all, what drives business fundamentals, such as finance.

But the world is changing fast and new data is arriving all the time, which is not the strength of schema on write. If a department wants to use a new dataset, staff can’t wait through a long process of changing the central model before the new data becomes available. It’s not even clear whether every new source of data should be added to the central model. Unless a large number of people are going to use it, why bother? For discovery, schema on read makes excellent sense.

Self-service technologies such as spreadsheets and other data discovery tools are used to find answers in this new data. What is lost in this process is the fact that almost all of this data has structure that must be described in some way before the data is used. In a spreadsheet, you need to parse most data into columns. The end user or analyst does this sort of modeling, not the central keeper of the database, the database administrator, or some other specialist. One thing to note about this sort of modeling is that it is done to support a particular purpose, not thousands of users. In fact, adding this sort of structure to data is not generally thought of as modeling, but it is.
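To make the distinction concrete, here is a minimal, hypothetical sketch (the table and data feed are invented for illustration): schema on write declares the structure before any data is stored, while schema on read lands the raw data first and lets the analyst impose structure at the moment of analysis.

```python
import csv
import io
import sqlite3

# --- Schema on write: structure is declared before any rows are stored ---
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        amount_usd REAL NOT NULL,
        order_date TEXT NOT NULL
    )
""")
db.execute("INSERT INTO orders VALUES (1, 'Acme', 250.0, '2015-10-01')")

# Anyone can now query the shared structure with plain SQL.
total = db.execute("SELECT SUM(amount_usd) FROM orders").fetchone()[0]
print("Total order value:", total)

# --- Schema on read: raw data lands first, structure is applied at read time ---
raw_feed = "1|Acme|250.0|2015-10-01\n2|Globex|99.5|2015-10-02\n"

# The analyst decides, at analysis time, how to parse and interpret the fields.
reader = csv.reader(io.StringIO(raw_feed), delimiter="|")
amounts = [float(row[2]) for row in reader]  # column meaning chosen on read
print("Total from raw feed:", sum(amounts))
```

The first approach pays the modeling cost up front so thousands of users can share one structure; the second defers that cost to each reader, which is exactly what makes it attractive for one-off discovery.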

Schema on write drives the business forward. So, for big data, for any data, structure must be captured and managed. The most profound evidence of this is the way that all of the “born digital” companies such as Facebook, Netflix, LinkedIn, and Twitter have added large-scale SQL databases to their data platforms. These companies were forced to implement schema on write by the needs and scale of their businesses.

Schema on read leads to massive discoveries. Schema on write operationalizes them. They are not at odds; both contribute to the process of understanding data and making it useful. To make the most of all their data, businesses need both schema on read and schema on write.


Dan Woods is CTO and founder of CITO Research. He has written more than 20 books about the strategic intersection of business and technology. Dan writes about data science, cloud computing, mobility, and IT management in articles, books, and blogs, as well as in his popular column on

How Analytics Turns IoT Data into Dollars

Posted on: October 19th, 2015 by Chris Twogood


The buzz around the term “Internet of Things” (IoT) amplifies with each passing day. It’s taking some time, however, for everyone to fully comprehend just how valuable this phenomenon has become for our world and our economy. Part of this has to do with the learning curve in understanding the sophisticated technologies and analytics involved. But part of it is the sheer, staggering scope of value that’s possible worldwide. A comprehensive study in June 2015 by the McKinsey Global Institute, in fact, concluded that IoT is one of those rare technology trends where the “hype may actually understate the full potential.”

The Internet of Things is our constantly growing universe of sensors and devices that create a flood of granular data about our world. The “things” include everything from environmental sensors monitoring weather, traffic, or energy usage to “smart” household appliances and telemetry from production-line machines and car engines. These sensors are constantly getting smarter, cheaper, and smaller (many sensors today are smaller than a dime, and we’ll eventually see smart dust: thousands of small processors that look like dust and can be sprinkled on surfaces, swallowed, or poured).

Smart Analytics Drive IoT Value

As the volume and variety of sensors and other telemetry sources grow, the connections between them and the analytic needs grow as well, creating an IoT value curve that rises exponentially over time. IDC predicts the installed base of IoT connected things will reach more than 29.5 billion in 2020, with economic value-add across sectors by then topping $1.9 trillion. For all the focus on sensors and connections, however, the key driver of value is the analytics we can apply to reap insights and competitive advantage.

As we build better algorithms for the burgeoning IoT digital infrastructure, we are learning to use connection-based “smart analytics” to get very proactive in predicting future performance and conditions and even prescribing future actions. What if we could predict a machine failure before it ever happens? With advanced smart analytics today, we can. It’s called predictive maintenance, and it uses a probability-based Weibull distribution and other advanced techniques to gauge time-to-failure rates so a machine or device breakdown can be predicted before it occurs.
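As a rough sketch of the idea (not any vendor's actual model; the shape and scale parameters below are invented), the Weibull survival function lets us estimate the chance that a component which has already run for a given number of hours fails before its next inspection:

```python
import math

def weibull_survival(t_hours, shape, scale):
    """Probability that a component is still working after t_hours."""
    return math.exp(-((t_hours / scale) ** shape))

def prob_fail_next_interval(age_hours, interval_hours, shape, scale):
    """Conditional probability of failing within the next interval,
    given the component has survived to age_hours."""
    s_now = weibull_survival(age_hours, shape, scale)
    s_later = weibull_survival(age_hours + interval_hours, shape, scale)
    return 1.0 - (s_later / s_now)

# Hypothetical parameters, imagined as fitted from historical failure data.
SHAPE, SCALE = 2.5, 12_000.0   # shape > 1 indicates wear-out behavior

age = 9_000        # hours the device has already run
interval = 500     # hours until the next scheduled inspection

risk = prob_fail_next_interval(age, interval, SHAPE, SCALE)
print(f"Failure risk over next {interval} h: {risk:.1%}")
if risk > 0.10:
    print("Schedule maintenance before the next run.")
```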

One major provider of medical diagnostic and treatment machines has leveraged predictive maintenance to create “wearout models” for component parts in its products. This enabled early detection and identification of problems, as well as proactive root cause analysis to prevent downtime and unplanned outages. A large European train manufacturer, meanwhile, is leveraging similar techniques to prevent train engine failure. It’s a key capability that has enabled the firm to expand into the leasing market – a line of business that’s profitable only if your trains remain operational.

Building IoT Architectures

There is really no limit to how far we can take this alchemy of sensors, connections, and algorithms to create more and more complex systems and solutions to the problems facing businesses. But success remains impossible without the right analytics architectures in place. Most companies today still struggle to capitalize on and make use of all this IoT data.

Indeed, McKinsey’s June 2015 IoT report found that less than one percent of IoT data is currently used; and those uses tend to be straightforward things like alarm activation or real-time controls rather than advanced analytics that can help optimize business processes or make predictions.

Even the most tech-savvy businesses are now realizing that extracting value from the data is a difficult and skills-intensive process. Top priorities include intelligent “listening” to massive streams of IoT data to uncover distinctive patterns that may be signposts to valuable insights. We must ingest and propagate that data through an analytical ecosystem of advanced machine learning algorithms operating at scale to reap sophisticated, actionable insights.

Agility is key: Architectures need to follow multiple streams of sensor and IoT data in real time and deploy an agile central ingestion platform to economically and reliably listen to all relevant data. Architectures also should be configured to deploy advanced analytics – including machine learning, path, pattern, time series, statistics, graph, and text analytics – against massive volumes of data. The entire environment should be thoroughly self-service to enable rapid innovation with any new data set and avoid bogging down IT personnel in costly, requirements-driven custom projects.

These are the kinds of capabilities companies must pursue to economically spot and act upon new business opportunities made possible by the Internet of Things. It takes a good deal of investment and strategic planning, but the payoff in terms of analytic insights, competitive advantage, and future revenue is well worth it.

Simplifying SAP R/3 is irrelevant for users

Posted on: October 14th, 2015 by Patrick Teunissen


Part two of a series about an old ‘SAP’ dog who learns a new trick

Today more than ever, SAP is focused on technology (HANA) rather than data. When they do focus on data, they talk about simplifying it, because simplification is necessary to make said technology work better.

In SAP terms, simplification means fewer tables - a feat achieved by dropping aggregate tables and support structures. These are very important simplifications when dealing with expensive in-memory technology, because building up aggregates eats up processing capacity and fills the database with redundant data. That stuff I get, but in the grand scheme of data and analytics the discussions about simplicity and in-memory are irrelevant because they are small pieces of the analytics puzzle.

The continuing impediment to getting value from SAP is the integration of data. I’ve previously written about the fact that many large companies are running multiple SAP R/3 (ECC, ERP) systems. HANA running as a database under R/3 or S/4 does not solve this issue. It should finally make BW redundant - but I do not see how that will resolve the multiple ERP issue.

To take it a step further, Big Data (read: Non SAP Data) is becoming more important for analytical purposes. As big data grows exponentially, innovations like the logical data warehouse and Hadoop make it possible to store, integrate, and analyze all data for deeper insights.

The chart here clearly shows that the share of SAP R/3 data that is relevant for decision making decreases over time. This means the data needed for today’s (and tomorrow’s) analytics is increasingly reliant on non-SAP sources. Again, I don’t see how HANA or S/4 solves this integration issue.

Note: That does not mean SAP R/3 has become irrelevant - to the contrary, see my previous blog - but people should not believe a simplified, faster-running R/3 (or S/4) is enough for analytics today. Next I will write about the value of integrating CRM data with SAP R/3. Watch for the next blog in this series in the next day or two.


Analytic R&D – Blazing New Trails

Posted on: October 13th, 2015 by DSG Staff


By Vinnie Dessecker, senior consultant, big data – Strategy and Governance Center of Excellence

There is an ever-increasing need for businesses to engage in analytic innovation – exploring new, disparate, and more data to gain insights into new products, services, or other opportunities for an organization and its customers. Analytic innovation is really about seeing where the data takes you – determining in a scientific manner what actions can be predicted based on past performance. Much of the demand for analytic innovation is being driven by the era of big data – the availability of new data sources provides new and previously unimagined insights.

So, what does it really mean to be innovative? What constitutes analytic R&D, and how does this discovery capability relate to the other components of the data strategy? Innovation is synonymous with risk taking: an idea or hypothesis must ultimately be replicable at an economical cost and must satisfy a specific need – stated or unstated. To create an environment where innovation can flourish, an organization must create a culture that encourages exploration and risk taking, and accepts and even welcomes failures.

“We made too many wrong mistakes.”

Yogi Berra

As I hike the hills of Southern California, I’m accustomed to following the trail maps, but sometimes on a leisurely hike it’s worthwhile to venture off the beaten path. Frequently, that detour leads to a dead end – perhaps a good place to have a snack, but ultimately I’ve got to get back on the trail. Sometimes it leads to the discovery of a beautiful canyon or cave I didn’t know was there, or it could yield a shortcut that makes my journey to the top a little faster or more interesting. This exploration is not a matter of deciding to take a hike without a trail map or to venture off the trail or out of the park (remember, there are lots of snakes in those hills)! Rather, it’s a desire to explore the unknown to see what I can discover beyond the obvious or familiar. If I discover a really interesting trail, I might add it to my hiking options and revisit it on a regular basis. If I don’t discover anything new, or discover that a deviation leads to something undesirable, I simply won’t do that again, and my discovery is at least that valuable to me.

Effective Analytic R&D

This is a very exciting time for analytics – the landscape is changing and evolving every day. Dynamic changes and big challenges exist for all organizations. We should strive not just for innovation, but business agility – the ability to make data-driven decisions faster and with more confidence. Ultimately, the goal is to increase revenue and decrease costs, while meeting the customer’s ever-increasing demands for personalized products and services.

In business, an analytic R&D environment is a key component of innovation. An analytic R&D environment supports rapid experimentation and evaluation of data with less formality in data management rules than is applied to production analytics. While this is a discovery zone, and by its nature meant to be less restricted by rules, the key to success is to apply the right amount of governance and structure. No matter what spontaneous choices I make on the hiking trail, I don’t disregard the basic rules of safety, environmental responsibility, or common sense.

Effective R&D data management and governance practices allow for exploration but strive to create order from the chaos that can ensue and drive a culture of innovation. These practices consider the iterative and explorative nature of research and development, understanding that new discoveries are sometimes born from previously “failed” endeavors. In fact, the failures are required to develop the new insights and hypotheses – how many attempts did Thomas Edison make before he developed the light bulb?

An effective data strategy has a path for both production and R&D analytics. And, when the R&D effort yields gold, there must be a path back to the production environments and a way to incorporate the innovation into the pipeline of projects that make up the production portfolio. For instance, identify the business processes that need to be modified and the individuals who should be trained to make appropriate use of the data-driven insight.

How do you strike a balance between too much control and too little? The goal should always be to preserve the value of the data and ensure that the customers (internal and external) have confidence in the data irrespective of the source or data type. Some of the questions that must be answered include:

  • Infrastructure and Platform – can the data be accessed irrespective of where it is stored? Can data from different platforms be analyzed without arduous and time-consuming data integration efforts?
  • Data Architecture – is the data (including unstructured and semi-structured data) understood within the context of the broader enterprise and the supported business initiatives? Has the data been modeled? Is it loosely coupled or tightly coupled data?
  • Data Quality – is the data fit for purpose? Can you measure and report on that data quality? For certain data types (e.g., social media records), is there a lower acceptable quality standard?
  • Master Data – does the data need to be mastered; i.e., a single “golden record” created? Does the data need to be combined with mastered data; e.g., is it necessary to integrate social media data with a customer record to analyze customer satisfaction?
  • Metadata – is it possible to report on data lineage, describing where the data originated and how it was transformed in its journey to analytics? How much definition needs to be applied to the data to facilitate effective self-service?
  • Data Integration – does the data used for R&D need to be integrated? Is the requirement for batch or real-time, or something in between? Can it be integrated at the time of the analytics; e.g., dynamically modeled? Is self-provisioning an option?
  • Data Security and Privacy – how much data security is required? Do the same privacy rules apply as those in the production analytics environments? How damaging is a data breach to the organization?
  • Program and Project Management – are there ways to fund and measure the R&D projects that are consistent with the goals for the program and the business initiatives supported? Are there appropriate gating processes in place; i.e., when do you know that a hypothesis is not providing the business value anticipated? How can you build on previous “failures” when appropriate – does that include sharing the hypotheses, the data, or the techniques applied?

There are new data sources, new technologies, and new skills being developed to exploit these opportunities. But, as with most changes that we have seen over the last 30 years, the answer to addressing the opportunity leads back to traditional concepts and topics. We don’t simply throw out everything we have learned over the years and start again with each new technological advance. That would be a little like discovering a new potential path to the top of the hill and deciding that going forward we didn’t need the same things we used previously to climb the hill – throw out the shoes, the trail map, the water! Throw out the preparation and planning and production quality processes – just start moving! That’s not innovation or business agility, and it’s certainly not progress.

Innovation can flourish when we understand our data strategy – our vision for the organization required to meet the business initiatives – and apply the appropriate management and governance controls, building on what we know works, and leveraging new techniques and technologies.

Vinnie Dessecker is a senior consultant for big data within Teradata’s Strategy and Governance Center of Excellence. Her work aligns business and technical goals for data and information initiatives, including master data management, metadata management, data quality, content/document management, and the analytic roadmap and business intelligence ecosystems.

A Vision for an Analytic Infrastructure

Posted on: October 12th, 2015 by Guest Blogger


by Dan Woods

An analytic infrastructure can be much like Mark Twain’s definition of a classic: “A book which people praise and don’t read.” An analytics platform is often referred to but rarely architected. Business analysts and data scientists often talk about the power of analytics without talking about the end game. But to make any progress in the big data world – and remain competitive – companies must change the way they think about analytics and implement an analytic infrastructure.

The current approach to big data analytics is simply unsustainable. For each business question that arises, IT builds a custom application. This application-centric approach results in many silos modeled after the operational source. Users can get answers against that silo’s set of data, but they can’t get answers from data across multiple platforms. As a result, data must be constantly moved in and out of the applications, and each application must be maintained.

It behooves you to think about what you want to achieve with analytics. Most companies today want to become data-driven organizations. In order to do so, however, analytics must be scalable and sustainable so that every department has access to the information it needs to make decisions based on data. An application-centric approach is neither scalable nor sustainable. So how can analytics be made more productive for everyone involved, and how can it scale across the entire organization? Instead of hardwiring analytics into an application, you need to find a way to:

  • Apply the right analytic to the right data integration type. Instead of building an application that is essentially a black box, we need a platform that can reach out for all the needed data and then apply the analytic. This approach minimizes data movement and data duplication.
  • Leverage multiple analytic techniques to get insights. You need to build applications in which data is loosely coupled, thereby creating just enough structure to answer frequently asked questions while expanding access to analytics across the organization.
  • Provide self-service analytics for all skill levels. R programming shouldn’t be a requirement for performing analytics. You need an analytics platform that supports a spectrum of users, from data scientists to business analysts.

The key to enabling these objectives is to make data reusable so that it can be available to as many analytics processes as possible. That means proactively thinking about whether a piece of data will be needed to answer more than one question in the future and understanding where you are in your big data journey. You can’t assume that all data will go into a tightly controlled model, like a data warehouse. If you model data using different types of integration based on what you understand about that data, you create a foundation so that the next time you need to answer a question with that data, you can do so more easily.

In the past, companies had a tendency to over-model and over-integrate their data. Not only was this a waste of money, but it led to an architecture that was difficult to change. Today, companies have the opposite problem: under-modeling and under-integrating. This increases both costs and complexity. A better approach is to invest in tightly coupled integration for high-value data that will be used at scale. Keep other data, of varying levels of maturity, either loosely coupled or non-coupled.

Taking this approach to building an analytic infrastructure will help you:

  • Meet new needs faster. As the infrastructure grows, the “nervous system” will become more powerful and more easily adapted to meet new needs.
  • Decrease the cost and complexity of the infrastructure. Avoiding application-centric silos will reduce the cost and complexity of analytics.
  • Increase productivity. By investing time upfront to make analytics easier for various skill sets, more people can benefit. In addition, new data can be integrated at a lower cost since time and money are not being wasted over-modeling data that will not be used.

This is a new way of thinking about data, and it may be foreign to many companies. But it is a solid vision for something rarely spoken about but which is necessary for becoming a data-driven organization – an analytic infrastructure. If you’re serious about analytics, then it’s worth working with an experienced advisor who can help you make such an infrastructure a reality.

Dan Woods is CTO and founder of CITO Research. He has written more than 20 books about the strategic intersection of business and technology. Dan writes about data science, cloud computing, mobility, and IT management in articles, books, and blogs, as well as in his popular column on


News: Teradata Database on AWS

Posted on: October 7th, 2015 by Guest Blogger


I’m very excited about Teradata’s latest cloud announcement – that Teradata Database, the market’s leading data warehousing and analytic solution, is being made available for cloud deployment on Amazon Web Services (AWS) to support production workloads. Teradata Database on AWS will be offered on a variety of multi-terabyte EC2 (Elastic Compute Cloud) instances in supported AWS regions via a listing in the AWS Marketplace.

I believe the news is significant because it illustrates a fundamentally new deployment option for Teradata Database, which has long been the industry’s most respected engine for production analytics. By incorporating AWS as the first public cloud offering for deploying a production Teradata Database, we will make it easier for companies of all sizes to become data-driven with best-in-class data warehousing and analytics.

Teradata Database on AWS will be quite different from what we currently offer. It will be the first time that Teradata Database is being offered for production workloads on the public cloud. Previous products such as Teradata Express and Teradata Virtual Machine Edition have been positioned for test and development or sandbox use cases. This is also the first time that Teradata Database has been optimized for the AWS environment; previously, Teradata Virtual Machine Edition was optimized only for the VMware virtualized environment.

Fear not: Teradata Cloud, our purpose-built managed environment, will continue to evolve, thrive, and prosper alongside Teradata Database on AWS. Customers of all sizes and verticals – both existing and new – find great value and convenience in Teradata Cloud for a variety of use cases, such as production analytics, test and development, quality assurance, data marts, and disaster recovery. In fact, we recently announced that Core Digital Media is using Teradata Cloud for DR.

So who is the intended customer base for Teradata Database on AWS? Well, literally any organization of any size can use and benefit from it, because it combines the power of Teradata with the convenience of AWS for a win-win value proposition. I think large, existing Teradata integrated data warehouse and Teradata appliance customers will find it just as flexible and capable as small, new-to-Teradata customers do. It opens up a whole new market in terms of accessibility and deployment flexibility.

So one might ask: Why has Teradata not offered this product before now? Frankly, cloud computing has evolved substantially over the past few years in terms of convenience, security, performance, and market adoption. We see substantial opportunity to provide more deployment options than previously existed as well as expand the addressable market for production data warehousing and analytics.

And, while there are many other companies that offer database, data warehouse, and analytic products, none has the track record of market leadership that Teradata has; it is in a league of its own. The key thing to know is that this is Teradata Database, the market’s leading data warehouse and analytic engine for over three decades. The same software and the same winning DNA that power our existing portfolio of products drive this offering too.

Teradata Database on AWS will be available in Q1 2016 for global deployment. It will be offered on a variety of EC2 instance types up to multiple terabytes via a listing in the AWS Marketplace. Customers will be able to deploy standalone on AWS or use it to complement both on-premises and Teradata Cloud environments.

I think it’s going to be great, and I hope you do too. Feel free to let me know if you have any questions about our latest cloud innovation.

Brian Wood

Brian Wood is director of cloud marketing at Teradata. He is a knowledgeable technology marketing executive with 15+ years of digital, lead gen, sales / marketing operations & team leadership success. He has an MS in Engineering Management from Stanford University, a BS in Electrical Engineering from Cornell University, and served as an F-14 Radar Intercept Officer in the US Navy.


Why Should Big Data Matter to You?

Posted on: September 15th, 2015 by Marc Clark


With all the attention given to big data, it is no surprise that more companies feel pressure to explore the possibilities for themselves. The challenge for many has been the high barriers to entry. Put simply, big data has cost big bucks. Maybe even more perplexing has been uncertainty about just what big data might deliver for a given company. How do you know if big data matters to your business?

The advent of cloud-based data warehouse and analytics systems can eliminate much of that uncertainty. For the first time, it is possible to explore the value proposition of big data without the danger of drowning the business in the costs and expertise needed to get big data infrastructure up and running.


Subscription-based models replace the need to purchase expensive hardware and software with the possibility of a one-stop-shopping experience where everything—from data integration and modeling tools to security, maintenance and support—is available as a service. Best of all, the cloud makes it feasible to evaluate big data regardless of whether your infrastructure is large and well-established with a robust data warehouse, or virtually nonexistent and dependent on numerous Excel worksheets for analysis.

Relying on a cloud analytics solution to get you started lets your company test use cases, find what works best, and grow at its own pace.

Why Big Data May Matter

Without the risk and commitment of building out your own big data infrastructure, your organization is free to explore the more fundamental question of how your data can influence your business. To figure out if big data analytics matters to you, ask yourself and your company a few questions:

  • Are you able to take advantage of the data available to you in tangible ways that impact your business?
  • Can you get answers quickly to questions about your business?
  • Is your current data environment well integrated, or a convoluted and expensive headache?

For many organizations, the answer to one or more of these questions is almost certainly a sore point. This is where cloud analytics offers alternatives, giving you the opportunity to galvanize operations around data instead of treating data and your day-to-day business as two separate things. The ultimate promise of big data is not one massive insight that changes everything. The goal is to create a ceaseless conveyor belt of insights that impact decisions, strategies, and practices up, down, and across the operational matrix.

The Agile Philosophy for Cloud Analytics

We use the word agile a lot, and cloud analytics embraces that philosophy in important new ways. In the past, companies have invested a lot of time, effort, and money in building infrastructure to integrate their data and create models. Then they find themselves trapped in an environment that doesn’t suit their requirements.

Cloud analytics provides a significant new path. It's a manageable approach that enables companies to get to important questions without bogging down in technology, and to really figure out what value is lurking in their data and what its impact might be.

To learn more, download our free Enterprise Analytics in the Cloud eBook.

Big Data Success Starts With Empowerment: Learn Why and How

Posted on: September 1st, 2015 by Chris Twogood


As my colleague Bill Franks recently pointed out on his blog, there is often the perception that being data-driven is all about technology. While technology is indeed important, being data-driven actually spans a lot of different areas, including people, big data processes, access, a data-driven culture and more. In order to be successful with big data and analytics, companies need to fundamentally embed it into their DNA.

To be blunt, that level of commitment simply must stem from the top rungs of any organization. This was evident when Teradata recently surveyed 316 senior data and IT executives. The commitment to big data was far more apparent at companies where CEOs personally focus on big data initiatives, as over half of those respondents indicated it as the single most important way to gain a competitive advantage.

Indeed, industries with the most competitive environments are the ones leading the analytics push. These companies simply must find improvements, even if the needle is only being moved in the single digits with regards to things like operational costs and revenue.

Those improvements don’t happen without proper leadership, especially since a data-driven focus impacts just about all facets of the business -- from experimentation to decision-making to rewarding employees. Employees must have access to big data, feel empowered with regards to applying it and be confident in their data-driven decisions.

In organizations where being data-driven isn’t embedded in the DNA, someone may go make a decision and attempt to leverage a little data. But, if they don’t feel empowered by the data’s prospects and aren’t confident in the data, they will spend a lot of cycles seeking validation. A lot of time will be spent simply attempting to ensure they have the right data, the accurate data, that they are actually making the right decision based on it and that they will be backed up once that decision is made.

There is a lot of nuance with regards to being data-driven, of course. While all data has value, there are many levels to that value – the challenge generally lies in recognizing that value and extracting it. Our survey confirmed, for instance, just how hot location data is right now, as organizations work to understand the navigation of their customers in order to deliver relevant communication.

Other applications of data, according to the survey, include the creation of new business models, the discovery of new product offers, and the monetization of data to external companies. But that’s just the tip of the iceberg. Healthcare, for example, is an up-and-coming industry with regards to data usage. An example is better understanding the path to surgery -- breaking down the four or five steps most important to achieving a better patient outcome.

But whether you’re working in a hospital or a hot startup, and working to carve out more market share or improve outcomes for patients, the fundamentals we’ve been discussing here remain the same. Users must be empowered and confident in order to truly be data-driven -- and they’re not going to feel that way unless those at the top are leading the way.



By Imad Birouty, Teradata Product Marketing Manager

In-memory database processing is a hot topic in the market today. It promises to bring high performance to OLTP and Data Warehouse environments.  As such, many vendors are working hard to develop in-memory database technology.

Memory is fast, but still expensive when compared to disk storage. As such, it should be treated as a precious resource and used wisely for the best return on your investment.

Teradata Intelligent Memory does just that. Through advanced engineering techniques, the Teradata Database automatically places the most frequently accessed data in memory, delivering in-memory performance with the cost economics of disk storage. The 80/20 rule and proven real-world data warehouse usage patterns show that a small percentage of the data accounts for the vast majority of data access. Teradata Database’s unique multi-temperature data management infrastructure makes it possible to leverage this and keep only the most frequently used data in memory to achieve in-memory performance for the entire database. This is cutting-edge technology, and it does not require a separate, dedicated in-memory database to manage. And because it's built into the Teradata Database, companies get the scalability, manageability, and robust features associated with the Teradata Database.
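A toy simulation (illustrative only, not Teradata's placement algorithm) shows why this works: under a skewed, Zipf-like access pattern, keeping just the hottest 20 percent of data blocks in memory serves the large majority of accesses.

```python
import random

random.seed(7)

NUM_BLOCKS = 10_000        # data blocks in the warehouse
MEMORY_FRACTION = 0.20     # assume only 20% of the blocks fit in memory

# Skewed, Zipf-like access pattern: block 0 is the hottest, the last block the coldest.
weights = [1.0 / (rank + 1) for rank in range(NUM_BLOCKS)]
accesses = random.choices(range(NUM_BLOCKS), weights=weights, k=200_000)

# Keep the hottest 20% of blocks in memory, as a temperature-aware placement would.
hot_cutoff = int(NUM_BLOCKS * MEMORY_FRACTION)
hits = sum(1 for block in accesses if block < hot_cutoff)

print(f"Share of accesses served from memory: {hits / len(accesses):.1%}")
```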

Forrester Research just released its inaugural Wave dedicated to in-memory databases, The Forrester Wave™: In-Memory Database Platforms, Q3 2015, naming Teradata a leader. Teradata has always been a pioneer in scalable, disk-based, shared-nothing RDBMS technology. Because it has continued to evolve, change, and incorporate the latest technologies, the Teradata Database is now a leader in in-memory database processing too.

While the Forrester Wave evaluated Teradata Database 15.0, we are even more excited about Teradata Database 15.10, which utilizes even more advanced in-memory techniques integrated into the Teradata Database. New in-memory accelerators such as pipelining, vectorization, bulk qualification, and columnar storage bring in-memory performance to all data in the warehouse, including multi-structured data types such as JSON and weblogs, which are associated with Big Data.

A free copy of the Forrester Wave report is available here, as well as today’s news release here. 

We’ll be announcing availability of Teradata Database 15.10 in a few weeks, so look for that announcement.


Why We Love Presto

Posted on: June 24th, 2015 by Daniel Abadi


Concurrent with acquiring the Hadoop companies Hadapt and Revelytix last year, Teradata opened the Teradata Center for Hadoop in Boston. Teradata recently announced that a major new initiative of this Hadoop development center will include open-source contributions to a distributed SQL query engine called Presto. Presto was originally developed at Facebook and is designed to run high-performance, interactive queries against Big Data wherever it may live: Hadoop, Cassandra, or traditional relational database systems.

Among those who will be part of this initiative and contribute code to Presto is a subset of the Hadapt team that joined Teradata last year. In the following, we will dive deeper into the thinking behind this new initiative from the perspective of the Hadapt team. It is important to note upfront that Teradata’s interest in Presto, and the people contributing to the Presto codebase, extends beyond the Hadapt team that joined Teradata last year. Nonetheless, it is worthwhile to understand the technical reasoning behind the embrace of Presto from Teradata, even if it presents a localized view of the overall initiative.

Around seven years ago, Ashish Thusoo and his team at Facebook built the first SQL layer over Hadoop as part of a project called Hive. At its essence, Hive was a query translation layer over Hadoop: it received queries in a SQL-like language called Hive-QL and transformed them into a set of MapReduce jobs over data stored in HDFS on a Hadoop cluster. Hive was truly the first project of its kind. However, since its focus was on query translation into the existing MapReduce query execution engine of Hadoop, it achieved tremendous scalability but poor efficiency and performance, which ultimately led to a series of subsequent SQL-on-Hadoop solutions that claimed 100X speed-ups over Hive.

Hadapt was the first such SQL-on-Hadoop solution that claimed a 100X speed-up over Hive on certain types of queries. Hadapt was spun out of the HadoopDB research project from my team at Yale and was founded by a group of Yale graduates. The basic idea was to develop a hybrid system that is able to achieve the fault-tolerant scalability of the Hive MapReduce query execution engine while leveraging techniques from the parallel database system community to achieve high performance query processing.

The intention of HadoopDB/Hadapt was never to build its own query execution layer. The first version of Hadapt used a combination of PostgreSQL and MapReduce for distributed query execution. In particular, the query operators that could be run locally, without reliance on data located on other nodes in the cluster, were run using PostgreSQL’s query operator set (although Hadapt was written such that PostgreSQL could be replaced by any performant single-node database system). Meanwhile, query operators that required data exchange between multiple nodes in the cluster were run using Hadoop’s MapReduce engine.

Although Hadapt was 100X faster than Hive for long, complicated queries that involved hundreds of nodes, its reliance on Hadoop MapReduce for parts of query execution precluded sub-second response time for small, simple queries. Therefore, in 2012, Hadapt started to build a secondary query execution engine called “IQ” which was intended to be used for smaller queries. The idea was that all queries would be fed through a query-analyzer layer before execution. If the query was predicted to be long and complex, it would be fed through Hadapt’s original fault-tolerant MapReduce-based engine. However, if the query would complete in a few seconds or less, it would be fed to the IQ execution engine.
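A hypothetical sketch of that query-analyzer layer (the cost model, names, and threshold below are invented, not Hadapt's actual logic) looks roughly like this:

```python
# Toy "query-analyzer layer": route a query to the interactive engine when its
# predicted runtime is short, otherwise send it to the fault-tolerant
# MapReduce-based engine. All names and the cost model are hypothetical.

INTERACTIVE_THRESHOLD_SECONDS = 5.0

def predict_runtime_seconds(plan):
    # Stand-in for a real cost model: charge for estimated input rows
    # and for each operator in the plan.
    return plan["estimated_rows"] / 1_000_000 + 0.5 * len(plan["operators"])

def route(plan):
    if predict_runtime_seconds(plan) <= INTERACTIVE_THRESHOLD_SECONDS:
        return "IQ (interactive engine)"
    return "MapReduce engine (fault-tolerant, for long queries)"

small_plan = {"estimated_rows": 2_000_000, "operators": ["scan", "filter", "aggregate"]}
large_plan = {"estimated_rows": 900_000_000, "operators": ["scan", "join", "aggregate", "sort"]}

print(route(small_plan))  # -> IQ (interactive engine)
print(route(large_plan))  # -> MapReduce engine (fault-tolerant, for long queries)
```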

In 2013, Hadapt integrated IQ with Apache Tez in order to avoid redundant programming effort, since the primary goals of IQ and Tez were aligned. In particular, Tez was designed as an alternative to MapReduce that can achieve interactive performance for general data processing applications. Indeed, Hadapt was able to achieve interactive performance on a much wider range of queries when leveraging Tez than it could previously.

Figure 1: Intertwined Histories of SQL-on-Hadoop Technology

Unfortunately, Tez was not quite a perfect fit as a query execution engine for Hadapt’s needs. The largest issue was that, before shipping data over the network during distributed operations, Tez first writes this data to local disk. The overhead of writing this data to disk (especially when the size of the intermediate result set was large) precluded interactivity for a non-trivial subset of Hadapt’s query workload. A second problem is that the Hive query operators implemented over Tez use (by default) traditional Volcano-style row-by-row iteration. In other words, a single function invocation for a query operator processes just a single database record. This results in a large number of function calls to process a large dataset, and poor instruction cache locality, as the instructions associated with a particular operator are repeatedly reloaded into the instruction cache for each function invocation. Although Hive and Tez have started to alleviate this issue with the recent introduction of vectorized operators, Hadapt still found that query plans involving joins or SQL functions would fall back to row-by-row iteration.
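The difference is easy to see in a simplified sketch (plain Python, not Presto or Tez code): the row-at-a-time version pays function-call and iterator overhead for every record, while the batch-at-a-time version amortizes that overhead across thousands of rows per call.

```python
import time

rows = list(range(2_000_000))

def is_match(r):
    return r % 7 == 0

# Volcano-style: each operator is an iterator that produces one row per call.
def scan_rows(data):
    for row in data:
        yield row

def filter_rows(child, predicate):
    for row in child:
        if predicate(row):
            yield row

start = time.perf_counter()
count_rows = sum(1 for _ in filter_rows(scan_rows(rows), is_match))
print(f"row-at-a-time:   {count_rows} rows in {time.perf_counter() - start:.2f}s")

# Vectorized: operators exchange batches, amortizing the per-row handoff overhead.
BATCH_SIZE = 4_096

def scan_batches(data, batch_size=BATCH_SIZE):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

def filter_batches(child, predicate):
    for batch in child:
        yield [row for row in batch if predicate(row)]

start = time.perf_counter()
count_batches = sum(len(batch) for batch in filter_batches(scan_batches(rows), is_match))
print(f"batch-at-a-time: {count_batches} rows in {time.perf_counter() - start:.2f}s")
```

In a real engine the batch form also enables the instruction cache locality and CPU efficiency described above, which is the point of vectorized operators.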

The Hadapt team therefore decided to refocus its query execution strategy (for the interactive query part of Hadapt’s engine) on Presto, which presented several advantages over Tez. First, Presto pipelines data between distributed query operators directly, without writing to local disk, significantly improving performance for network-intensive queries. Second, Presto query operators are vectorized by default, thereby improving CPU efficiency and instruction cache locality. Third, Presto dynamically compiles selective query operators to byte code, which lets the JVM optimize and generate native machine code. Fourth, it uses direct memory management, thereby avoiding Java object allocation, heap memory overhead, and garbage collection pauses. Overall, Presto is a very advanced piece of software, and very much in line with Hadapt’s goal of leveraging as many techniques from modern parallel database system architecture as possible.

The Teradata Center for Hadoop has thus fully embraced Presto as the core of its technology strategy for the execution of interactive queries over Hadoop. Consequently, it made logical sense for Teradata to take its involvement in the Presto project to the next level. Furthermore, Hadoop is fundamentally an open source project, and in order to become a significant player in the Hadoop ecosystem, Teradata needs to contribute meaningful and important code to the open source community. Teradata’s recent acquisition of Think Big serves as further motivation for such contributions.

Therefore, Teradata has announced that it is committed to making open source contributions to Presto and has allocated substantial resources to doing so. Presto is already used by Silicon Valley stalwarts Facebook, AirBnB, NetFlix, DropBox, and Groupon. However, Presto’s enterprise adoption outside of Silicon Valley remains small. Part of the reason for this is that the ease-of-use and enterprise features typically associated with modern commercial database systems are not fully available with Presto. Missing are an out-of-the-box, simple-to-use installer, database monitoring and administration tools, and third-party integrations. Therefore, Teradata’s initial contributions will focus on these areas, with the goal of bridging the gap to getting Presto widely deployed in traditional enterprise applications. This will hopefully lead to more contributors and momentum for Presto.

For now, Teradata’s new commitments to open source contributions in the Hadoop ecosystem are focused on Presto. Teradata’s commitment to Presto, and to making meaningful contributions to an open source project, is an exciting development. It will likely have a significant impact on enterprise adoption of Presto. Hopefully, Presto will become a widely used open source parallel query execution engine -- not just within the Hadoop community but, thanks to the generality of its design and its storage-layer agnosticism, for relational data stored anywhere.


Learn more or download Presto now.


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). Follow Daniel on Twitter @Daniel_Abadi