
News: Teradata Database on AWS

Posted on: October 7th, 2015 by Guest Blogger


I’m very excited about Teradata’s latest cloud announcement – that Teradata Database, the market’s leading data warehousing and analytic solution, is being made available for cloud deployment on Amazon Web Services (AWS) to support production workloads. Teradata Database on AWS will be offered on a variety of multi-terabyte EC2 (Elastic Compute Cloud) instances in supported AWS regions via a listing in the AWS Marketplace.

I believe the news is significant because it illustrates a fundamentally new deployment option for Teradata Database, which has long been the industry’s most respected engine for production analytics. By incorporating AWS as the first public cloud offering for deploying a production Teradata Database, we will make it easier for companies of all sizes to become data-driven with best-in-class data warehousing and analytics.

Teradata Database on AWS will be quite different from what we currently offer. It will be the first time that Teradata Database is offered for production workloads on the public cloud. Previous products such as Teradata Express and Teradata Virtual Machine Edition have been positioned for test and development or sandbox use cases. This is also the first time that Teradata Database has been optimized for the AWS environment; previously, Teradata Virtual Machine Edition was optimized only for the VMware virtualized environment.

Fear not: Teradata Cloud, our purpose-built managed environment, will continue to evolve, thrive, and prosper alongside Teradata Database on AWS. Customers of all sizes and verticals – both existing and new – find great value and convenience in Teradata Cloud for a variety of use cases, such as production analytics, test and development, quality assurance, data marts, and disaster recovery. In fact, we recently announced that Core Digital Media is using Teradata Cloud for DR.

So who is the intended customer base for Teradata Database on AWS? Well, literally any organization of any size can use it and benefit, because it combines the power of Teradata with the convenience of AWS for a win-win value proposition. I think large, existing Teradata integrated data warehouse and Teradata appliance customers will find it just as flexible and capable as small, new-to-Teradata customers. It opens up a whole new market in terms of accessibility and deployment flexibility.

So one might ask: Why has Teradata not offered this product before now? Frankly, cloud computing has evolved substantially over the past few years in terms of convenience, security, performance, and market adoption. We see substantial opportunity to provide more deployment options than previously existed as well as expand the addressable market for production data warehousing and analytics.

And, while there are many other companies that offer database, data warehouse, and analytic products, none has Teradata’s track record of market leadership, which puts it in a league of its own. The key thing to know is that this is Teradata Database, the market’s leading data warehouse and analytic engine for over three decades. The same software and the same winning DNA that powers our existing portfolio of products is what drives this instantiation too.

Teradata Database on AWS will be available in Q1 2016 for global deployment. It will be offered on a variety of EC2 instance types up to multiple terabytes via a listing in the AWS Marketplace. Customers will be able to deploy it standalone on AWS or use it to complement both on-premises and Teradata Cloud environments.

I think it’s going to be great, and I hope you do too. Feel free to let me know if you have any questions about our latest cloud innovation.

Brian Wood

Brian Wood is director of cloud marketing at Teradata. He is a knowledgeable technology marketing executive with 15+ years of digital, lead gen, sales / marketing operations & team leadership success. He has an MS in Engineering Management from Stanford University, a BS in Electrical Engineering from Cornell University, and served as an F-14 Radar Intercept Officer in the US Navy.


Where Data Lives

Posted on: October 1st, 2015 by Data Analytics Staff


By: Lorie Nelson, Senior Product Manager for Teradata’s Travel and Hospitality Data Model

As a child I was sure there must be a book of life that explains what we need to know about each other, the planet and our purpose in the world. I asked every adult I knew or came in contact with if they knew of such a book but no one had an answer.

Initially, I was disappointed until the idea came to me, “I will write my own book.” And, this began my first data collection project - the beginning of my fascination with data.

I started my project by taking notes about people I knew. I explored my mother’s desk, pulled out some index cards and started to write.

I had no idea what I was going to do with the data. But I was sure that if I just kept at it, I would recognize some clue or pattern… the key to understanding life, unlocking the secret object of my desire. Eventually I stopped collecting data on notecards to pursue other interests, but continued to collect data, as we all do, through personal experience.

The Data Living in Our Brains

Our brains are essentially “big data” platforms with their own unique wiring, algorithms if you will, for data collection, storage and retrieval. An article in Scientific American tells us, “...the brain’s memory storage capacity is something closer to around 2.5 petabytes...” That data is a combination of structured and unstructured information. Two-thirds of our brain is devoted to processing visual data, while the remainder processes our thoughts, perceptions, textural input and output, and so on. The amazing and curious thing is how all of this data comes together to make up our individual stories. Our personal stories also contain information gathered from our network of family, friends and affiliated communities. We are much smarter collectively than we are as individuals.

When we have a question or problem, our process is to examine our personal data collections, make discernments about the validity of that data and hopefully, find an answer to our problem. But, if we only search our own mind’s collection, the solution is often obscured. Sometimes the solution requires a “refresh” from our smart collective network of friends and colleagues, and we may also reach beyond our personal network to Google a question or post our problem on a group board like LinkedIn.

The Value of Going Beyond Our Own Data

This is exactly what innovative companies such as Amazon, Google, Microsoft, eBay and The New York Times, to name a few, have been doing. They are tapping into their own data stores and, with the help of data scientists and data artists, they are beginning to understand data in new ways by co-mingling their data with the vast stores of data from other sources outside of their own collections.

Businesses can discover valuable insight into their product development, delivery and marketing, for example, by combining their internal data with public data to determine sentiment from sources such as Twitter, Facebook, Yelp and Google Alerts.

Earlier this year, I landed on the Harvard website and discovered a treasure trove of knowledge and the object of desire for my inner 7-year-old: the Harvard Dataverse!

The Harvard Dataverse Project is an open source web application developed by the Data Science team at Harvard’s Institute for Quantitative Social Science (IQSS) and is dedicated to sharing, archiving, citing, exploring and analyzing research data across all research fields. Coding of the Dataverse Network software began in 2006. The Dataverse repository hosts multiple dataverses, and the datasets in each dataverse contain descriptive metadata and data files (including documentation and code). These collections open doors to researchers, writers, publishers and affiliated institutions. Other universities and research institutions around the world have joined forces with Harvard and are creating their own dataverses.

A few years ago, while attending a data visualization workshop, my instructor, Jer Thorp, introduced me to another great source of data: The New York Times. Their database offers up over 50 years of articles, searchable to those who apply and use the NYT data structure format for submitting queries. Using a programming language called Processing, I transformed my NYT result dataset into a beautiful radial diagram that showed the concentration of my search phrase over time as it related to certain keywords. What was even more interesting to me about this visualization, unlike the typical bar charts and pie charts we have all seen, was my ability to see the outliers as well. Sometimes it is the outlier that tells a more interesting story than the larger concentrations of occurrences.
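
For readers who want to experiment with a similar view, here is a minimal sketch of that kind of radial “concentration over time” diagram. It uses Python and matplotlib rather than Processing, and the yearly article counts are invented purely for illustration.

```python
# A minimal radial "concentration over time" sketch (Python + matplotlib,
# not Processing). The article counts per year are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1965, 2015)                               # 50 years of coverage
counts = np.random.default_rng(42).poisson(8, years.size)   # hypothetical hits per year
counts[[10, 33]] = [45, 60]                                 # inject a couple of outliers

theta = np.linspace(0.0, 2 * np.pi, years.size, endpoint=False)
ax = plt.subplot(projection="polar")
ax.bar(theta, counts, width=2 * np.pi / years.size, bottom=0.0, alpha=0.7)
ax.set_xticks(theta[::5])
ax.set_xticklabels(years[::5])
ax.set_title("Mentions of a search phrase by year (hypothetical data)")
plt.show()
```

The outliers stand out immediately as unusually long spokes, which is exactly the kind of detail a bar or pie chart tends to bury.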

Advances in the areas of big data and analytics coupled with existing technologies and access to internal and external data will contribute to exponential growth, true innovation and creativity in solving business and scientific questions. As Hans Rosling said, “let the dataset change your mindset.”

In my next blog, “The Data You Weren’t Looking For,” I will expand upon this topic to cover innovations in data discoveries and visualizations by current data scientists and data artists.


Lorie Nelson is the Senior Product Manager for Teradata’s Travel and Hospitality Data Model. She describes herself as an artist-in-resistance with a passion for data.


Teradata Uses Open Source to Expand Access to Big Data for the Enterprise

Posted on: September 30th, 2015 by Data Analytics Staff


By Mark Shainman, Global Program Director, Competitive Programs

Teradata’s announcement of the accelerated release of enterprise-grade ODBC/JDBC drivers for Presto opens up an ocean of big data on Hadoop to the existing SQL-based infrastructure. For companies seeking to add big data to their analytical mix, easy access through Presto can solve a variety of problems that have slowed big data adoption. It also opens up new ways of querying data that were not possible with some other SQL on Hadoop tools. Here’s why.

One of the big questions facing those who toil to create business value out of data is how the worlds of SQL and big data come together. After the first wave of excitement about the power of Hadoop, the community quickly realized that because of SQL’s deep and wide adoption, Hadoop must speak SQL. And so the race began. Hive was first out of the gate, followed by Impala and many others. The goal of all of these initiatives was to make the repository of big data that was growing inside Hadoop accessible through SQL or SQL-like languages.

In the fall of 2012, Facebook determined that none of these solutions would meet its needs. Facebook created Presto as a high-performance way to run SQL queries against data in Hadoop. By 2013, Presto was in production, and it was released as open source in November of that year.

In 2013, Facebook found that Presto was faster than Hive/MapReduce for certain workloads, although there are many efforts underway in the Hive community to increase its speed. Facebook achieved these gains by bypassing the conventional MapReduce programming paradigm and creating a way to interact with data in HDFS, the Hadoop file system, directly. This and other optimizations at the Java Virtual Machine level allow Presto not only to execute queries faster, but also to use other stores for data. This extensibility allows Presto to query data stored in Cassandra, MySQL, or other repositories. In other words, Presto can become a query aggregation point, that is, a query processor that can bring data from many repositories together in one query.
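
To make the idea of a query aggregation point concrete, here is a small sketch of a single Presto query that joins a Hive-backed table with a MySQL-backed table. It assumes the open source presto-python-client package (imported as prestodb); the host, catalogs, schemas, and table names are hypothetical.

```python
# A minimal sketch of Presto as a "query aggregation point" using the
# presto-python-client package. Host, catalogs, and tables are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",      # default catalog; the query may still reference others
    schema="default",
)
cur = conn.cursor()

# One query joining data that lives in Hadoop (hive catalog) with data that
# lives in MySQL (mysql catalog) -- Presto resolves both connectors.
cur.execute("""
    SELECT c.region, sum(o.amount) AS revenue
    FROM hive.web.orders o
    JOIN mysql.crm.customers c ON o.customer_id = c.id
    GROUP BY c.region
""")
for region, revenue in cur.fetchall():
    print(region, revenue)
```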

In June 2015, Teradata announced a full embrace of Presto. Teradata would add developers to the project, add missing features both as open source and as proprietary extensions, and provide enterprise-grade support. This move was the next step in Teradata’s effort to bring open source into its ecosystem. The Teradata Unified Data Architecture provides a model for how traditional data warehouses and big data repositories can work together. Teradata has supported integration of open source first through partnerships with open source Hadoop vendors such as Hortonworks, Cloudera, and MapR, and now through participation in an ongoing open source project.

Teradata’s embrace of Presto provided its customers with a powerful combination. Through Teradata QueryGrid, analysts can use the Teradata Data Warehouse as a query aggregation point and gather data from Hadoop systems, other SQL systems, and Presto. The queries in Presto can aggregate data not only from Hadoop, but also from Cassandra and other systems. This powerful capability allows Teradata’s Unified Data Architecture to provide data access across a broad spectrum of big data platforms.

Providing Presto support for mainstream BI tools required two things: ANSI SQL support and ODBC/JDBC drivers. Much of the world of BI access works through BI toolsets that understand ANSI SQL. A tool like QlikView, MicroStrategy, or Tableau allows a user to easily query large datasets as well as visualize the data without having to hand-write SQL statements, opening up the world of data access and data analysis to a larger number of users. Having robust BI tool support is critical for broader adoption of Presto within the enterprise.

For this reason, ANSI SQL support is crucial to making the integration and use of BI tools easy. Many of the other SQL-on-Hadoop projects are limited in their SQL support or use proprietary SQL-like languages. Presto is not one of them. To meet the needs of Facebook, SQL support had to be strong and conform to ANSI standards, and Teradata’s joining the project will make the scope and support of SQL in Presto stronger still.

The main way that BI tools connect and interact with databases and query engines is through ODBC/JDBC drivers. For the tools to communicate well and perform well, these drivers have to be solid and enterprise class. That’s what yesterday’s announcement is all about.

Teradata has listened to the needs of the Presto community and accelerated its plans for adding enterprise-grade ODBC/JDBC support to Presto. In December, Teradata will make available a free, enterprise class, fully supported ODBC driver, with a JDBC driver to follow in Q1 2016. Both will be available for download on

With ODBC/JDBC drivers in place and the ANSI SQL support that Presto offers, anyone using modern BI tools can access data in Hadoop through Presto. Of course, certification of the tools will be necessary for full functionality to be available, but with the drivers in place, access is possible. Existing users of Presto, such as Netflix, are extremely happy with the announcement. As Kurt Brown, Director of Data Platform at Netflix, put it, “Presto is a key technology in the Netflix big data platform. One big challenge has been the absence of enterprise-grade ODBC and JDBC drivers. We think it’s great that Teradata has decided to accelerate their plans and deliver this feature this year.”
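
As a rough sketch of what that driver-based access could look like from a script (BI tools generate comparable ANSI SQL behind the scenes), the snippet below uses Python's generic pyodbc module against a configured Presto data source. The DSN name, catalog, schema, and table are hypothetical.

```python
# Driver-based access to Presto once an ODBC driver is installed and a DSN is
# configured. "PrestoDSN" and the table below are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect("DSN=PrestoDSN", autocommit=True)
cur = conn.cursor()

# Plain ANSI SQL, the same kind of statement a BI tool would generate.
cur.execute("""
    SELECT page, count(*) AS views
    FROM hive.weblogs.pageviews
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
for page, views in cur.fetchall():
    print(page, views)
```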


By Vinnie Dessecker, senior consultant, big data – Strategy and Governance Center of Excellence

What constitutes a good data strategy? Is a well thought-out and articulated data strategy really relevant in this rapidly changing big data environment? Or, is a data strategy too easily rendered obsolete?

A data strategy is a vision, one that builds the foundation for a business to organize and manage all of its data to provide the maximum value to the organization. A data strategy is best accomplished through a roadmap. The roadmap should embody the plan to leverage the data that is available to a company to provide a competitive advantage – to build a data-driven organization.

I love to hike the hills of Southern California. The data strategy vision reminds me of my hiking strategy. I have a vision of where I want to go and I look up periodically to see where the top of the hill is, but most of the time I’m looking 10 to 12 steps ahead, focusing on steady progress. Sometimes I have to adjust my path, but generally I stay on the trail. (Especially in the spring when snakes are everywhere!) If I forget that I’m headed for the top of the hill or forget to check occasionally to make sure I’m still walking in that direction, where will I end up? Probably some place I don’t want to be!

Every Journey Starts with Strategy

So, how do we build a good data strategy – a vision for data that isn’t subject to the whims of new technology and the vagaries of big data? How do we establish a vision that unifies and solidifies the collaboration between IT and the business?

“If you don't know where you are going, you'll end up someplace else.”

Yogi Berra

The data strategy becomes reality through the use of a roadmap – a roadmap that is centered on the organization’s business initiatives. Business initiatives drive the enabling and supporting technology capabilities (not the other way around). By understanding business initiatives and needs, solutions can be evaluated to determine the right information, applications and systems that will support those initiatives. Enabling capabilities such as data governance and those associated with enterprise data management (data quality, metadata management, master data management, data architecture, security and privacy, and data integration) are planned and implemented to the degree required to support the business initiatives – no more, no less.

Journey to the Top

The roadmap lays out the plan for all of these capabilities aligned to the business initiatives so that they can be accomplished in an incremental fashion (steady progress). And, as with my hiking, we periodically must look up the hill to ensure we’re headed in the right direction and adjust the roadmap accordingly. Unlike hiking, the business initiatives are subject to change, and the roadmap must adjust to accommodate them; however, part of the vision should be a solid architecture that is flexible and scalable, one that adjusts to these changes without requiring massive amounts of rework, not one that is so brittle that it breaks.

Without a data strategy, the people within an organization have no rules to follow when making decisions that are critical to the success of the organization. The absence of a strategy means that the organization ends up “someplace else” and, most probably, different parts of the organization end up in different, disconnected versions of “someplace else.”

What’s your strategy? Your roadmap?

If you’re still developing your vision or your strategy, I invite you to check out the roadmap services we offer.


Vinnie Dessecker is a senior consultant for big data within Teradata’s Strategy and Governance Center of Excellence. Her work aligns business and technical goals for data and information initiatives, including master data management, metadata management, data quality, content/document management, and the analytic roadmap and business intelligence ecosystems.

Why Should Big Data Matter to You?

Posted on: September 15th, 2015 by Marc Clark


With all the attention given to big data, it is no surprise that more companies feel pressure to explore the possibilities for themselves. The challenge for many has been the high barriers to entry. Put simply, big data has cost big bucks. Maybe even more perplexing has been uncertainty about just what big data might deliver for a given company. How do you know if big data matters to your business?

The advent of cloud-based data warehouse and analytics systems can eliminate much of that uncertainty. For the first time, it is possible to explore the value proposition of big data without the danger of drowning the business in the costs and expertise needed to get big data infrastructure up and running.


Subscription-based models replace the need to purchase expensive hardware and software with the possibility of a one-stop-shopping experience where everything—from data integration and modeling tools to security, maintenance and support—is available as a service. Best of all, the cloud makes it feasible to evaluate big data regardless of whether your infrastructure is large and well-established with a robust data warehouse, or virtually nonexistent and dependent on numerous Excel worksheets for analysis.

Relying on a cloud analytics solution to get you started lets your company test use cases, find what works best, and grow at its own pace.

Why Big Data May Matter

Without the risk and commitment of building out your own big data infrastructure, your organization is free to explore the more fundamental question of how your data can influence your business. To figure out if big data analytics matters to you, ask yourself and your company a few questions:

  • Are you able to take advantage of the data available to you in tangible ways that impact your business?
  • Can you get answers quickly to questions about your business?
  • Is your current data environment well integrated, or a convoluted and expensive headache?

For many organizations, the answer to one or more of these questions is almost certainly a sore point. This is where cloud analytics offers alternatives, giving you the opportunity to galvanize operations around data instead of treating data and your day-to-day business as two separate things. The ultimate promise of big data is not one massive insight that changes everything. The goal is to create a ceaseless conveyor belt of insights that impact decisions, strategies, and practices up, down, and across the operational matrix.

The Agile Philosophy for Cloud Analytics

We use the word agile a lot, and cloud analytics embraces that philosophy in important new ways. In the past, companies have invested a lot of time, effort, and money in building infrastructure to integrate their data and create models, only to find themselves trapped in an environment that doesn’t suit their requirements.

Cloud analytics provides a significant new path. It's a manageable approach that enables companies to get to important questions without bogging down in technology, and to really figure out what value is lurking in their data and what its impact might be.

To learn more, download our free Enterprise Analytics in the Cloud eBook.

Pluralism and Secularity In a Big Data Ecosystem

Posted on: August 25th, 2015 by Guest Blogger


Solutions around today's analytic ecosystem are too technically driven, with too little focus on business value. The buzzwords seem to outpace the reality of implementation and cost of ownership. I challenge you to view your analytic architecture using pluralism and secularity. Without such a view of this world, your resume will fill out nicely but your business value will suffer.

In my previous role, prior to joining Teradata, I was given the task of trying to move "all" of our organization’s BI data to Hadoop. I will share my approach - how best-in-class solutions come naturally when pluralism and secularity are used to support a business-first environment.

Big data has exposed some great insights into what we can, should, and need to do with our data. However, this space is filled with radical opinions and the pressure to "draw a line in the sand" between time-proven methodologies and what we know as "big data." Some may view these spaces moving in opposite directions; however, these spaces will collide. The question is not "if" but "when." What are we doing now to prepare for this inevitability? Hadapt seems to be moving in the right direction in terms of leadership between the two spaces.

Relational Databases
I found many of the data sets in relational databases to be lacking in structure, highly transient, and loosely coupled. Data scientists needed to have quick access to data sets to perform their hypothesis testing.

Continuously requesting IT to rerun their ETL processes was highly inefficient. A data scientist once asked me, "Why can't we just dump the data in a Linux mount for exploration?" Schema-on-write was too restrictive, as the data scientists could not predefine the attributes of a data set before ingestion. As the data sets became more complex and unstructured, the ETL processes became exponentially more complicated and performance was hindered.

I also found during this exercise that my traditional BI analysts were perplexed when formulating questions about the data. One of the reasons was that the business did not know what questions to ask. This is a common challenge in the big data ecosystem. We are used to knowing our data and being able to come up with incredible questions about it. The BI analyst's world has been disrupted as they now need to ask, "What insights/answers do I have about my data?" (according to Ilya Katsov in one of his blogs).

The product owner of Hadoop was convinced that the entire dataset should be hosted on Amazon Web Services (S3), which would allow our analytics (via Elastic MapReduce) to perform at incredible speeds. However, due to various ISO guidelines, the data sets had to be encrypted at rest and in transit, which degraded performance by approximately 30 percent.

Without an access path model, logical model, or unified model, business users and data scientists were left with little appetite for unified analytics. Data scientists were left to their own guidelines for integrated/federated/governed/liberated post-discovery analytical sets.

Communication with the rest of the organization became an unattainable goal. The models which came out of discovery were not federated across the organization, as there was a disconnect between the data scientists, data architects, Hadoop engineers, and data stewards -- who spoke different languages. Data scientists were creating amazing predictive models while, at the same time, data stewards were looking for tools to help them provide predictive insight into the SAME DATA.

Using NoSQL, each specific question on a dataset required a new collection set. Maintaining and governing the numerous collections became a burden. There had to be a better way to answer many questions without the number of collections instantiated growing linearly. The answer may lie within access path modeling.

Another challenge I faced was when users wanted a graphical representation of the data and the embedded relationships or lack thereof. Are they asking for a data model? The users would immediately say no, since they read in a blog somewhere that data modeling is not required using NoSQL technology.

At the end of this entire implementation I found myself needing to integrate these various platforms for the sake of providing a business-first solution. Maybe the line in the sand isn't a business-first approach? Those who drive Pluralism (a condition or system in which two or more states, groups, principles, sources of authority, etc., coexist) and Secularity (not being devoted to a specific technology or data 'religion') within their analytic ecosystem can truly deliver a business-first solution approach while avoiding the proverbial "silver bullet" architecture solutions.

In my coming post, I will share some of the practices for access path modeling within Big Data and how it supports pluralism and secularity within a business-first analytic ecosystem.

Sunile Manjee

Sunile Manjee is a Product Manager in Teradata’s Architecture and Modeling Solutions team. Big Data solutions are his specialty, along with the architecture to support a unified data vision. He has over 12 years of IT experience as a Big Data architect, DW architect, application architect, IT team lead, and 3gl/4gl programmer.

Optimization in Data Modeling 1 – Primary Index Selection

Posted on: July 14th, 2015 by Guest Blogger


In my last blog I spoke about the decisions that must be made when transforming an Industry Data Model (iDM) from Logical Data Model (LDM) to an implementable Physical Data Model (PDM). However, being able to generate DDL (Data Definition Language) that will run on a Teradata platform is not enough – you also want it to perform well. While it is possible to generate DDL almost immediately from a Teradata iDM, each customer’s needs mandate that existing structures be reviewed against data and access demographics, so that optimal performance can be achieved.

Having detailed data and access path demographics during PDM design is critical to achieving great performance immediately; otherwise, it’s simply guesswork. Alas, these are almost never available at the beginning of an installation, but that doesn’t mean you can’t make “excellent guesses.”

The single most influential factor in achieving PDM performance is proper Primary Index (PI) selection for warehouse tables. Data modelers are focused on entity/table Primary Keys (PKs), since the PK is what defines uniqueness at the row level. Because of this, a lot of physical modelers tend to implement the PK as a Unique Primary Index (UPI) on each table by default. But one of the keys to Teradata’s great performance is that it utilizes the PI to physically distribute data within a table across the entire platform to optimize parallelism. Each processor gets a piece of the table based on the PI, so rows from different tables with the same PI value are co-resident and do not need to be moved when two tables are joined.

In a Third Normal Form (3NF) model, no two entities (outside of super/subtypes and rare exceptions) will have the same PK, so if PKs are chosen as PIs, it stands to reason that no two tables share a PI and every table join will require data from at least one table to be moved before the join can be completed – not a solid performance decision, to say the least.

The iDMs have preselected PIs largely based on identifiers common across subject areas (e.g., Party Id) so that all information regarding that ID will be co-resident and joins will be AMP-local. These non-unique PIs (NUPIs) are a great starting point for your PDM, but again they need to be evaluated against customer data and access plans to ensure that both performance and reasonably even data distribution are achieved.

Even data distribution across the Teradata platform is important, since skewed data can contribute both to poor performance and to space allocation problems (run out of space on one AMP, run out of space on all). However, it can be overemphasized to the detriment of performance.

Say, for example, a table has a PI of PRODUCT_ID, and there are a disproportionate number of rows for several products, causing skewed distribution. Altering the PI to the table PK instead will provide perfectly even distribution, but remember: when joining to that table, if all elements of the PK are not available, then the rows of the table will need to be redistributed, most likely by PRODUCT_ID.

This puts them back on the AMPs where they were in the skewed scenario. This time, instead of a “rest state” skew, the rows will skew during redistribution, and this will happen every time the table is joined to – not a solid performance decision. Optimum performance can therefore be achieved with sub-optimum distribution.
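
Both effects, co-residency under a shared NUPI and skew under a hot PI value, can be illustrated with a few lines of Python. This is only a sketch of hash-based row placement, not Teradata's actual hashing algorithm, and the values are invented.

```python
# Illustration only: rows are placed on AMPs by hashing the PI value. This
# mimics the idea of hash distribution, not Teradata's actual algorithm.
from collections import Counter
import hashlib

NUM_AMPS = 8

def amp_for(pi_value):
    """Map a primary index value to an AMP by hashing it."""
    digest = hashlib.md5(str(pi_value).encode()).hexdigest()
    return int(digest, 16) % NUM_AMPS

# PARTY and ACCOUNT both use Party_Id as a NUPI, so rows for a given
# Party_Id land on the same AMP in both tables and the join is AMP-local.
party_id = 12345
print("PARTY row on AMP", amp_for(party_id), "- ACCOUNT rows on AMP", amp_for(party_id))

# A well-behaved NUPI spreads rows evenly across the AMPs...
print("Even:  ", Counter(amp_for(p) for p in range(10_000)))

# ...while a PI with a few hot values (e.g., two dominant PRODUCT_IDs)
# piles rows onto a couple of AMPs -- the skewed "rest state" above.
hot = [42] * 6_000 + [7] * 3_000 + list(range(100, 1_100))
print("Skewed:", Counter(amp_for(p) for p in hot))
```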

iDM tables relating two common identifiers will usually have one of the IDs pre-selected as a NUPI. In some installations the access demographics will show that the other ID may be the better choice. If so, change it! Or it may leave you with no clear choice, in which case picking one is almost assuredly better than changing the PI to a composite index consisting of both IDs, as this will only result in a table that is no longer co-resident with any table indexed by either of the IDs alone.

There are many other factors that contribute to achieving optimal performance of your physical model, but they all pale in comparison to a well-chosen PI. In my next blog we’ll look at some more of these and discuss when and how best to implement them.


Jake Kurdsjuk is Product Manager for the Teradata Communications Industry Data Model, purchased by more than one hundred Communications Service Providers worldwide. Jake has been with Teradata since 2001 and has 25 years of experience working with Teradata within the Communications Industry, as a programmer, DBA, Data Architect and Modeler.

Why We Love Presto

Posted on: June 24th, 2015 by Daniel Abadi


Concurrent with acquiring Hadoop companies Hadapt and Revelytix last year, Teradata opened the Teradata Center for Hadoop in Boston. Teradata recently announced that a major new initiative of this Hadoop development center will include open-source contributions to a distributed SQL query engine called Presto. Presto was originally developed at Facebook, and is designed to run high performance, interactive queries against Big Data wherever it may live --- Hadoop, Cassandra, or traditional relational database systems.

Among those who will be part of this initiative and contribute code to Presto is a subset of the Hadapt team that joined Teradata last year. In the following, we will dive deeper into the thinking behind this new initiative from the perspective of the Hadapt team. It is important to note upfront that Teradata’s interest in Presto, and the people contributing to the Presto codebase, extends beyond the Hadapt team that joined Teradata last year. Nonetheless, it is worthwhile to understand the technical reasoning behind the embrace of Presto from Teradata, even if it presents a localized view of the overall initiative.

Around seven years ago, Ashish Thusoo and his team at Facebook built the first SQL layer over Hadoop as part of a project called Hive. At its essence, Hive was a query translation layer over Hadoop: it received queries in a SQL-like language called Hive-QL, and transformed them into a set of MapReduce jobs over data stored in HDFS on a Hadoop cluster. Hive was truly the first project of its kind. However, since its focus was on query translation into the existing MapReduce query execution engine of Hadoop, it achieved tremendous scalability, but poor efficiency and performance, and ultimately led to a series of subsequent SQL-on-Hadoop solutions that claimed 100X speed-ups over Hive.

Hadapt was the first such SQL-on-Hadoop solution that claimed a 100X speed-up over Hive on certain types of queries. Hadapt was spun out of the HadoopDB research project from my team at Yale and was founded by a group of Yale graduates. The basic idea was to develop a hybrid system that is able to achieve the fault-tolerant scalability of the Hive MapReduce query execution engine while leveraging techniques from the parallel database system community to achieve high performance query processing.

The intention of HadoopDB/Hadapt was never to build its own query execution layer. The first version of Hadapt used a combination of PostgreSQL and MapReduce for distributed query execution. In particular, the query operators that could be run locally, without reliance on data located on other nodes in the cluster, were run using PostgreSQL’s query operator set (although Hadapt was written such that PostgreSQL could be replaced by any performant single-node database system). Meanwhile, query operators that required data exchange between multiple nodes in the cluster were run using Hadoop’s MapReduce engine.

Although Hadapt was 100X faster than Hive for long, complicated queries that involved hundreds of nodes, its reliance on Hadoop MapReduce for parts of query execution precluded sub-second response time for small, simple queries. Therefore, in 2012, Hadapt started to build a secondary query execution engine called “IQ” which was intended to be used for smaller queries. The idea was that all queries would be fed through a query-analyzer layer before execution. If the query was predicted to be long and complex, it would be fed through Hadapt’s original fault-tolerant MapReduce-based engine. However, if the query would complete in a few seconds or less, it would be fed to the IQ execution engine.

In 2013, Hadapt integrated IQ with Apache Tez in order to avoid redundant programming efforts, since the primary goals of IQ and Tez were aligned. In particular, Tez was designed as an alternative to MapReduce that can achieve interactive performance for general data processing applications. Indeed, Hadapt was able to achieve interactive performance on a much wider range of queries when leveraging Tez than it was able to achieve previously.

Figure 1: Intertwined Histories of SQL-on-Hadoop Technology

Unfortunately, Tez was not quite a perfect fit as a query execution engine for Hadapt’s needs. The largest issue was that before shipping data over the network during distributed operators, Tez first writes this data to local disk. The overhead of writing this data to disk (especially when the size of the intermediate result set was large) precluded interactivity for a non-trivial subset of Hadapt’s query workload. A second problem is that the Hive query operators that are implemented over Tez use (by default) traditional Volcano-style row-by-row iteration. In other words, a single function invocation for a query operator processes just a single database record. This results in a large number of function calls to process a large dataset, and poor instruction cache locality, as the instructions associated with a particular operator are repeatedly reloaded into the instruction cache for each function invocation. Although Hive and Tez have started to alleviate this issue with the recent introduction of vectorized operators, Hadapt still found that query plans involving joins or SQL functions would fall back to row-by-row iteration.
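
The cost of row-at-a-time iteration is easy to reproduce outside the JVM. The toy comparison below, written in Python with NumPy standing in for a vectorized operator, only illustrates the per-record function-call overhead described above; it is not Presto or Tez code.

```python
# Toy illustration of row-at-a-time vs. vectorized (batch-at-a-time) filtering.
import time
import numpy as np

values = np.random.rand(2_000_000)

# Volcano-style: one function invocation per record.
def passes_filter(x):
    return x > 0.5

start = time.perf_counter()
count_rowwise = sum(1 for v in values if passes_filter(v))
rowwise_secs = time.perf_counter() - start

# Vectorized: one call processes the whole batch, with far fewer function
# invocations and much better cache behavior.
start = time.perf_counter()
count_vectorized = int((values > 0.5).sum())
vectorized_secs = time.perf_counter() - start

assert count_rowwise == count_vectorized
print(f"row-at-a-time: {rowwise_secs:.2f}s  vectorized: {vectorized_secs:.3f}s")
```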

The Hadapt team therefore decided to shift its query execution strategy (for the interactive query part of Hadapt’s engine) to Presto, which presented several advantages over Tez. First, Presto pipelines data between distributed query operators directly, without writing to local disk, significantly improving performance for network-intensive queries. Second, Presto query operators are vectorized by default, thereby improving CPU efficiency and instruction cache locality. Third, Presto dynamically compiles selective query operators to bytecode, which lets the JVM optimize and generate native machine code. Fourth, it uses direct memory management, thereby avoiding Java object allocations, their heap memory overhead and garbage collection pauses. Overall, Presto is a very advanced piece of software, and very much in line with Hadapt’s goal of leveraging as many techniques from modern parallel database system architecture as possible.

The Teradata Center for Hadoop has thus fully embraced Presto as the core part of its technology strategy for the execution of interactive queries over Hadoop. Consequently, it made logical sense for Teradata to take its involvement in Presto to the next level. Furthermore, Hadoop is fundamentally an open source project, and in order to become a significant player in the Hadoop ecosystem, Teradata needs to contribute meaningful and important code to the open source community. Teradata’s recent acquisition of Think Big serves as further motivation for such contributions.

Therefore, Teradata has announced that it is committed to making open source contributions to Presto, and has allocated substantial resources to doing so. Presto is already used by Silicon Valley stalwarts Facebook, Airbnb, Netflix, Dropbox, and Groupon. However, Presto’s enterprise adoption outside of Silicon Valley remains small. Part of the reason for this is that the ease-of-use and enterprise features typically associated with modern commercial database systems are not fully available with Presto. Missing are an out-of-the-box, simple-to-use installer, database monitoring and administration tools, and third-party integrations. Therefore, Teradata’s initial contributions will focus on these areas, with the goal of bridging the gap to getting Presto widely deployed in traditional enterprise applications. This will hopefully lead to more contributors and momentum for Presto.

For now, Teradata’s new commitments to open source contributions in the Hadoop ecosystem are focused on Presto. Teradata’s commitment to Presto and its commitment to making meaningful contributions to an open source project is an exciting development. It will likely have a significant impact on enterprise-adoption of Presto. Hopefully, Presto will become a widely used open source parallel query execution engine --- not just within the Hadoop community, but due to the generality of its design and its storage layer agnosticism, for relational data stored anywhere.


Learn more or download Presto now.


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). Follow Daniel on Twitter @Daniel_Abadi


I recently participated in a business analytics project for non-profits that, as the planning progressed, seemed like a perfect opportunity to implement an agile approach, except that the work was to be completed in two days! But all the developers would be co-located. We had three objectives that fit the profile of user stories. We would cleanse, analyze, and report on the data and, hopefully, discover some insights. We would have the business stakeholders in the room with us the whole time. But doing all this in two days seemed like agile on steroids to me. And it reminded me of an old Steven Wright joke: “I put instant coffee in the microwave and almost went back in time!”

So, if you put agile on steroids, can you go back in time? Well, maybe not, but we did accomplish a lot in those two days! The project was a DataDive, a collaboration between the non-profit DataKind and Teradata, held during the two days before the Teradata Partners 2014 conference.

I was a Data Ambassador, paired with another Data Ambassador, working with a non-governmental organization (NGO) to prepare for the DataDive and make sure we reached our goals. The NGO that DataKind assigned us to was iCouldBe, an organization that provides on-line mentoring to at-risk kids at over 250 schools in the U.S. Since I am not a data scientist or analyst, I found my role was gathering requirements from the business stakeholders at iCouldBe. I worked with them to prioritize the requirements and identify the expected business value. Sounds like the product owner role in “Scrum” -- right? My partner Data Ambassador worked with the head of IT at iCouldBe to identify the data we needed and worked to prepare it for the DataDive. This is similar to a Scrum project, where preparatory work must be completed to be ready for the first sprint.

DataKind wanted us to identify the tasks to accomplish each user story, so I immediately thought about using a task board for the actual DataDive. I created one ahead of time in Excel that identified the tasks for each user story as well as the development and handoff phases for each story. I didn’t realize it at the time, but I was creating a Kanban board (a portion of the board is shown in the picture) that allowed us to track workflow.

Once I got to the DataDive, I recreated the Kanban board using flip chart paper and used sticky notes for the tasks, much the way it might be done for a real project. The user stories were listed in priority order from top to bottom. The tasks represented the metrics, dimensions, text and other analysis required to address the user stories. Some tasks supported multiple user stories, so we noted those and used that “re-use” to help prioritize. We placed these reusable tasks at the top of the board in the swimlane with the highest priority user story. (DataDive Kanban Board - Partial Workflow)


For example, the number of posts and words per post that mentors and mentees made in the online mentoring program was an important metric that iCouldBe wanted to calculate to help identify successful mentee completion of the program. Are mentees that write more posts and words per post more likely to complete the program? This question addresses the first user story. But number of posts and words per post can also be used to analyze the amount of engagement between mentors and their mentees and what areas of the curriculum need to be improved.

As the volunteers arrived, they chose tasks, focusing on the high priority tasks first, wrote their name on the sticky notes, and moved the note to the first development column, which was to review the available data.

At different times during the day, DataKind asked each team to review what they had done so far, and what they planned on doing next, similar to the daily standup in Scrum (and we actually did stand).

As the DataDive progressed to day two, only tasks for user stories 1 and 2 progressed across the board, but I reminded the team that some of the tasks we completed for the first two user stories also helped address the third user story. At the end of the DataDive, to better visually show this, I moved some of the sticky notes from user story 1 into the user story 3 swimlane. This way, we could show the business stakeholders from iCouldBe that, although we focused on the higher priority user stories 1 and 2, we had also partially addressed user story 3.

Although this project did not check all the boxes of a standard agile implementation, it served as a great opportunity for me to put some agile practices in motion in a real project and learn from it. One of the most important aspects was the close collaboration between the developers and stakeholders. It was great to see how thrilled the stakeholders were with the work we had accomplished in just two days!

While I wish I could go back in time and do the DataDive all over again, as it was a great personal experience for me, instead I’ll look to the future and apply what I’ve learned from this project to my next agile project.

Elisia Getts is a Sr. Product Manager, Certified Scrum Master (CSM), and member of the Teradata Agile COE. She has been with Teradata for 15 years and has over 25 years of experience in IT as a product manager, business/IT consultant, programmer/analyst, and technical writer supporting industries such as travel and hospitality, transportation and logistics, and defense. She is the team’s expert on Scrum.

Your Big Data Initiative may not Require Logical Modeling

Posted on: May 12th, 2015 by Guest Blogger


By: Don Tonner

Logical modeling may not be required on your next big data initiative. From experience, I know that when building things from scratch, a model reduces development costs, improves quality, and gets me to market quicker. So why would I say you may not require logical modeling?

Most data modelers are employed in forward engineering activities in which the ultimate goal is to create a database or an application used by companies to manage their businesses.  The process is generally:

  • Obtain an understanding of the business concepts that the database will serve.
  • Organize the business information into structured data components and constraints—a logical model.
  • Create data stores based on the logical model and let the data population and manipulation begin.

Forward engineering is the act of going from requirements to a finished product. For databases that means starting with a detailed understanding of the information of the business, which is found largely in the minds and practices of the employees of the enterprise. This detailed understanding may be thought of as a conceptual model. Various methods have evolved to document this conceptual richness; one example is the Object Role Model.

The conceptual model (detailed understanding of the enterprise; not to be confused with a conceptual high level E/R diagram) is transformed into a logical data model, which organizes data into structures upon which relational algebra may be performed. The thinking here is very mathematical. Data can be manipulated mathematically the same way we can manipulate anything else mathematically. Just like you may write an equation that expresses how much material it might take for a 3D printer to create a lamp, you may write an equation to show the difference between the employee populations of two different corporate regions.

The image that most of us have of a data model is not equations, variables or valid operations, but is the visual representation of the structures that represent the variables. Below you can see structures as well as relationships which are a kind of constraint.

Figure: Data Structures and Relationships

Ultimately these structures and constraints will be converted into data stores, such as tables, columns, indexes and data types, which will be populated with data that may be constrained by some business rules.

Massively parallel data storage architectures are becoming increasingly popular as they address the challenges of storing and manipulating almost unimaginable amounts of data. The ability to ingest data quickly is critical as volumes increase. One approach is receiving the data without prior verification of the structure. HDFS files or JSON datatypes are examples of storage that do not require knowledge of the structure prior to loading.

OK, imagine a project where millions of readings from hundreds of sensors from scores of machines are collected every shift, possibly into a data lake. Engineers discover that certain analytics performed on the machine data can potentially alert us to conditions that may warrant operator intervention. Data scientists will create several analytic metrics based on hourly aggregates of the sensor data. What’s the modeler’s role in all this?

The models you are going to use on your big data initiative likely already exist.  All you have to do is find them.

One thing would be to reverse engineer a model of the structures of the big data, which can provide visual clues to the meaning of the data. Keep in mind that big data sources may have rapidly changing schemas, so reverse engineering may have to occur periodically on the same source to gather potential new attributes. Also remember that a database of any kind is an imperfect representation of the logical model, which is itself an imperfect representation of the business. So there is much interpretation required to go from the reverse engineered model to a business understanding of the data.
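
As a hypothetical sketch of what that periodic reverse engineering could look like for a schema-less source, the snippet below scans newline-delimited JSON sensor readings and reports the attributes and value types it actually finds. The file name and fields are invented.

```python
# Infer a candidate structure from schema-less data: scan newline-delimited
# JSON records and collect the attributes and value types actually present.
import json
from collections import defaultdict

def infer_schema(path, sample_size=10_000):
    """Return {attribute: set of Python type names} seen in the sample."""
    schema = defaultdict(set)
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= sample_size:
                break
            record = json.loads(line)
            for key, value in record.items():
                schema[key].add(type(value).__name__)
    return dict(schema)

# Re-run periodically: newly arriving attributes simply show up as new keys.
print(infer_schema("sensor_readings.json"))
# e.g. {'machine_id': {'str'}, 'sensor': {'str'}, 'reading': {'float', 'int'}}
```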

One would also start reviewing an enterprise data model or the forward engineered data warehouse model. After all, the big data analytic can help point out which engines are experiencing conditions that need attention, but when you can match those engine analytics to the workload that day, the experience level of the operator, and the time since the last maintenance, you greatly expand the value of that analytic.

So how do you combine data from disparate platforms? A logical modeler in a forward engineering environment assures that all the common things have the same identifiers and data types, and this is built into the system. That same skill set needs to be leveraged if there is going to be any success performing cross-platform analytics. The identifiers of the same things on the different platforms need to be cross-validated in order to make apples-to-apples comparisons. If analytics are going to be captured and stored in the existing Equipment Scores section of the warehouse, the data will need to be transformed to the appropriate identifiers and data types. If the data is going to be joined on the fly via Teradata QueryGrid™, knowledge of these IDs and data types is essential for success and performance.
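
Here is a small, hypothetical sketch of that identifier and data type alignment using pandas: the sensor-side ID arrives as a zero-padded string while the warehouse key is an integer, so it is cast and renamed before the join. Column names and values are invented.

```python
# Cross-validate identifiers and data types before a cross-platform join.
import pandas as pd

# Hourly sensor aggregates landed from the data lake (IDs arrive as strings).
sensor_hourly = pd.DataFrame({
    "equip_id": ["0042", "0042", "0107"],
    "hour": pd.to_datetime(["2015-05-01 01:00", "2015-05-01 02:00", "2015-05-01 01:00"]),
    "vibration_score": [0.91, 0.87, 0.42],
})

# Equipment context from the warehouse (IDs are integers).
equipment = pd.DataFrame({
    "equipment_id": [42, 107],
    "operator_experience_yrs": [12, 3],
    "days_since_maintenance": [5, 60],
})

# Cast the lake-side ID to the warehouse type and name, then join
# apples to apples.
sensor_hourly["equipment_id"] = sensor_hourly["equip_id"].astype(int)
combined = sensor_hourly.merge(equipment, on="equipment_id", how="left")
print(combined[["equipment_id", "hour", "vibration_score", "days_since_maintenance"]])
```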

There are many other modern modeling challenges, let me know what has your attention.

Don Tonner is a member of the Architecture and Modeling Solutions team, and has worked on several cool projects such as Teradata Mapping Manager, the unification modules, and Solution Modeling Building Blocks. He is currently creating an Industry Dimensions development kit and working out how models might be useful when combining information from disparate platforms. You can also reach him on Twitter, @BigDataDon.