Teradata Uses Open Source to Expand Access to Big Data for the Enterprise

Posted on: September 30th, 2015 by Data Analytics Staff


By Mark Shainman, Global Program Director, Competitive Programs

Teradata’s announcement of the accelerated release of enterprise-grade ODBC/JDBC drivers for Presto opens up an ocean of big data on Hadoop to the existing SQL-based infrastructure. For companies seeking to add big data to their analytical mix, easy access through Presto can solve a variety of problems that have slowed big data adoption. It also opens up new ways of querying data that were not possible with some other SQL on Hadoop tools. Here’s why.

One of the big questions facing those who toil to create business value out of data is how the worlds of SQL and big data come together. After the first wave of excitement about the power of Hadoop, the community quickly realized that because of SQL’s deep and wide adoption, Hadoop must speak SQL. And so the race began. Hive was first out of the gate, followed by Impala and many others. The goal of all of these initiatives was to make the repository of big data that was growing inside Hadoop accessible through SQL or SQL-like languages.

In the fall of 2012, Facebook determined that none of these solutions would meet its needs. Facebook created Presto as a high-performance way to run SQL queries against data in Hadoop. By 2013, Presto was in production and released as open source in November of that year.

In 2013, Facebook found that Presto was faster than Hive/MapReduce for certain workloads, although there are many efforts underway in the Hive community to increase its speed. Facebook achieved these gains by bypassing the conventional MapReduce programming paradigm and creating a way to interact with data in HDFS, the Hadoop file system, directly. This and other optimizations at the Java Virtual Machine level allow Presto not only to execute queries faster, but also to use other stores for data. This extensibility allows Presto to query data stored in Cassandra, MySQL, or other repositories. In other words, Presto can become a query aggregation point, that is, a query processor that can bring data from many repositories together in one query.
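
To make the "query aggregation point" idea concrete, here is a minimal sketch of a single federated query, assuming the open source PyHive Presto client; the host, catalogs, schemas, and table names are placeholders and do not come from the announcement.

```python
# Minimal sketch of Presto as a query aggregation point: one query that joins
# a table in Hadoop (hive catalog) with a table in Cassandra (cassandra
# catalog). Host, catalog, schema, and table names are placeholders.
from pyhive import presto  # assumes the open source PyHive client

conn = presto.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator host
    port=8080,
    username="analyst",
)
cur = conn.cursor()

# Presto addresses tables as catalog.schema.table, so a single statement can
# combine data from more than one underlying store.
cur.execute("""
    SELECT u.region, COUNT(*) AS page_views
    FROM hive.web.page_views AS v
    JOIN cassandra.prod.users AS u
      ON v.user_id = u.user_id
    GROUP BY u.region
""")
for region, views in cur.fetchall():
    print(region, views)
```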

In June 2015, Teradata announced a full embrace of Presto. Teradata would add developers to the project, add missing features both as open source and as proprietary extensions, and provide enterprise-grade support. This move was the next step in Teradata’s effort to bring open source into its ecosystem. The Teradata Unified Data Architecture provides a model for how traditional data warehouses and big data repositories can work together. Teradata has supported integration of open source first through partnerships with open source Hadoop vendors such as Hortonworks, Cloudera, and MapR, and now through participation in an ongoing open source project.

Teradata’s embrace of Presto provided its customers with a powerful combination. Through Teradata QueryGrid, analysts can use the Teradata Data Warehouse as a query aggregation point and gather data from Hadoop systems, other SQL systems, and Presto. The queries in Presto can in turn aggregate data from Hadoop as well as from Cassandra and other systems. This powerful capability allows Teradata’s Unified Data Architecture to deliver data access across a broad spectrum of big data platforms.

Providing Presto support for mainstream BI tools requires two things: ANSI SQL support and ODBC/JDBC drivers. Much of the world of BI access works through BI toolsets that understand ANSI SQL. A tool like QlikView, MicroStrategy, or Tableau allows a user to easily query large datasets as well as visualize the data without having to hand-write SQL statements, opening up the world of data access and data analysis to a larger number of users. Robust BI tool support is critical for broader adoption of Presto within the enterprise.

For this reason, ANSI SQL support is crucial to making the integration and use of BI tools easy. Many of the other SQL-on-Hadoop projects are limited in their SQL support or use proprietary SQL-like languages. Presto is not one of them. To meet Facebook's needs, its SQL support had to be strong and conform to ANSI standards, and Teradata’s joining the project will make Presto’s SQL scope and support stronger still.

The main way that BI tools connect and interact with databases and query engines is through ODBC/JDBC drivers. For the tools to communicate well and perform well, these drivers have to be solid and enterprise class. That’s what yesterday’s announcement is all about.

Teradata has listened to the needs of the Presto community and accelerated its plans for adding enterprise-grade ODBC/JDBC support to Presto. In December, Teradata will make available a free, enterprise class, fully supported ODBC driver, with a JDBC driver to follow in Q1 2016. Both will be available for download on Teradata.com.

With ODBC/JDBC drivers in place and the ANSI SQL support that Presto offers, anyone using modern BI tools can access data in Hadoop through Presto. Of course, certification of the tools will be necessary for full functionality to be available, but with the drivers in place, access is possible. Existing users of Presto, such as Netflix, are extremely happy with the announcement. As Kurt Brown, Director, Data Platform at Netflix put it, “Presto is a key technology in the Netflix big data platform. One big challenge has been the absence of enterprise-grade ODBC and JDBC drivers. We think it’s great that Teradata has decided to accelerate their plans and deliver this feature this year.”
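
As a rough sketch of what driver-based access looks like once a driver is installed and a data source name is configured: the DSN name, table, and columns below are hypothetical, and nothing here is taken from the driver documentation.

```python
# Rough sketch of ODBC access to Presto once a driver is installed and a data
# source name is configured. The DSN name, table, and columns are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=PrestoDSN", autocommit=True)
cur = conn.cursor()

# The same ANSI SQL a BI tool would generate can be issued directly.
cur.execute(
    "SELECT product_category, SUM(sales) AS total_sales "
    "FROM hive.retail.orders GROUP BY product_category"
)
for category, total_sales in cur.fetchall():
    print(category, total_sales)
```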

Enterprise-ready Hadoop, Now Available as an Appliance

Posted on: September 28th, 2015 by Guest Blogger


By: Clarke Patterson, senior director of product marketing, Cloudera

Early this summer, Teradata and Cloudera jointly announced the Teradata Appliance for Hadoop with Cloudera, an engineered, ready-to-run appliance that comes with enterprise-ready Cloudera Enterprise, in addition to our existing software integrations.

Today, at Strata + Hadoop World in New York, we are excited to announce that customers can now order the Teradata Appliance for Hadoop with Cloudera.

Over the last couple of years, we have certainly seen Hadoop mature and shift from a proof-of-concept technology to an enterprise-ready platform. However, the time, skill sets, and resources needed are hard to come by, and not every organization can hire the best talent in the market to plan, deploy, and manage Hadoop clusters, let alone support and maintain the platform post-production.

The Teradata Appliance for Hadoop with Cloudera is built to satisfy the need to stand up a Hadoop cluster quickly and cost-effectively. Having an appliance allows organizations to simplify and accelerate the cluster deployment, enabling customers to focus their IT resources on fine-tuning the infrastructure to deliver business value, rather than investing valuable resources in the details of deployment, management, and support of the platform.

In addition to the benefits of an appliance form-factor, the Teradata Appliance for Hadoop with Cloudera also delivers all the benefits of enterprise-ready Hadoop with Cloudera Enterprise:

  • Enterprise security and governance for all mission-critical workloads – With Apache Sentry and Cloudera Navigator, Cloudera Enterprise provides multiple layers of security and governance, built to maintain the business agility and flexibility that Hadoop provides while meeting stringent security regulations and requirements. Compliance-ready at the core, Cloudera Enterprise is the only distribution that is fully PCI-certified.
  • Industry-leading management and support – Cloudera Manager features a best-in-class holistic interface that provides end-to-end system management and zero-downtime rolling upgrades. Combining the power of Cloudera Manager with Teradata Viewpoint and Teradata Vital Infrastructure, the Teradata Appliance for Hadoop with Cloudera provides intuitive tools for centralized management with powerful capabilities, even as the system scales.
  • Built on open standards – Cloudera is the leading open source Hadoop contributor, having added more major, enterprise-ready features to the Hadoop ecosystem, not just to the core. Over the years, Cloudera has worked with a large ecosystem of partners and development community members to promote open standards for data access and governance through the Cloudera Accelerator Program and the One Platform Initiative. With its Apache-licensed open source model, Cloudera ensures that data and applications remain the customer’s own, and provides an open platform that connects with all of their existing investments in technology and skills.

With all the hustle and bustle of Strata + Hadoop World this week, don’t forget to stop by the Cloudera booth and the Teradata booth to talk to us about the Teradata Appliance for Hadoop with Cloudera!

Clarke Patterson is the senior director of product marketing at Cloudera, responsible for Cloudera’s Platform for Big Data. Clarke joined Cloudera after spending almost three years in a similar role at Informatica. Prior to Informatica he held product management positions at IBM, Informix and Red Brick Systems. Clarke brings over 17 years of leadership experience to Cloudera, having led teams in product marketing, product management and engineering. He holds a Bachelor of Science degree from the University of Calgary and an MBA from Duke University’s Fuqua School of Business.

Education Planning – Leveraging the “Five W’s”

Posted on: September 21st, 2015 by Debi Hoefer


By Debi Hoefer, Director, Teradata Americas Education

First in a series about how customers can learn to get the most value from Teradata.

I’m sure many of you have heard of the “Five W's” – Who, What, Where, When, Why? The answers to these questions can be used for structured, straightforward information gathering. In this blog, I’ll explain how this technique can be used for collecting the details needed to plan Teradata education. Once you answer the “Five W's,” you can create a blueprint for training.

Teradata Education Planning – Leveraging the Five Ws


The first question to ask is Why? What is the reason the training is needed? Before attempting to answer any of the other “W's,” it is imperative that you understand the business reasons behind the training. The answer to this question serves as a foundation for the answers to the remaining “W's.”

To illustrate this point, the possible answers include one or a combination of the following:

  • New project using a new technology – identify the technology products/tools in which new skill development is needed.
  • New team member(s) – identify the project(s) they will be assigned to, as well as the technologies and tools used in each project.
  • Skills gap – this may be a bit tricky to identify, and usually manifests itself as poor performance or inefficient queries.


Okay, great! Now we understand why training is needed, and those reasons should drive the answer to the next “W,” Who? Who are the people to be trained? Do they have existing skill sets? Do they need deep technical skills or more general knowledge? Are they all in one location or spread around the world? For a smaller group, you may want to collect this information informally; for larger groups, build a survey to gather the data.


Moving on to the next “W,” What? What training content will build their skills?

The Teradata education curriculum is organized by job role, which provides a discrete and clear training path for each audience. We use a five-role structure to classify job functions:

  • Database Administrator
  • Designer/Architect
  • Application Developer
  • Data Analyst
  • Business User

Our Teradata Education Consultants can assist during this step of the process, or the Education Planning page on the Teradata Education Network website can be used as a guide.

One subject area that may need a bit more analysis is SQL. Teradata offers a variety of SQL courses, ranging from basic to advanced content. A very straightforward classification can be used for this part of the data gathering process:

  • No SQL Experience
  • Some Experience with SQL
  • SQL Veteran

Based on participants’ previous SQL experience, we can recommend the most appropriate course.


Now, we know why we’re doing training, who needs training, and what training they need. We almost have our blueprint. Now, we need to establish the education deployment plan. The answers to the last two “W's,” Where and When, are used to determine how the training will be delivered, as well as the timing and format.

Next is Where?, which here means the training delivery format for each course in the blueprint. Most training plans for new implementations or projects contain a mix of on-site and self-paced or virtual instructor-led training, based on the number of participants who require each course specified in the blueprint. In my next post I will talk more about virtual instructor-led classes.

If the course is self-paced or virtual instructor-led, we don’t need to explore further – “Where” can be any place around the globe with an internet connection. If on-site training is needed, class location can be determined by the geographical locations of the participants, convenience, cost implications, and classroom availability.


The final “W” is When? When is determined by various factors, such as:

  • When will work on the project need to commence?
  • When will the business need the training completed?
  • When are participants available?
  • When are Teradata public course offerings scheduled?
  • When is a classroom available in the planned location?

Although education planning for the implementation of a new technology may at first seem daunting, using the simple approach of the “Five W's” will help you gather your data simply and methodically.

For more information on Teradata Education courses, and to learn more about our new subscription-based approach to training, visit Teradata Education.


By Imad Birouty, Teradata Product Marketing Manager

In-memory database processing is a hot topic in the market today. It promises to bring high performance to OLTP and Data Warehouse environments.  As such, many vendors are working hard to develop in-memory database technology.

Memory is fast, but still expensive when compared to disk storage. As such, it should be treated as a precious resource and used wisely for the best return on your investment.

Teradata Intelligent Memory does just that. Through advanced engineering techniques, the Teradata Database automatically places the most frequently accessed data in memory, delivering in-memory performance with the cost economics of disk storage. The 80/20 rule and proven real-world data warehouse usage patterns show that a small percentage of the data accounts for the vast majority of data access. Teradata Database’s unique multi-temperature data management infrastructure makes it possible to leverage this and keep only the most frequently used data in memory to achieve in-memory performance for the entire database. This is cutting-edge technology and does not require a separate dedicated in-memory database to manage. And because it's built into the Teradata Database, companies get the scalability, manageability, and robust features associated with the Teradata Database.
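
As a back-of-the-envelope illustration of that 80/20 reasoning (this is a toy simulation with made-up numbers, not Teradata code), keeping only a small, frequently accessed slice of the data in memory can still serve the bulk of requests:

```python
# Illustrative only: with a skewed (80/20-style) access pattern, caching the
# hottest 20% of rows in memory satisfies roughly 80% of reads.
import random

random.seed(7)
NUM_ROWS = 100_000
MEMORY_FRACTION = 0.2                       # keep the hottest 20% in memory
hot_cutoff = int(NUM_ROWS * MEMORY_FRACTION)

def next_row_id():
    # 80% of accesses go to the hot 20% of rows, 20% go to everything else.
    if random.random() < 0.8:
        return random.randrange(hot_cutoff)
    return random.randrange(hot_cutoff, NUM_ROWS)

accesses = [next_row_id() for _ in range(1_000_000)]
in_memory_hits = sum(1 for r in accesses if r < hot_cutoff)
print(f"Served from memory: {in_memory_hits / len(accesses):.1%}")  # ~80%
```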

Forrester Research just released their inaugural Wave dedicated to in-memory:  The Forrester Wave™: In-Memory Database Platforms, Q3 2015 evaluation, naming Teradata a leader. Teradata has always been a pioneer in scalable, disk-based, shared-nothing RDBMS.  Because it has continued to evolve, change, and incorporate the latest technologies, the Teradata Database is now a leader in in-memory database processing too.

While the Forrester Wave evaluated Teradata Database 15.0, we are even more excited about Teradata Database 15.10, which utilizes even more advanced in-memory techniques integrated into the Teradata Database. New in-memory accelerators such as pipelining, vectorization, bulk qualification, and columnar storage bring in-memory performance to all data in the warehouse, including multi-structured data types such as JSON and weblogs that are associated with Big Data.

A free copy of the Forrester Wave report is available here, as well as today’s news release here. 

We’ll be announcing availability of Teradata Database 15.10 in a few weeks, so look for that announcement.


Optimization in Data Modeling 1 – Primary Index Selection

Posted on: July 14th, 2015 by Guest Blogger


In my last blog I spoke about the decisions that must be made when transforming an Industry Data Model (iDM) from Logical Data Model (LDM) to an implementable Physical Data Model (PDM). However, being able to generate DDL (Data Definition Language) that will run on a Teradata platform is not enough – you also want it to perform well. While it is possible to generate DDL almost immediately from a Teradata iDM, each customer’s needs mandate that existing structures be reviewed against data and access demographics, so that optimal performance can be achieved.

Having detailed data and access path demographics during PDM design is critical to achieving great performance immediately, otherwise it’s simply guesswork. Alas, these are almost never available at the beginning of an installation, but that doesn’t mean you can’t make “excellent guesses.”

The single most influential factor in achieving PDM performance is proper Primary Index (PI) selection for warehouse tables. Data modelers focus on entity/table Primary Keys (PKs), since the PK is what defines uniqueness at the row level. Because of this, many physical modelers implement the PK as a Unique Primary Index (UPI) on each table by default. But one of the keys to Teradata’s great performance is that it uses the PI to physically distribute data within a table across the entire platform to optimize parallelism. Each processor gets a piece of the table based on the PI, so rows from different tables with the same PI value are co-resident and do not need to be moved when two tables are joined.
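
As a simplified sketch of that distribution behavior (this is not Teradata's actual row-hash algorithm, and the tables and values are made up), hashing the PI value determines the AMP, so rows from different tables that share a PI value land together:

```python
# Simplified sketch of PI-based distribution (not Teradata's real row hash):
# hashing the PI value picks an AMP, so rows from different tables that share
# a PI value are co-resident and can be joined without moving data.
from collections import defaultdict
import hashlib

NUM_AMPS = 4

def amp_for(pi_value):
    # Stand-in for the row hash: hash the PI value, map it to an AMP.
    digest = hashlib.md5(str(pi_value).encode()).hexdigest()
    return int(digest, 16) % NUM_AMPS

orders    = [(101, "bicycle"), (102, "lock"), (103, "tire"), (101, "tire")]
customers = [(101, "Alice"), (102, "Bob"), (103, "Carol")]  # PI = customer id

placement = defaultdict(lambda: {"orders": [], "customers": []})
for cust_id, item in orders:
    placement[amp_for(cust_id)]["orders"].append((cust_id, item))
for cust_id, name in customers:
    placement[amp_for(cust_id)]["customers"].append((cust_id, name))

# Each customer row lands on the same AMP as that customer's orders, so an
# orders-to-customers join on the PI needs no row redistribution.
for amp, rows in sorted(placement.items()):
    print(amp, rows)
```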

In a Third Normal Form (3NF) model, no two entities (outside of super/subtypes and rare exceptions) will have the same PK. If the PK is chosen as the PI, it stands to reason that no two tables share a PI, and every table join will require data from at least one table to be moved before the join can be completed – not a solid performance decision, to say the least.

The iDMs have preselected PIs largely based on identifiers common across subject areas (e.g., Party Id), so that all information regarding that ID will be co-resident and joins will be AMP-local. These non-unique PIs (NUPIs) are a great starting point for your PDM, but again they need to be evaluated against customer data and access plans to ensure that both performance and reasonably even data distribution are achieved.
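
For concreteness, here is an illustrative pair of DDL statements (the table and column names are invented, not taken from any iDM) contrasting a default UPI on the PK with a NUPI on a common identifier; either string could be run through whatever Teradata client you normally use:

```python
# Illustrative DDL only (invented table and column names): the same table
# defined with a UPI on its primary key versus a NUPI on a common identifier.
UPI_ON_PK = """
CREATE TABLE sales_order (
    order_id   BIGINT NOT NULL,
    party_id   BIGINT NOT NULL,
    order_date DATE
) UNIQUE PRIMARY INDEX (order_id);
-- Unique and evenly distributed, but joins on party_id must redistribute rows.
"""

NUPI_ON_COMMON_ID = """
CREATE TABLE sales_order (
    order_id   BIGINT NOT NULL,
    party_id   BIGINT NOT NULL,
    order_date DATE
) PRIMARY INDEX (party_id);
-- Non-unique, but co-resident with other PARTY_ID-indexed tables, so those
-- joins stay AMP-local.
"""
```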

Even data distribution across the Teradata platform is important, since skewed data contributes both to poor performance and to space allocation problems (run out of space on one AMP and you effectively run out of space on all). However, it can be overemphasized to the detriment of performance.

Say, for example, a table has a PI of PRODUCT_ID, and a disproportionate number of rows for several products causes skewed distribution. Altering the PI to the table's PK instead will provide perfectly even distribution, but remember: when joining to that table, if all elements of the PK are not available, then the rows of the table will need to be redistributed, most likely by PRODUCT_ID.

This puts them back on the same AMPs where they sat in the skewed scenario. This time, instead of a “rest state” skew, the rows skew during redistribution, and this happens every time the table is joined to – not a solid performance decision either. Optimum performance can therefore be achieved with sub-optimum distribution.

iDM tables relating two common identifiers will usually have one of the IDs pre-selected as a NUPI. In some installations the access demographics will show that the other ID may be the better choice. If so, change it! Or the demographics may leave you with no clear choice, in which case picking one is almost assuredly better than changing the PI to a composite index consisting of both IDs, as this will only result in a table that is no longer co-resident with any table indexed by either of the IDs alone.

There are many other factors that contribute to achieving optimal performance of your physical model, but they all pale in comparison to a well-chosen PI. In my next blog we’ll look at some more of these and discuss when and how best to implement them.


Jake Kurdsjuk is Product Manager for the Teradata Communications Industry Data Model, purchased by more than one hundred Communications Service Providers worldwide. Jake has been with Teradata since 2001 and has 25 years of experience working with Teradata within the Communications Industry, as a programmer, DBA, Data Architect and Modeler.


It is well-known that there are two extreme alternatives for storing database tables on any storage media: storing it row-by-row (as done by traditional “row-store” technology) or storing it column-by-column (as done by recently popular “column-store” implementations). Row-stores store the entire first row of the table, followed by the entire second row of the table, etc. Column-stores store the entire first column of the table, followed by the entire second column of the table, etc. There have been huge amounts of research literature and commercial whitepapers that discuss the various advantages of these alternative approaches, along with various proposals for hybrid solutions (which I discussed in more detail in my previous post).

Despite the many conflicting arguments in favor of these different approaches, there is little question that column-stores compress data much better than row-stores. The reason is fairly intuitive: in a column-store, entire columns are stored contiguously --- in other words, a series of values from the same attribute domain are stored consecutively. In a row-store, values from different attribute domains are interspersed, thereby reducing the self-similarity of the data. In general the more self-similarity (lower entropy) you have in a dataset, the more compressible it is. Hence, column-stores are more compressible than row-stores.

In general, compression rates are very sensitive to the particular dataset that is being compressed. Therefore it is impossible to make any kind of guarantees about how much a particular database system/compression algorithm will compress an arbitrary dataset. However, as a general rule of thumb, it is reasonable to expect around 8X compression if a column-store is used on many kinds of datasets. 8X compression means that the compressed dataset is 1/8th the original size, and scan-based queries over the dataset can thus proceed approximately 8 times as fast. This stellar compression and resulting performance improvements are a major contributor to the recent popularity of column-stores.

It is precisely this renowned compression of column-stores which makes the compression rate of RainStor (a recent Teradata acquisition) so impressive in comparison. RainStor claims a factor of 5 times more compression than what column-stores are able to achieve on the same datasets, and 40X compression overall.

Although the reason why column-stores compress data better than row-stores is fairly intuitive, the reason why RainStor can compress data better than column-stores is less intuitive. Therefore, we will now explain this in more detail.

Take for example the following table, which is a subset of a table describing orders from a particular retail enterprise that sells bicycles and related parts. (A real table would have many more rows and columns, but we keep this example simple so that it is easier to understand what is going on).

Record | Order date | Ship date  | Product   | Price
1      | 03/22/2015 | 03/23/2015 | “bicycle” | 300
2      | 03/22/2015 | 03/24/2015 | “lock”    | 18
3      | 03/22/2015 | 03/24/2015 | “tire”    | 70
4      | 03/22/2015 | 03/23/2015 | “lock”    | 18
5      | 03/22/2015 | 03/24/2015 | “bicycle” | 250
6      | 03/22/2015 | 03/23/2015 | “bicycle” | 280
7      | 03/22/2015 | 03/23/2015 | “tire”    | 70
8      | 03/22/2015 | 03/23/2015 | “lock”    | 18
9      | 03/22/2015 | 03/24/2015 | “bicycle” | 280
10     | 03/23/2015 | 03/24/2015 | “lock”    | 18
11     | 03/23/2015 | 03/25/2015 | “bicycle” | 300
12     | 03/23/2015 | 03/24/2015 | “bicycle” | 280
13     | 03/23/2015 | 03/24/2015 | “tire”    | 70
14     | 03/23/2015 | 03/25/2015 | “bicycle” | 250
15     | 03/23/2015 | 03/25/2015 | “bicycle” | 280


The table contains 15 records and shows four attributes --- the order and ship dates, the product that was purchased, and the purchase price. Note that there are relationships between some of these columns --- in particular, the ship date is usually 1 or 2 days after the order date, and the price of a given product is usually consistent across orders, though there may be slight variations in price depending on what coupons the customer used to make the purchase.

A column-store would likely use “run-length encoding” to compress the order date column. Since records are sorted by order date, this would compress the column to its near-minimum --- it can be compressed as (03/22/2015, 9); (03/23/2015, 6) --- which indicates that 03/22/2015 is repeated 9 straight times, followed by 03/23/2015, which is repeated 6 times. The ship date column, although not sorted, is still very compressible, as each value can be expressed using a small number of bits in terms of how much larger (or smaller) it is than the previous value in the column. However, the other two columns --- product and price --- would likely be compressed using a variant of dictionary compression, where each value is mapped to the minimal number of bits needed to represent it. For large datasets, where there are many unique values for price (or even for product), the number of bits needed to represent a dictionary entry is non-trivial, and the same dictionary entry is repeated in the compressed dataset for every repeated value in the original dataset.
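
To make the two encodings concrete, here is a small sketch (illustrative code, not any vendor's implementation) of run-length encoding and dictionary encoding applied to the order date and product columns of the example table:

```python
# Minimal sketches of the two column-store encodings described above,
# applied to columns from the example table.
def run_length_encode(column):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(v, n) for v, n in runs]

def dictionary_encode(column):
    """Map each distinct value to a small integer code."""
    dictionary = {}
    codes = []
    for value in column:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

order_dates = ["03/22/2015"] * 9 + ["03/23/2015"] * 6
products = ["bicycle", "lock", "tire", "lock", "bicycle", "bicycle",
            "tire", "lock", "bicycle", "lock", "bicycle", "bicycle",
            "tire", "bicycle", "bicycle"]

print(run_length_encode(order_dates))  # [('03/22/2015', 9), ('03/23/2015', 6)]
print(dictionary_encode(products))     # ({'bicycle': 0, 'lock': 1, 'tire': 2}, [0, 1, 2, 1, 0, ...])
```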

In contrast, in RainStor, every unique value in the dataset is stored once (and only once), and every record is represented as a binary tree, where a breadth-first traversal of the tree enables the reconstruction of the original record. For example, the table shown above is compressed in RainStor using the forest of binary trees shown below. There are 15 binary trees (each of the 15 roots of these trees are shown using the green circles at the top of the figure), corresponding to the 15 records in the original dataset.

Forest of Binary Trees Compression

For example, the binary tree corresponding to record 1 is shown on the left side of the figure. The root points to two children --- the internal nodes “A” and “E”. In turn, node “A” points to 03/22/2015 (corresponding to the order date of record 1) and to 03/23/2015 (corresponding to the ship date of record 1). Node “E” points to “bicycle” (corresponding to the product of record 1) and “300” (corresponding to the price of record 1).

Note that records 4, 6, and 7 also have an order date of 03/22/2015 and a ship date of 03/23/2015. Therefore, the roots of the binary trees corresponding to those records also point to internal node “A”. Similarly, note that record 11 also is associated with the purchase of a bicycle for $300. Therefore, the root for record 11 also points to internal node “E”.

These shared internal nodes are what makes RainStor’s compression algorithm fundamentally different from any algorithm that a column-store is capable of performing. Column-stores are forced to create dictionaries and search for patterns only within individual columns. In contrast, RainStor’s compression algorithm finds patterns across different columns --- identifying the relationship between ship date and order date and the relationship between product and price, and leveraging these relationships to share branches in the trees that are formed, thereby eliminating redundant information. RainStor thus has fundamentally more room to search for patterns in the dataset and compress data by referencing these patterns via the (compressed) location of the root of the shared branch.
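
As a toy sketch of the node-sharing idea (an illustration only, not RainStor's actual implementation), the first five records of the example table can be stored as roots over deduplicated internal nodes:

```python
# Toy sketch of node sharing (not RainStor code): each record is a root over
# two internal nodes, and identical internal nodes -- the same (order date,
# ship date) or (product, price) pair -- are stored once and shared.
records = [
    ("03/22/2015", "03/23/2015", "bicycle", 300),
    ("03/22/2015", "03/24/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "tire", 70),
    ("03/22/2015", "03/23/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "bicycle", 250),
]

date_nodes = {}     # (order_date, ship_date) -> node id, shared across records
product_nodes = {}  # (product, price)        -> node id, shared across records
roots = []          # one root per record: (date_node_id, product_node_id)

for order_date, ship_date, product, price in records:
    d = date_nodes.setdefault((order_date, ship_date), len(date_nodes))
    p = product_nodes.setdefault((product, price), len(product_nodes))
    roots.append((d, p))

print(len(records), "records ->",
      len(date_nodes) + len(product_nodes), "shared internal nodes")

# Reconstructing a record is a traversal from its root back to the values:
date_by_id = {i: pair for pair, i in date_nodes.items()}
product_by_id = {i: pair for pair, i in product_nodes.items()}
d, p = roots[3]
print(date_by_id[d] + product_by_id[p])  # ('03/22/2015', '03/23/2015', 'lock', 18)
```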

For a traditional archiving solution, compression rate is arguably the most important feature (right up there with immutability). Indeed, RainStor’s compression algorithm enables it to be used for archival use-cases, and RainStor provides all of the additional features you would expect from an archiving solution: encryption, LDAP/AD/PAM/Kerberos/PCI authentication and security, audit trails and logging, retention rules, expiry policies, and integrated implementation of existing compliance standards (e.g. SEC 17a-4).

However, what brings RainStor to the next level in the archival solutions market is that it is an “active” archive, meaning that the data that is managed by RainStor can be queried at high performance. RainStor provides a mature SQL stack for native querying of compressed RainStor data, including ANSI SQL 1992 and 2003 parsers, and a full MPP query execution engine. For enterprises with Hadoop clusters, RainStor is fully integrated with the Cloudera and Hortonworks distributions of Hadoop --- RainStor compressed data files can be partitioned over an HDFS cluster, and queried in parallel with HiveQL (or MapReduce or Pig). Furthermore, RainStor integrates with YARN for resource management, with HCatalog for metadata management, and with Ambari for system monitoring and management.

The reason why most archival solutions are not “active” is that the compression algorithms used to reduce the data size before archival are so heavy-weight, that significant processing resources must be invested in decompressing the data before it can be queried. Therefore, it is preferable to leave the data archived in compressed form, and only decompress it at times of significant need. In general, a user should expect significant query performance reductions relative to querying uncompressed data, in order to account for the additional decompression time.

The beauty of RainStor’s compression algorithm is that even though it gets compression ratios comparable to other archival products, its compression algorithm is not so heavy-weight that the data must be decompressed prior to querying it. In particular, the binary tree structures shown above are actually fairly straightforward to perform query operations on directly, without requiring decompression prior to access. For example, a count distinct or a group-by operation can be performed via a scan of the leaves of the binary trees. Furthermore, selections can be performed via a reverse traversal of the binary trees from the leaves that match the selection predicate. In general, since there is a one-to-one mapping of records in the uncompressed dataset to the binary trees in RainStor’s compressed files, all query operations can be expressed in terms of operations on these binary trees. Therefore, RainStor queries can benefit from the I/O improvement of scanning in less data (due to the smaller size of the compressed files on disk/memory) without paying the decompression cost to fully decompress these compressed files after they are read from storage. This leads to RainStor’s claims of 2X-100X performance improvement on most queries --- an industry-leading claim in the archival market.
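
Continuing the toy sketch from above (again purely illustrative, reusing the `product_nodes` and `roots` structures built there), a count distinct and a selection can be answered directly against the shared nodes, without rebuilding the rows first:

```python
# Continuing the toy sketch: queries run against the shared nodes directly,
# with no up-front decompression of the records.

# SELECT COUNT(DISTINCT product): one pass over the product/price leaves.
distinct_products = {product for (product, _price) in product_nodes}
print(len(distinct_products))            # 3

# SELECT ... WHERE product = 'lock': find matching leaves, then the roots
# (records) that reference them -- a reverse traversal from leaf to root.
lock_node_ids = {i for (product, _price), i in product_nodes.items()
                 if product == "lock"}
matching_records = [rec for rec, (_d, p) in enumerate(roots, start=1)
                    if p in lock_node_ids]
print(matching_records)                  # records 2 and 4
```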

In short, RainStor’s strong claims around compression and performance are backed up by the technology that is used under the covers. Its compression algorithm is able to identify and remove redundancy both within and across columns. Furthermore, the resulting data structures produced by the algorithm are amenable to direct operation on the compressed data. This allows the compressed files to be queried at high performance, and positions RainStor as a leading active-archive solution.



Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

Real-Time SAP® Analytics: a look back and ahead

Posted on: August 18th, 2014 by Patrick Teunissen


On April 8, I hosted a webinar and my guest was Neil Raden, an independent data warehouse analyst. The topic of the webinar was “Accessing of SAP ERP data for business analytics purposes,” which built upon Neil’s findings in his recent white paper about the complexities of integrating SAP data into the enterprise data warehouse. The attendance and participation in the webinar clearly showed that there is a lot of interest and expertise in this space. As I think back on the questions we received, both Neil and I were surprised by how many of them related to “real-time analytics on SAP.”

Something has drastically changed in the SAP community!

Note: The topic of real time analytics is not new! I won’t forget Neil’s reaction when the questions came up. It was like he was in a time warp back to the early 2000’s when he first wrote about that topic. Interestingly, Neil’s work is still very relevant today.

This made me wonder: why is this so prominent in the SAP space now? What has changed in the SAP community? What has changed in the needs of the business?

My hypothesis is that when Neil originally wrote his paper (in 2003), R/3 was SAP (or SAP was R/3, whichever order you prefer) and integration with other applications or databases was not something SAP had on the radar yet. This began to change when SAP BW became more popular, and it gained even more traction with the release of SAP’s suite of tools and modules (CRM, SRM, BPC, MDM, etc.) -- although these solutions still clearly had the true SAP ‘Made in Germany’ DNA. Then came SAP’s planning tool APO, Netweaver XI (later PI) and the 2007 acquisition of Business Objects (including BODS), which all accelerated SAP’s application integration techniques.

With Netweaver XI/PI and Business Objects Data Services, it became possible to integrate SAP R/3 in real time, making use of advanced messaging techniques like IDocs, RFCs, and BAPIs. These techniques all work very well for transaction system integration (EAI); however, they do not have what it takes to provide real-time data feeds to the integrated data warehouse. At best a hybrid approach is possible. Back in 2000 my team worked on such a hybrid project at Hunter Douglas (Luxaflex), combining classical ABAP-driven batch loads for managerial reports with real-time capabilities (BAPI calls) for more operational reporting needs. That was state of the art in those days!

Finally, in 2010 SAP acquired Sybase and added a best-of-breed data replication tool to the portfolio. With this integration technique, changed data is captured directly from the database, taking the load off the R/3 application servers. This offers huge advantages, so it makes sense that this is now the recommended technique for loading data into the SAP HANA appliance.

“What has changed is that SAP has put the need for real-time data integration with R/3 on the (road) map!”

The main feature of our upcoming release of Teradata Analytics for SAP Solutions version 2.2 is a new data replication technique. As if designed to prove my case: 10 years ago I was in the middle of a project for a large multinational company when one of my lead engineers, Arno Luijten, came to me with a proposal to try out a data replication tool to address the latencies introduced by extracting large volumes of changed data from SAP. We didn’t get very far at the time, because the technology and the business expectations were not ready for it. Fast forward to 2014 and we’re re-engaged with this same customer …. Luckily, this time the business needs and the technology capabilities are ready to deliver!

In the coming months my team and I would like to take you on our SAP analytics journey.

In my next posts we will dive into the definition (and relativity) of real-time analytics and discuss the technical complexities of dealing with SAP including the pool and cluster tables. So, I hope I got you hooked for the rest of the series!

Garbage In-Memory, Expensive Garbage

Posted on: July 7th, 2014 by Patrick Teunissen


A first anniversary is always special, and in May I marked my first with Teradata. In my previous lives I celebrated almost ten years with Shell and seventeen years creating my own businesses focused on data warehousing and business intelligence solutions for SAP. With my last business, “NewFrontiers,” I leveraged all twenty-seven years of ERP experience to develop a shrink-wrapped solution to enable SAP analytics.

In all that time, through my first anniversary with Teradata, the logical design of SAP has remained the same. To be clear, when I say SAP, I mean R/3, or ‘R/2 with a mouse’ if you’re old enough to remember. Today R/3 is also known as the SAP Business Suite, ERP, or whatever. Anyway, when I talk about SAP I mean the application that made the company rightfully world famous and that is used for transaction processing by almost all large multinational businesses.

My core responsibility at Teradata is the engineering of the analytical solution for SAP. My first order of business was focusing my team on delivering an end-to-end business analytics product suite, optimized for Teradata, for analyzing ERP data. Since completing our first release, I have turned my attention to adding new features to help companies take their SAP analytics to the next level. To this end, my team is just putting the finishing touches on a near real-time capability based on data replication technology. This will definitely be the topic of upcoming blogs.

Over the past year, the integration and optimization process has greatly expanded my understanding of Teradata’s differentiated capabilities. The one capability that draws the attention of types like me, ‘SAP guys and girls,’ is Teradata Intelligent Memory. In-memory computing has become a popular topic in the SAP community, and the computer’s main memory is an important part of Teradata Intelligent Memory. However, Intelligent Memory is more than “in-memory,” because it addresses the fact that not all memory is created equal and delivers a solution that uses the right memory for the right purpose. In this solution, the most frequently used data, the hottest, is stored in memory; warm data is processed from solid state drives (SSD); and colder, less frequently accessed data from hard disk drives (HDD). This solution allows your business to make decisions on all of your SAP and non-SAP data while coupling in-memory performance with spinning-disk economics.

This concept of using the right memory for the right purpose is very compelling for our Teradata Analytics for SAP Solutions. Often when I explain what Teradata Analytics for SAP Solutions does, I draw a line between DATA and CONTEXT. Computers need DATA like cars need fuel, and the CONTEXT is where you drive the car. Most people do not go to the same place every time, but they do go to some places more frequently than others (e.g. work, freeways, coffee shops) and under more time pressure (e.g. traffic).

In this analogy, organizations almost always start building an “SAP data warehouse” by loading all DATA kept in the production database of the ERP system. We call that process the initial load. In the Teradata world we often have to do this multiple times, because building an integrated data warehouse usually involves sourcing from multiple SAP ERPs. Typically, these ERPs vary in age, history, version, governance, MDM, etc. Archival is a non-trivial process in the SAP world, and the majority of the SAP systems I have seen are carrying many years of old data. Loading all this SAP data in-memory is an expensive and reckless thing to do.

Teradata Intelligent Memory provides CONTEXT by storing the hot SAP data In-Memory, guaranteeing lightning fast response times. It then automatically moves the less frequently accessed data to lower cost and performance discs across the SSD and HDD media spectrum. The resulting combination of Teradata Analytics for SAP coupled with Teradata’s Intelligent Memory delivers in-memory performance with very high memory hit rates at a fraction of the cost of ‘In-Memory’ solutions. And in this business, costs are a huge priority.

The title of this Blog is a variation on the good old “Garbage In Garbage Out / GIGO” phrase; In-Memory is a great feature, but not all data needs to go there! Make use of it in an intelligent way and don’t use it as a garbage dump because for that it is too expensive.

Patrick Teunissen is the Engineering Director at Teradata responsible for the Research & Development of the Teradata Analytics for SAP® Solutions at Teradata Labs in the Netherlands. He is the founder of NewFrontiers which was acquired by Teradata in May 2013.

1 Needless to say I am referring to SAP’s HANA database developments.

2 Data that is older than 2 years can be classified as old. Transactions, like sales and costs, are often compared with a budget/plan and the previous year, sometimes with the year before that, but hardly ever with data older than that.

MongoDB and Teradata QueryGrid – Even Better Together

Posted on: June 19th, 2014 by Dan Graham


It wasn’t so long ago that NoSQL products were considered competitors with relational databases (RDBMS). Well, for some workloads they still are. But Teradata is an analytic RDBMS, which is quite different from and complementary to MongoDB. Hence, we are teaming up for the benefit of mutual customers.

The collaboration of MongoDB with Teradata represents a virtuous cycle, a symbiotic exchange of value. This virtuous cycle starts when data is exported from MongoDB to Teradata’s Data Warehouse where it is analyzed and enriched, then sent back to MongoDB to be exploited further. Let me give an example.

An eCommerce retailer builds a website to sell clothing, toys, etc. They use MongoDB because of the flexibility to manage constantly changing web pages, product offers, and marketing campaigns. This front office application exports JSON data to the back-office data warehouse throughout the business day. Automated processes analyze the data and enrich it, calculating next best offers, buyer propensities, consumer profitability scores, inventory depletions, dynamic discounts, and fraud detection. Managers and data scientists also sift through sales results looking for trends and opportunities using dashboards, predictive analytics, visualization, and OLAP. Throughout the day, the data warehouse sends analysis results back to MongoDB where they are used to enhance the visitor experience and improve sales. Then we do it again. It’s a cycle with positive benefits for the front and back office.

Teradata Data Warehouses have been used in this scenario many times with telecommunications, banks, retailers, and other companies. But several things are different working with MongoDB in this scenario. First, MongoDB uses JSON data. This is crucial to frequently changing data formats where new fields are added on a daily basis. Historically, RDBMS’s did not support semi-structured JSON data. Furthermore, the process of changing a database schema to support frequently changing JSON formats took weeks to get through governance committees.

Nowadays, the Teradata Data Warehouse ingests native JSON and accesses it through simple SQL commands. Furthermore, once a field in a table is defined as JSON, the frequently changing JSON structures flow right into the data warehouse without spending weeks in governance committees. Cool! This is a necessary big step forward for the data warehouse. Teradata Data Warehouses can ingest and analyze JSON data easily using any BI tool or ETL tool our customers prefer.
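
As a rough sketch of what that looks like in practice, the DDL and query strings below are illustrative only; the table, column names, maximum length, and dot-notation paths are invented for this example rather than taken from Teradata documentation.

```python
# Illustrative only: DDL and query strings showing the shape of JSON support
# described above. Table, column, and path names are made up.
CREATE_LANDING_TABLE = """
CREATE TABLE web_orders (
    order_id  BIGINT NOT NULL,
    order_doc JSON(1048576)          -- the MongoDB document, stored as-is
) PRIMARY INDEX (order_id);
"""

# New fields added to the JSON documents flow in without a schema change;
# queries reach into the document with dot notation.
NEXT_BEST_OFFER_QUERY = """
SELECT order_doc.customer.id  AS customer_id,
       order_doc.cart.total   AS cart_total
FROM   web_orders
WHERE  CAST(order_doc.cart.total AS DECIMAL(10,2)) > 100.00;
"""
```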

Another difference is that MongoDB is a scale-out system, growing to tens or hundreds of server nodes in a cluster. Hmmm. Teradata systems are also scale-out systems. So how would you exchange data between Teradata Data Warehouse server nodes and MongoDB server nodes? The simple answer is to export JSON to flat files and import them to the other system. Mutual customers are already doing this. Can we do better than import/export? Can we add an interactive dynamic data exchange? Yes, and this is the near-term goal of our partnership -- connecting Teradata QueryGrid to MongoDB clusters.

Teradata QueryGrid and MongoDB

Teradata QueryGrid is a capability in the data warehouse that allows a business user to issue requests via popular business intelligence tools such as SAS®, Tableau®, or MicroStrategy®. The user issues a query which runs inside the Teradata Data Warehouse. This query reaches across the network to the MongoDB cluster. JSON data is brought back, joined to relational tables, sorted, summarized, analyzed, and displayed to the business user. All of this is done exceptionally fast and completely invisible to the business user. It’s easy! We like easy.

QueryGrid can also be bi-directional, putting the results of an analysis back into the MongoDB server nodes. The two companies are working on hooking up Teradata QueryGrid right now and we expect to have the solution early in 2015.

The business benefit of connecting Teradata QueryGrid to MongoDB is that data can be exchanged in near real time. That is, a business user can run a query that exchanges data with MongoDB in seconds (or a few minutes if the data volume is huge). This means new promotions and pricing can be deployed from the data warehouse to MongoDB with a few mouse clicks. It means Marketing people can analyze consumer behavior on the retail website throughout the day, making adjustments to increase sales minutes later. And of course, applications with mobile phones, sensors, banking, telecommunications, healthcare and others will get value from this partnership too.

So why does the leading NoSQL vendor partner with the best in class analytic RDBMS? Because they are highly complementary solutions that together provide a virtuous cycle of value to each other. MongoDB and Teradata are already working together well in some sites. And soon we will do even better.

Come visit our Booth at MongoDB World and attend the session “The Top 5 Things to Know About Integrating MongoDB into Your Data Warehouse” Riverside Suite, 3:10 p.m., June 24. You can read more about the partnership between Teradata and MongoDB in this news release issued earlier today. Also, check out the MongoDB blog.

PS: The MongoDB people have been outstanding to work with on all levels. Kudos to Edouard, Max, Sandeep, Rebecca, and others. Great people!


It happens every few years and it’s happening again. A new technology comes along and a significant segment of the IT and business community want to toss out everything we’ve learned over the past 60 years and start fresh. We “discover” that we’ve been wasting time applying unnecessary rigor and bureaucracy to our projects. No longer should we have to wait three to six months or longer to deliver technical solutions to the business. We can turn these things around in three to six days or even less.

In the mid-1990s, I was part of a team that developed a “pilot” object-oriented, client-server (remember when these were the hot buzzwords?) application to replenish raw materials for a manufacturing function. We were upending the traditional mainframe world by delivering a solution quickly and iteratively with a small team. When the end users started using the application in real life, it was clear they were going to rely on it to do their jobs every day. Wait, was this a pilot or…? I would come into work in the morning, walk into a special room that housed the application and database servers, check the logs, note any errors, make whatever fixes needed to be made, re-run jobs, and so on.

It wasn’t long before this work began to interfere with my next project, and the end users became frustrated when I wasn’t available to fix problems quickly. It took us a while and several conversations with operations to determine that “production” didn’t just mean “the mainframe”. “Production” meant that people were relying on the solution on a regular basis to do their jobs. So we backtracked and started talking about what kind of availability guarantees we could make, how backup and recovery should work, how we could transition monitoring and maintenance to operations, and so on. In other words, we realized what we needed was a traditional IT project that just happened to leverage newer technologies.

This same scenario is happening today with Hadoop and related tools. When I visit client organizations, a frightening number will have at least one serious person saying something like, “I really don’t think ‘data warehousing’ makes sense any more. It takes too long. We should put all our data in Hadoop and let our end users access whatever they want.” It is indeed a great idea to establish an environment that enables exploration and quick-turnaround analysis against raw data and production data. But to position this approach as a core data and analytics strategy is nothing short of professional malpractice.

The problem is that people are confusing experimentation with IT projects. There is a place for both, and there always has been. Experimentation (or discovery, research, ad-hoc analysis, or whatever term you wish to use) should have lightweight processes and data management practices – it requires prioritization of analysis activity, security and privacy policies and implementation, some understanding of available data, and so on, but it should not be overburdened with the typical rigor required of projects that are building solutions destined for production. Once a prototype is ready to be used on a regular basis for important business functions, that solution should be built through a rigorous IT project leveraging an appropriate – dare I say it – solution development life cycle (SDLC), along with a comprehensive enterprise architecture plan including, yes, a data warehouse that provides integrated, shared, and trusted production data.

An experimental prototype should never be “promoted” to a production environment. That’s what a project is for. Experimentation can be accomplished with Hadoop, relational technology, Microsoft Office, and many other technologies. These same technologies can also be used for production solutions. So, it’s not that “things are done differently and more quickly in Hadoop”. Instead, it’s more appropriate to say that experimentation is different than an IT project, regardless of technology.

Yes, we should do everything we can to reduce unnecessary paperwork and to speed up delivery using proper objective setting, scoping, and agile development techniques. But that is different than abandoning rigor altogether. In fact, using newer technologies in IT projects requires more attention to detail, not less, because we have to take the maturity of the technology into consideration. Can it meet the service level needs of a particular solution? This needs to be asked and examined formally within the project.

Attempting to build production solutions using ad-hoc, experimental data preparation and analysis techniques is like building a modern skyscraper with a grass hut mentality. It just doesn’t make any sense.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries.