
Why We Love Presto

Posted on: June 24th, 2015 by Daniel Abadi

 

Concurrent with acquiring Hadoop companies Hadapt and Revelytix last year, Teradata opened the Teradata Center for Hadoop in Boston. Teradata recently announced that a major new initiative of this Hadoop development center will include open-source contributions to a distributed SQL query engine called Presto. Presto was originally developed at Facebook, and is designed to run high performance, interactive queries against Big Data wherever it may live --- Hadoop, Cassandra, or traditional relational database systems.

The people who will be part of this initiative and contributing code to Presto include a subset of the Hadapt team that joined Teradata last year. In what follows, we dive deeper into the thinking behind this new initiative from the perspective of the Hadapt team. It is important to note upfront that Teradata’s interest in Presto, and the people contributing to the Presto codebase, extend beyond that team. Nonetheless, it is worthwhile to understand the technical reasoning behind Teradata’s embrace of Presto, even if it presents a localized view of the overall initiative.

Around seven years ago, Ashish Thusoo and his team at Facebook built the first SQL layer over Hadoop as part of a project called Hive. At its essence, Hive was a query translation layer over Hadoop: it received queries in a SQL-like language called HiveQL and transformed them into a set of MapReduce jobs over data stored in HDFS on a Hadoop cluster. Hive was truly the first project of its kind. However, because it focused on query translation into Hadoop’s existing MapReduce execution engine, it inherited MapReduce’s tremendous scalability but also its poor efficiency and performance, which ultimately led to a series of subsequent SQL-on-Hadoop solutions that claimed 100X speed-ups over Hive.

Hadapt was the first such SQL-on-Hadoop solution that claimed a 100X speed-up over Hive on certain types of queries. Hadapt was spun out of the HadoopDB research project from my team at Yale and was founded by a group of Yale graduates. The basic idea was to develop a hybrid system that is able to achieve the fault-tolerant scalability of the Hive MapReduce query execution engine while leveraging techniques from the parallel database system community to achieve high performance query processing.

The intention of HadoopDB/Hadapt was never to build its own query execution layer. The first version of Hadapt used a combination of PostgreSQL and MapReduce for distributed query execution. In particular, the query operators that could be run locally, without reliance on data located on other nodes in the cluster, were run using PostgreSQL’s query operator set (although Hadapt was written such that PostgreSQL could be replaced by any performant single-node database system). Meanwhile, query operators that required data exchange between multiple nodes in the cluster were run using Hadoop’s MapReduce engine.

Although Hadapt was 100X faster than Hive for long, complicated queries that involved hundreds of nodes, its reliance on Hadoop MapReduce for parts of query execution precluded sub-second response time for small, simple queries. Therefore, in 2012, Hadapt started to build a secondary query execution engine called “IQ” which was intended to be used for smaller queries. The idea was that all queries would be fed through a query-analyzer layer before execution. If the query was predicted to be long and complex, it would be fed through Hadapt’s original fault-tolerant MapReduce-based engine. However, if the query would complete in a few seconds or less, it would be fed to the IQ execution engine.

In 2013 Hadapt integrated IQ with Apache Tez in order to avoid redundant engineering effort, since the primary goals of IQ and Tez were aligned. In particular, Tez was designed as an alternative to MapReduce that can achieve interactive performance for general data processing applications. Indeed, Hadapt was able to achieve interactive performance on a much wider range of queries when leveraging Tez than it could previously.

Figure 1: Intertwined Histories of SQL-on-Hadoop Technology

Unfortunately, Tez was not quite a perfect fit as a query execution engine for Hadapt’s needs. The largest issue was that before shipping data over the network during distributed operators, Tez first writes this data to local disk. The overhead of writing this data to disk (especially when the intermediate result set was large) precluded interactivity for a non-trivial subset of Hadapt’s query workload. A second problem is that the Hive query operators implemented over Tez use (by default) traditional Volcano-style row-by-row iteration. In other words, a single function invocation for a query operator processes just a single database record. This results in a large number of function calls to process a large dataset, and in poor instruction cache locality, as the instructions associated with a particular operator are repeatedly reloaded into the instruction cache for each function invocation. Although Hive and Tez have started to alleviate this issue with the recent introduction of vectorized operators, Hadapt still found that query plans involving joins or SQL functions would fall back to row-by-row iteration.

The Hadapt team therefore decided to refocus its query execution strategy (for the interactive portion of Hadapt’s engine) on Presto, which presented several advantages over Tez. First, Presto pipelines data between distributed query operators directly, without writing to local disk, significantly improving performance for network-intensive queries. Second, Presto query operators are vectorized by default, thereby improving CPU efficiency and instruction cache locality. Third, Presto dynamically compiles selective query operators to byte code, which lets the JVM optimize and generate native machine code. Fourth, it uses direct memory management, thereby avoiding Java object allocation, heap memory overhead, and garbage collection pauses. Overall, Presto is a very advanced piece of software, and very much in line with Hadapt’s goal of leveraging as many techniques from modern parallel database system architecture as possible.
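
To illustrate the difference between Volcano-style row-at-a-time iteration and vectorized, batch-at-a-time processing described above, here is a minimal conceptual sketch in Python. It is illustrative only --- the data, function names, and batch size are made up, and Presto’s real operators are Java code working over columnar pages --- but the control-flow difference, and why it matters for function-call overhead and cache locality, is the same.

```python
# Conceptual sketch: row-at-a-time vs. batch-at-a-time ("vectorized")
# predicate evaluation. Data, names, and batch size are illustrative.

ROWS = [{"price": p} for p in range(100_000)]

def row_at_a_time(rows, threshold):
    """Volcano-style iteration: one function invocation per record."""
    out = []
    for row in rows:                    # per-row call overhead dominates
        if row["price"] > threshold:    # predicate re-evaluated row by row
            out.append(row)
    return out

def vectorized(column, threshold, batch_size=1024):
    """Batch-at-a-time: one invocation processes a whole vector of values."""
    selected = []
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]
        # Tight loop over a contiguous batch: better instruction-cache and
        # data locality, and amenable to SIMD in a compiled engine.
        selected.extend(start + i for i, v in enumerate(batch) if v > threshold)
    return selected

prices = [row["price"] for row in ROWS]
assert len(vectorized(prices, 90_000)) == len(row_at_a_time(ROWS, 90_000))
```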

The Teradata Center for Hadoop has thus fully embraced Presto as the core of its technology strategy for executing interactive queries over Hadoop. Consequently, it made logical sense for Teradata to take its involvement in Presto to the next level. Furthermore, Hadoop is fundamentally an open source project, and in order to become a significant player in the Hadoop ecosystem, Teradata needs to contribute meaningful and important code to the open source community. Teradata’s recent acquisition of Think Big serves as further motivation for such contributions.

Therefore, Teradata has announced that it is committed to making open source contributions to Presto, and has allocated substantial resources to doing so. Presto is already used by Silicon Valley stalwarts Facebook, Airbnb, Netflix, Dropbox, and Groupon. However, Presto’s enterprise adoption outside of Silicon Valley remains small. Part of the reason is that the ease-of-use and enterprise features typically associated with modern commercial database systems are not fully available with Presto. Missing are an out-of-the-box, simple-to-use installer, database monitoring and administration tools, and third-party integrations. Therefore, Teradata’s initial contributions will focus on these areas, with the goal of bridging the gap to getting Presto widely deployed in traditional enterprise applications. This will hopefully lead to more contributors and momentum for Presto.

For now, Teradata’s new commitments to open source contributions in the Hadoop ecosystem are focused on Presto. Teradata is only committing to contribute a small amount of Hadapt code to open source --- in particular those parts that will further the immediate goal of transforming Presto into an enterprise-ready, easy-to-deploy piece of software. However, Teradata plans to monitor Presto’s progress and the impact of Teradata contributions. Teradata may ultimately decide to contribute more parts of Hadapt to the Hadoop open source community. At this point it is too early to speculate how this will play out.

Nonetheless, Teradata’s commitment to Presto and its commitment to making meaningful contributions to an open source project is an exciting development. It will likely have a significant impact on enterprise-adoption of Presto. Hopefully, Presto will become a widely used open source parallel query execution engine --- not just within the Hadoop community, but due to the generality of its design and its storage layer agnosticism, for relational data stored anywhere.

==================================================================================

Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

Agile Data Warehousing Meets Agile Technology

Posted on: April 20th, 2015 by Guest Blogger

 

By Youko Watari, Technical Marketing Manager

Agile development and its 12 principles[1] have been adopted everywhere, and data warehousing is no exception. The Agile data warehousing publications out there mostly focus on how to adopt Agile tools and techniques such as the backlog, user stories, and iterations into the data warehouse development process, or how to adopt Agile in the data model design process. One key aspect I see missing in them, however, is how important it is to choose the right data warehousing technologies in order to maximize success in Agile data warehousing.

Considering the technology aspect of data warehousing is important because a successful Agile transformation must occur across the organization as a whole. I believe that in order to adopt Agile development for your data warehousing, your foundational infrastructure, including hardware and database software, must also be agile. A few obvious examples of agile-ready technologies are cloud technology and Software-as-a-Service (SaaS).

Another key technology trend of late that is very much in line with the spirit of Agile is software-defined anything (SDx), which provides dynamic and faster ways of configuring and delivering systems. I believe using software-defined technology within your data warehouse environment is well suited for supporting the collaborative, fast-paced, and time-boxed Agile development framework. Looking back at some of my past analytics and data warehousing projects, I can see that software-defined warehouse capabilities can solve some of the common infrastructure challenges we all may be familiar with, such as:

  • “We can’t use the Production server and its data as a sandbox or for prototyping.” Usually, if you want to explore new possibilities in data analysis by combining existing production data with new, untested data sources, the development team needs to request a new sandbox and go through the pain of loading the data. Instead, why not use a software-controlled data lab capability and carve out a sandbox within the Production system? This will shorten the design and development time and increase the level of collaboration between the development team and stakeholders, since the data they use and see is real. This technology could also be very useful if the team is running a risk-based spike in which new and exploratory analytics are tested in order to reduce late failures.

 

  • “We can’t start development for a few months because it will take a long time to provision a new development server.” A common approach for setting up a new development or test environment is to provision a physical server dedicated to this purpose. This may be an easier way to prevent security and access violations of Production data or resource contention with Production users. The drawbacks of this approach, however, are that it undermines the timeliness needed for Agile development and that it is not cost effective. Instead, consider creating secure development and test environments with fine-grained resource and workload management capability within the existing warehousing system, using software controls. This will free the project from unnecessary logistical constraints in the project timeline.

 

  • “We need to purchase a powerful system now so that it will accommodate all the users and usage down the road, but we are cost strapped.” One of the key benefits of adopting Agile is being able to realize faster ROI compared with traditional waterfall-style projects, because it allows faster delivery of business value in small increments. The data warehouse system should also adopt this approach and achieve lower TCO and faster ROI by using a database technology that allows system capacity to be delivered just in time, with scalability that is graceful and linear. At the same time, software controls should allow CPU, I/O, and task priorities to be managed at a finer level so that the user experience stays consistent through the growth. An example of this would be setting a minimum response time for workloads so that the user perception of performance stays consistent regardless of resource availability and system growth.

Agile data warehousing is not a pipe dream. However, you must equip your team with the right understanding of the Agile framework as well as with the right technology that is primed for agility. Agility in data warehousing is achievable today through the software-defined warehouse.

Youko Watari is the Technical Marketing Manager at Teradata responsible for Teradata Database and other core software products. She is a certified Project Management Professional (PMP®), PMI-Agile Certified Practitioner (PMI-ACP) ®, and Certified Scrum Product Owner® (CSPO).

[1] “The Agile Manifesto” © 2001, Kent Beck, et al (http://www.agilealliance.org/the-alliance/the-agile-manifesto/)

 

 

It is well-known that there are two extreme alternatives for storing database tables on any storage medium: storing them row-by-row (as done by traditional “row-store” technology) or storing them column-by-column (as done by recently popular “column-store” implementations). Row-stores store the entire first row of the table, followed by the entire second row of the table, etc. Column-stores store the entire first column of the table, followed by the entire second column of the table, etc. There is a large body of research literature and commercial whitepapers discussing the various advantages of these alternative approaches, along with various proposals for hybrid solutions (which I discussed in more detail in my previous post).

Despite the many conflicting arguments in favor of these different approaches, there is little question that column-stores compress data much better than row-stores. The reason is fairly intuitive: in a column-store, entire columns are stored contiguously --- in other words, a series of values from the same attribute domain are stored consecutively. In a row-store, values from different attribute domains are interspersed, thereby reducing the self-similarity of the data. In general the more self-similarity (lower entropy) you have in a dataset, the more compressible it is. Hence, column-stores are more compressible than row-stores.

In general, compression rates are very sensitive to the particular dataset that is being compressed. Therefore it is impossible to make any kind of guarantees about how much a particular database system/compression algorithm will compress an arbitrary dataset. However, as a general rule of thumb, it is reasonable to expect around 8X compression if a column-store is used on many kinds of datasets. 8X compression means that the compressed dataset is 1/8th the original size, and scan-based queries over the dataset can thus proceed approximately 8 times as fast. This stellar compression and resulting performance improvements are a major contributor to the recent popularity of column-stores.

It is precisely this renowned compression of column-stores which makes the compression rate of RainStor (a recent Teradata acquisition) so impressive in comparison. RainStor claims a factor of 5 times more compression than what column-stores are able to achieve on the same datasets, and 40X compression overall.

Although the reason why column-stores compress data better than row-stores is fairly intuitive, the reason why RainStor can compress data better than column-stores is less intuitive. Therefore, we will now explain this in more detail.

Take for example the following table, which is a subset of a table describing orders from a particular retail enterprise that sells bicycles and related parts. (A real table would have many more rows and columns, but we keep this example simple so that it is easier to understand what is going on).

Record | Order date | Ship date  | Product   | Price
1      | 03/22/2015 | 03/23/2015 | “bicycle” | 300
2      | 03/22/2015 | 03/24/2015 | “lock”    | 18
3      | 03/22/2015 | 03/24/2015 | “tire”    | 70
4      | 03/22/2015 | 03/23/2015 | “lock”    | 18
5      | 03/22/2015 | 03/24/2015 | “bicycle” | 250
6      | 03/22/2015 | 03/23/2015 | “bicycle” | 280
7      | 03/22/2015 | 03/23/2015 | “tire”    | 70
8      | 03/22/2015 | 03/23/2015 | “lock”    | 18
9      | 03/22/2015 | 03/24/2015 | “bicycle” | 280
10     | 03/23/2015 | 03/24/2015 | “lock”    | 18
11     | 03/23/2015 | 03/25/2015 | “bicycle” | 300
12     | 03/23/2015 | 03/24/2015 | “bicycle” | 280
13     | 03/23/2015 | 03/24/2015 | “tire”    | 70
14     | 03/23/2015 | 03/25/2015 | “bicycle” | 250
15     | 03/23/2015 | 03/25/2015 | “bicycle” | 280

 

The table contains 15 records and shows four attributes: the order date and ship date of a particular order, the product that was purchased, and the purchase price. Note that there are relationships between some of these columns --- in particular, the ship date is usually 1 or 2 days after the order date, and the prices of the various products are usually consistent across orders, though there may be slight variations in price depending on what coupons the customer used to make the purchase.

A column-store would likely use “run-length encoding” to compress the order date column. Since records are sorted by order date, this would compress the column to near its minimum --- it can be compressed as (03/22/2015, 9); (03/23/2015, 6), which indicates that 03/22/2015 is repeated 9 straight times, followed by 03/23/2015, which is repeated 6 times. The ship date column, although not sorted, is still very compressible, as each value can be expressed using a small number of bits in terms of how much larger (or smaller) it is than the previous value in the column. However, the other two columns --- product and price --- would likely be compressed using a variant of dictionary compression, where each value is mapped to the minimal number of bits needed to represent it. For large datasets, where there are many unique values for price (or even for product), the number of bits needed to represent a dictionary entry is non-trivial, and the same dictionary entry is repeated in the compressed dataset for every repeated value in the original dataset.
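
As a concrete (and deliberately simplified) illustration of the two techniques just described, the sketch below run-length encodes the sorted order date column and dictionary encodes the product column from the example table. These are generic textbook encodings, not any particular vendor’s on-disk format.

```python
# Simplified column-store compression of two columns from the example table.

order_date = ["03/22/2015"] * 9 + ["03/23/2015"] * 6
product = ["bicycle", "lock", "tire", "lock", "bicycle", "bicycle", "tire",
           "lock", "bicycle", "lock", "bicycle", "bicycle", "tire",
           "bicycle", "bicycle"]

def run_length_encode(values):
    """(value, run_length) pairs -- very effective on a sorted column."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

def dictionary_encode(values):
    """Map each distinct value to a small integer code. The code is repeated
    once per row, so repeated values still cost bits in the encoded column."""
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return dictionary, [dictionary[v] for v in values]

print(run_length_encode(order_date))  # [('03/22/2015', 9), ('03/23/2015', 6)]
codebook, codes = dictionary_encode(product)
print(codebook)                       # {'bicycle': 0, 'lock': 1, 'tire': 2}
print(codes)                          # one code per row: [0, 1, 2, 1, 0, ...]
```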

In contrast, in RainStor, every unique value in the dataset is stored once (and only once), and every record is represented as a binary tree, where a breadth-first traversal of the tree enables the reconstruction of the original record. For example, the table shown above is compressed in RainStor using the forest of binary trees shown below. There are 15 binary trees (each of the 15 roots of these trees is shown using the green circles at the top of the figure), corresponding to the 15 records in the original dataset.

Forest of Binary Trees Compression

For example, the binary tree corresponding to record 1 is shown on the left side of the figure. The root points to two children --- the internal nodes “A” and “E”. In turn, node “A” points to 03/22/2015 (corresponding to the order date of record 1) and to 03/23/2015 (corresponding to the ship date of record 1). Node “E” points to “bicycle” (corresponding to the product of record 1) and to “300” (corresponding to the price of record 1).

Note that records 4, 6, and 7 also have an order date of 03/22/2015 and a ship date of 03/23/2015. Therefore, the roots of the binary trees corresponding to those records also point to internal node “A”. Similarly, note that record 11 also is associated with the purchase of a bicycle for $300. Therefore, the root for record 11 also points to internal node “E”.

These shared internal nodes are what makes RainStor’s compression algorithm fundamentally different from any algorithm that a column-store is capable of performing. Column-stores are forced to create dictionaries and search for patterns only within individual columns. In contrast, RainStor’s compression algorithm finds patterns across different columns --- identifying the relationship between ship date and order date and the relationship between product and price, and leveraging these relationships to share branches in the trees that are formed, thereby eliminating redundant information. RainStor thus has fundamentally more room to search for patterns in the dataset and compress data by referencing these patterns via the (compressed) location of the root of the shared branch.
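
The sketch below mimics this construction at a very high level: each record becomes a small tree whose internal nodes are deduplicated across records, so repeated (order date, ship date) and (product, price) pairs are stored only once. This is a conceptual illustration only --- RainStor’s actual node encoding and on-disk format are proprietary and certainly differ --- but it shows why shared branches eliminate redundancy that per-column dictionaries cannot.

```python
# Conceptual sketch of "forest of binary trees" compression with shared nodes.
# Each record (order date, ship date, product, price) becomes a root pointing
# to two internal nodes: one for the date pair, one for the (product, price)
# pair. Identical pairs across records share the same node.

records = [
    ("03/22/2015", "03/23/2015", "bicycle", 300),
    ("03/22/2015", "03/24/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "tire", 70),
    ("03/22/2015", "03/23/2015", "lock", 18),
    # ... remaining rows of the example table
]

node_pool = {}   # (left value, right value) -> shared node id
roots = []       # one root per record: (date node id, product/price node id)

def intern(pair):
    """Return the shared node id for a value pair, creating it only once."""
    if pair not in node_pool:
        node_pool[pair] = len(node_pool)
    return node_pool[pair]

for order_d, ship_d, prod, price in records:
    date_node = intern((order_d, ship_d))   # plays the role of node "A"
    prod_node = intern((prod, price))       # plays the role of node "E"
    roots.append((date_node, prod_node))

# Records 1 and 4 share the same (order date, ship date) pair, so the pair is
# stored once and both roots point to the same internal node:
assert roots[0][0] == roots[3][0]

# Unique values/pairs live only in node_pool, which queries can also scan
# directly, e.g. counting distinct (product, price) combinations
# (crude filter for this sketch: price leaves are integers):
print(sum(1 for pair in node_pool if isinstance(pair[1], int)))
```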

For a traditional archiving solution, compression rate is arguably the most important feature (right up there with immutability). Indeed, RainStor’s compression algorithm enables it to be used for archival use-cases, and RainStor provides all of the additional features you would expect from an archiving solution: encryption, LDAP/AD/PAM/Kerberos/PCI authentication and security, audit trails and logging, retention rules, expiry policies, and integrated implementation of existing compliance standards (e.g. SEC 17a-4).

However, what brings RainStor to the next level in the archival solutions market is that it is an “active” archive, meaning that the data managed by RainStor can be queried with high performance. RainStor provides a mature SQL stack for native querying of compressed RainStor data, including ANSI SQL 1992 and 2003 parsers and a full MPP query execution engine. For enterprises with Hadoop clusters, RainStor is fully integrated with the Cloudera and Hortonworks distributions of Hadoop --- RainStor compressed data files can be partitioned over an HDFS cluster and queried in parallel with HiveQL (or MapReduce or Pig). Furthermore, RainStor integrates with YARN for resource management, with HCatalog for metadata management, and with Ambari for system monitoring and management.

The reason most archival solutions are not “active” is that the compression algorithms used to reduce the data size before archival are so heavy-weight that significant processing resources must be invested in decompressing the data before it can be queried. Therefore, it is preferable to leave the data archived in compressed form and only decompress it at times of significant need. In general, a user should expect a significant reduction in query performance relative to querying uncompressed data, to account for the additional decompression time.

The beauty of RainStor’s compression algorithm is that even though it achieves compression ratios comparable to other archival products, it is not so heavy-weight that the data must be decompressed prior to querying it. In particular, the binary tree structures shown above are actually fairly straightforward to perform query operations on directly, without requiring decompression prior to access. For example, a count distinct or a group-by operation can be performed via a scan of the leaves of the binary trees. Furthermore, selections can be performed via a reverse traversal of the binary trees from the leaves that match the selection predicate. In general, since there is a one-to-one mapping between records in the uncompressed dataset and the binary trees in RainStor’s compressed files, all query operations can be expressed in terms of operations on these binary trees. Therefore, RainStor queries can benefit from the I/O improvement of scanning in less data (due to the smaller size of the compressed files on disk or in memory) without paying the cost of fully decompressing these compressed files after they are read from storage. This leads to RainStor’s claims of 2X-100X performance improvement on most queries --- an industry-leading claim in the archival market.

In short, RainStor’s strong claims around compression and performance are backed up by the technology that is used under the covers. Its compression algorithm is able to identify and remove redundancy both within and across columns. Furthermore, the resulting data structures produced by the algorithm are amenable to direct operation on the compressed data. This allows the compressed files to be queried at high performance, and positions RainStor as a leading active-archive solution.

_________________________________________________________________________


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

 

PART FIVE: This is the last blog in my series about Near Real Time data acquisition from SAP. This final blog is co-written with Arno Luijten, who is one of Teradata’s lead engineers. He is instrumental in demystifying the secrets of the elusive SAP clustered and pooled tables.

There is a common misconception that the Pool and Cluster tables in SAP R/3 can only be deciphered by the SAP R/3 application server, giving them an almost mythical status. The phrase that is used all over the ‘help sites’ and forums is “A clustered and a pooled table cannot be read from outside SAP because certain data are clustered and pooled in one field”… which makes replicating these tables pretty pointless – right?

But what exactly are Pool and Cluster tables in SAP R/3 anyway? We thought we would let SAP give us the answer and searched their public help pages (SAP Help Portal), but that yielded limited results, so we looked further -- we Googled “cluster tables” -- and found the following explanation (Technopedia link):

“Cluster tables are special types of tables present in the SAP data dictionary. They are logical tables maintained as records of the normal SAP tables, which are commonly known as transparent tables. A key advantage of using cluster tables is that data is stored in a compressed format, reducing memory space and the landscape network load for retrieving information from these tables.”

Reading further on the same page, there are six major bullet points describing the features, five of which basically tell you that what we did cannot be done. Luckily, we didn’t let this faze us!

We agree: the purpose of SAP cluster tables is to save space, given the huge volume of data contained in these tables and the potential negative impact this may have on the SAP R/3 application. We know this because the two most (in)famously large Cluster tables are RFBLG and KOCLU, which contain the financial transactions and price conditions. SAP’s ABAP programmers often refer to BSEG (for financial transactions) and KONV (for the price conditions).

From the database point of view, these tables do not exist but are contained in the physical tables named RFBLG and KOCLU. Typically these (ABAP) tables contain a lot of data. There are more tables set up in this way, but from a data warehousing point of view these two tables are probably the most relevant. Simply skipping these tables would not be a viable option for most clients.

Knowing the importance of the Pool and Cluster tables, the value of data replication, and the value of operational analytics, we forged ahead with a solution. The encoded data from the ABAP table is stored as a BLOB (Binary Large Object) in the actual cluster table. To decode the data in the BLOB, we wrote a C++ program as a Teradata User Defined Function (UDF), which we call the “Decoder,” and which is installed directly within the Teradata database.

There can be a huge volume of data present in the cluster tables (hence the usage of cluster logic) and as a result decoding can be a lot of work and can have an impact on the performance of the SAP system. Here we have an extra advantage over SAP R/3 because the Decoder effectively allows us to bypass the ABAP layer and use the power of the Teradata Database. Our MPP capabilities allow decoding to be done massively faster than the SAP application, so decoding the RFBLG/KOCLU tables in Teradata can save a lot of time.

Over the last few months I have written about data replication, starting with a brief SAP history; I have questioned real-time systems; and I have written about the benefits of data replication and how it is disruptive to analytics for SAP R/3.

In my last blog I looked at the technical complexities we have had to overcome to build a complete data replication solution into Teradata Analytics for SAP® Solutions. It has not been a trivial exercise - but the benefits are huge!

Our Data Replication capability enables operational reporting and managerial analytics from the same source; it increases flexibility, significantly reduces the burden on the SAP R/3 system(s), and of course, delivers SAP data in near-real time for analytics.

Hybrid Row-Column Stores: A General and Flexible Approach

Posted on: March 10th, 2015 by Daniel Abadi

 

During a recent meeting with a post-doc in my lab at Yale, he reminded me that this summer will mark the 10-year anniversary of the publication of C-Store in VLDB 2005. C-Store was by no means the first ever column-store database system (the column-store idea has been around since the 70s --- nearly as long as relational database systems), but it was quite possibly the first proposed architecture of a column-store designed for petabyte-scale data analysis. The C-Store paper has been extremely influential, with close to every major database vendor developing column-oriented extensions to their core database product in the past 10 years, with most of them citing C-Store (along with other influential systems) in their corresponding research white-papers about their column-oriented features.

Given my history with the C-Store project, I surprised a lot of people when some of my subsequent projects such as HadoopDB/Hadapt did not start with a column-oriented storage system from the beginning. For example, industry analyst Curt Monash repeatedly made fun of me on this topic (see, e.g. http://www.dbms2.com/2012/10/16/hadapt-version-2/).

In truth, my love and passion for column-stores has not diminished since 2005. I still believe that every analytical database system should have a column-oriented storage option. However, it is naïve to think that column-oriented storage is always the right solution. For some workloads --- especially those that scan most rows of a table but only a small subset of the columns --- column-stores are clearly preferable. On the other hand, there are many workloads that contain very selective predicates and require access to the entire tuple for those rows that pass the predicate. For such workloads, row-stores are clearly preferable.

There is thus general consensus in the database industry that a hybrid approach is needed. A database system should have both column-oriented and row-oriented storage options, and the optimal storage can be utilized depending on the expected workload.

Despite this consensus around the general idea of the need for a hybrid approach, there is a glaring lack of consensus about how to implement the hybrid approach. There have been many different proposals for how to build hybrid row/column-oriented database systems in the research and commercial literature. A sample of such proposals include:

(1) A fractured mirrors approach where the same data is replicated twice --- once in a column-oriented storage layer and once in a row-oriented storage layer. For any particular query, data is extracted from the optimal storage layer for that query, and processed by the execution engine.
(2) A column-oriented simulation within a row-store. Let’s say table X contains n columns. X gets replaced by n new tables, where each new table contains two columns --- (1) a row-identifier column and (2) the column values for one of the n columns in the original table. Queries are processed by joining together on the fly the particular set of these two-column tables that correspond to the columns accessed by that query (a minimal sketch of this approach appears after this list).
(3) A “PAX” approach where each page/block of data contains data for all columns of a table, but data is stored column-by-column within the page/block.
(4) A column-oriented index approach where the base data is stored in a row-store, but column-oriented storage and execution can be achieved through the use of indexes.
(5) A table-oriented hybrid approach where a database administrator (DBA) is given a choice to store each table row-by-row or column-by-column, and the DBA makes a decision based on how they expect the tables to be used.
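
To make approach (2) above concrete, here is a minimal sketch of simulating column-oriented storage inside a row-store by vertical partitioning. The table, column names, and query helper are made up for illustration; this is the generic textbook construction, not any particular vendor’s implementation.

```python
# Approach (2): simulate a column-store inside a row-store by splitting a
# table into one narrow (row_id, value) table per column and re-joining the
# needed columns on the fly. Table and column names are made up.

orders = [
    {"row_id": 1, "order_date": "03/22/2015", "product": "bicycle", "price": 300},
    {"row_id": 2, "order_date": "03/22/2015", "product": "lock",    "price": 18},
    {"row_id": 3, "order_date": "03/23/2015", "product": "tire",    "price": 70},
]

# "Storage": one two-column table per original column, keyed by row_id.
column_tables = {
    col: {row["row_id"]: row[col] for row in orders}
    for col in ("order_date", "product", "price")
}

def query(columns):
    """Reassemble only the requested columns by joining the narrow tables on
    row_id; columns the query does not touch are never read."""
    row_ids = column_tables[columns[0]].keys()
    return [{"row_id": rid, **{c: column_tables[c][rid] for c in columns}}
            for rid in row_ids]

# A query touching only product and price reads two of the narrow tables:
print(query(["product", "price"]))
```
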
In the rest of this post, I will overview Teradata’s elegant hybrid row/column-store design and attempt to explain why I believe it is more flexible than the above-mentioned approaches.

The flexibility of Teradata’s approach is characterized by three main contributions.

1: Teradata views the row-store vs. column-store debate as two extremes in a more general storage option space.

The row-store extreme stores each row contiguously on storage, and the column-store extreme stores each column contiguously on storage. In other words, row-stores maintain locality of horizontal access of a table, and column-stores maintain locality of vertical access of a table. In general, however, the optimal access locality could be on a rectangular region of a table.


Figure 1: Row and Column Stores (uncompressed)

To understand this idea, take the following example. Many workloads have frequent predicates on date attributes. By partitioning the rows of a table according to date (e.g. one partition per day, week, month, quarter, or year), those queries that contain predicates on date can be accelerated by eliminating all partitions corresponding to dates outside the range of the query, thereby efficiently utilizing I/O to read in data from the table from only those partitions that have data matching the requested data range.

However, different queries may analyze different table attributes for a given date range. For example, one query may examine the total revenue brought in per store in the last quarter, while another query may examine the most popular pairs of widgets bought together in each product category in the last quarter. The optimal storage layout for such queries would be to have store and revenue columns stored together in the same partition, and to have product and product category columns stored together in the same partition. Therefore we want both column-partitions (store and revenue in one partition and product and product category in a different partition) and row-partitions (by date).

This arbitrary partitioning of a table by both rows and columns results in a set of rectangular partitions, each partition containing a subset of rows and columns from the original table. This is far more flexible than a “pure” column-store that enforces that each column be stored in a different physical or virtual partition.

Note that allowing arbitrary rectangular partitioning of a table is a more general approach than a pure column-store or a pure row-store. A column-store is simply a special type of rectangular partitioning where each partition is a long, narrow rectangle around a single column of data. Row-oriented storage can also be simulated with special types of rectangles. Therefore, by supporting arbitrary rectangular partitioning, Teradata is able to support “pure” column-oriented storage, “pure” row-oriented storage, and many other types of storage between these two extremes.
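
To make the idea of arbitrary rectangular partitioning concrete, the following sketch partitions an illustrative sales table by quarter (rows) and by column groups that tend to be queried together ({store, revenue} and {product, product category}, as in the example above). The data, helper functions, and in-memory layout are purely illustrative and are not Teradata’s physical storage format.

```python
# Schematic sketch of rectangular (row x column) partitioning. Rows are
# partitioned by quarter; columns are partitioned into groups that tend to be
# queried together. Each (row range, column group) rectangle could then be
# stored -- and eliminated -- independently. Data is illustrative.

from collections import defaultdict

sales = [
    {"date": "2015-01-15", "store": "S1", "revenue": 120,
     "product": "bicycle", "product_category": "bikes"},
    {"date": "2015-02-03", "store": "S2", "revenue": 80,
     "product": "lock", "product_category": "accessories"},
    {"date": "2015-04-20", "store": "S1", "revenue": 95,
     "product": "tire", "product_category": "parts"},
]

column_groups = [("store", "revenue"), ("product", "product_category")]

def quarter(date):
    """Row-partitioning key: (year, quarter) derived from an ISO date."""
    return int(date[:4]), (int(date[5:7]) - 1) // 3 + 1

# partitions[(quarter, column_group)] holds one "rectangle" of the table.
partitions = defaultdict(list)
for row in sales:
    for group in column_groups:
        partitions[(quarter(row["date"]), group)].append(
            tuple(row[c] for c in group))

# A query on store revenue in Q1 2015 touches exactly one rectangle and can
# eliminate every other partition (other quarters and other column groups):
print(partitions[((2015, 1), ("store", "revenue"))])  # [('S1', 120), ('S2', 80)]
```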

2: Teradata can physically store each rectangular partition in “row-format” or “column-format.”

One oft-cited advantage of column-stores is that for columns containing fixed-width values, the entire column can be represented as a single array of values. The row id for any particular element in the array can be determined directly by the index of the element within the array. Accessing a column in an array-format can lead to significant performance benefits, including reducing I/O and leveraging the SIMD instruction set on modern CPUs, since expression or predicate evaluation can occur in parallel on multiple array elements at once.

Another oft-cited advantage of column-stores (especially within my own research --- see e.g. http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf ) is that column-stores compress data much better than row-stores because there is more self-similarity (lower entropy) of data within a column than across columns, since each value within a column is drawn from the same attribute domain. Furthermore, it is not uncommon to see the same value repeat multiple times consecutively within a column, in which case the column can be compressed using run-length encoding --- a particularly useful type of compression since it can both result in high compression ratios and also is trivial to operate on directly, without requiring decompression of the data.

Both of these advantages of column-stores are supported in Teradata when the column-format is used for storage within a partition. In particular, multiple values of a column (or a small group of columns) are stored contiguously in an array within a Teradata data structure called a “container”. Each container comes with a header indicating the row identifier of the first value within the container, and the row identifiers of every other value in the container can be deduced by adding their relative position within the container to the row identifier of the first value. Each container is automatically compressed using the optimal column-oriented compression format for that data, including run-length encoding the data when possible.


Figure 2: Column-format storage using containers.

However, one disadvantage of not physically storing the row identifier next to each value is that extraction of a value given a row identifier requires more work, since additional calculations must be performed to extract the correct value from the container. In some cases, these additional calculations involve just positional offsetting; however, in some cases, the compressed bits of the container have to be scanned in order to extract the correct value. Therefore Teradata also supports traditional row-format storage within each partition, where the row identifier is explicitly stored alongside any column values associated with that row. When partitions are stored using this “row format”, Teradata’s traditional mechanisms for quickly finding a row given a row identifier can be leveraged.
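
A rough mental model of the container described above --- a header holding the first row identifier plus a run-length-encoded array of values, with each value’s row id recovered by positional offset --- is sketched below. This is a simplified conceptual model, not Teradata’s actual container layout.

```python
# Simplified model of a column "container": a header row id plus a
# run-length-encoded array of values. A value's row id is implicit in its
# position, so row ids are never stored per value. Names are illustrative.

class Container:
    def __init__(self, first_row_id, values):
        self.first_row_id = first_row_id   # header: row id of the first value
        self.runs = []                     # values stored as (value, run length)
        for v in values:
            if self.runs and self.runs[-1][0] == v:
                self.runs[-1][1] += 1
            else:
                self.runs.append([v, 1])

    def get(self, row_id):
        """Extract the value for a row id by positional offset. With RLE the
        runs must be walked, which is the extra work described above relative
        to row-format storage, where the row id is stored explicitly."""
        offset = row_id - self.first_row_id
        for value, length in self.runs:
            if offset < length:
                return value
            offset -= length
        raise KeyError(row_id)

c = Container(first_row_id=1000, values=["CA", "CA", "CA", "NY", "NY", "TX"])
print(c.runs)       # [['CA', 3], ['NY', 2], ['TX', 1]]
print(c.get(1003))  # 'NY' -- row id recovered from position, not stored
```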

In general, when the rectangular partitioning scheme results in wide rectangles, row format storage is recommended, since the overhead of storing the row id with each row is amortized across the breadth of the row, and the benefits of array-oriented iteration through the data are minimal. But when the partitioning scheme results in narrow rectangles, column-format storage is recommended, in order to get the most out of column-oriented array iteration and compression. Either way --- having a choice between row format and column format for each partition further improves the flexibility of Teradata’s row/columnar hybrid scheme.

3: Teradata enables traditional primary indexing for quick row-access even when column-oriented partitioning is used.

Many column-stores do not support primary indexes due to the complexity involved in moving records around as a result of new inserts into the index. In contrast, Teradata Database 15.10 supports two types of primary indexing when a table has been partitioned to AMPs (logical servers) by the hash value of the primary index attribute. The first, called CPPI, maintains all row and column partitions on an AMP sorted by the hash value of the primary index attribute. These hash values are stored within the row identifier for the record, which enables each column partition to independently maintain the same sort order without explicitly communicating with the others. Since the data is sorted by the hash of the primary index attribute, finding particular records for a given value of the primary index attribute is extremely fast. The second, called CPPA, does not sort by the hash of the primary index attribute; the AMP that contains a particular record can still be quickly identified given a value of the primary index attribute, but further searching is necessary within the AMP to find the particular record. This searching is limited to the non-eliminated, nonempty column and row partitions. Finding a particular record given a row id is extremely fast for both CPPI and CPPA since, in either case, the records are in row id order.

Combined, these three features make Teradata’s hybrid solution to the row-store vs. column-store tradeoff extremely general and flexible. In fact, it’s possible to argue that there does not exist a more flexible hybrid solution from a major vendor on the market. Teradata has also developed significant flexibility inside its execution engine --- adapting to column-format vs. row-format input automatically, and using optimal query execution methods depending on the format-type that a particular query reads from.
====================================================================================


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

 

Are you a business person or executive involved in a data warehouse project where the term “normalization” keeps coming up but you have no idea what they (the technical IT folks) mean? You have heard them talk about “third” normal form and wonder if it is some new health fad or yoga position.

In my prior blog “Modeling the Data” I talked about how data integration is necessary to address many of your business priorities and that one of the first steps in data integration is to organize your data into tables. A “data model” is a graphical representation of that organization which serves as a communication tool between and within the business and IT as shown below.

Data Model Example – Accounts and Individuals Reflect Business Rules

Normalization

So now we get to normalization. Normalization is the process that one goes through to decide in which table a type of data belongs. Let’s take a simple example. I have two tables – one contains loan account information and another contains information about individuals who may be customers (see above Figure). I have a data type called “birth date.” During the normalization process I will ask “What does this data type describe?” Does it describe the account or does it describe an individual? This answer is simple – it describes individuals. You may think that this is a piece of cake. Well, not so fast. Which table is the best fit for the data type “birth date” may be obvious to us, but many times the “best table fit” for a type of data may not be so obvious and hence you need definitions for those data types.

One example of an ambiguous data type is “balance.” Does this “balance” describe a point in time for an account? Or does it describe the sum of the balances for a group of accounts at a point in time? Maybe it should be “average balance over a time period.” Maybe it is high balance or low balance or a limit at a point in time. Maybe it is the cleared balance or a ledger balance. Maybe it is a summation of all the deposit balances held by one person at a point in time. A data model is not complete unless all its components (tables and columns) have definitions.

The normalization process can get more involved when we talk about first, second and third normal forms (and sometimes fourth and fifth). Using the birth date example, if the type of data (e.g. birth date) describes the complete meaning of the table then it is third normal form. In the above data model example, if I put birth date into the INDIVIDUAL ACCOUNT table then that would not be in third normal form because the birth date describes only part of the meaning of that table – the individual part. In this case it would be in only second normal form. By putting birth date into the INDIVIDUAL table it is in third normal form because it describes the complete meaning of the table. In most cases we take a model to third normal form but not fourth or fifth.

Why Normalize Your Data?

Why is it important to normalize your data? There are two basic reasons. (1) The first is to eliminate redundancy. When you bring your data together from different sources you will inevitably have duplications in data values for the same data type across the source systems. One example is that the same person may have their name spelled differently on a loan account versus a deposit account. That person does not have two names; the name just needs to be represented once, with one value, in the right place in the integrated database. (2) The second reason is to make sure that the data is organized into tables in a way that reflects the business rules – our example of birth date describing the individual and not the account. Putting data where it logically belongs will make it easier and more cost effective to maintain over the long term.

In Summary

So the next time someone brings up the concept of normalization, think about the buckets of data you have in the enterprise and how you need to bring it all together so you can answer those tough business questions. Finally, when you bring it together, you need to eliminate redundancy and organize data in a logical way that makes sense to the business so that your efforts and design will last over the long term. Normalization is one of the processes to get you there.


Nancy Kalthoff is the product manager and original data architect for the Teradata financial services data model (FSDM) for which she received a patent. She has been in IT over 30 years and with Teradata over 20 years concentrating on data architecture, business analysis, and data modeling.

Integrate Data, Processes & People: End Data Ownership Turf Wars

Posted on: November 18th, 2014 by Guest Blogger

 

The biggest thing I’ve realized over the past couple of months -- other than that Tony Romo is one of the best NFL players I’ve ever seen -- is the fact that data ownership is still a huge problem for companies of all types. Tony’s numbers keep getting more impressive game by game, and likewise the pace of data streaming into organizational information channels rises by the hour.

Romo aside, the dramatic growth of data volume only seems to rekindle data ownership issues among so many internal departments, who don’t or won’t see the advantages of sharing information. Maybe call it data ‘possessiveness.’ I really thought, and I have said in numerous presentations over the summer, that we’d solved the data ownership problem. To begin with, the industry seems to have understood the value of data integration and optimization, and a ‘single view of the customer,’ once just a hazy vision in the distance, was now becoming a technological reality, so isn’t ‘data sharing’ a no-brainer?

But after presentations at a few conferences this fall, including Teradata Partners, folks have come to say things like, “Everyone in all of our marketing areas wants data from the Customer Insights group -- but they won’t play nice and share it.” Others have made similar complaints at reception chats or over lunch. These little ‘data dramas’ and ‘turf tussles’ surprise me.

What also continues to surprise me is the lack of marketing participation at many IT gatherings. When I lead a session on how to use big data in marketing, the audience is usually 90% data scientists or IT specialists and 10% marketing. Sunday, in my well-attended session at the Partners event, it was zero marketers. What’s up with that? Where are those savvy ‘data-driven’ marketers?

On to Monday, where we held a lunch for 30 retailers and CPG firms. Other than the one analyst from IDC, the rest were from companies like Safeway, Target, HEB, Williams & Sonoma, Hallmark, etc. While some work in marketing, none viewed themselves as ‘marketers’ – they were data scientists or IT specialists. Great people, truly interested in Dynamic Customer Strategy, and like Sunday’s session, it went very well. But again, no CMOs, no marketing directors, no merchandisers. Baffling!

Trend-watchers report that marketing is supposed to be the biggest spender on IT by 2017, outpacing the IT department. Somehow, the CMO is supposed to become the most proficient IT buyer on the planet. So when does the foundational due diligence take place?

Reading a few white papers or looking at where a particular solution is -- in some magic matrix -- is not sufficient. Someone thinks, “Oh, we can make money with marketing automation tools here. Let’s get one and get some data and go to work.” And then they demand the data from the data group, or from some other group so they can get their work done without thinking about the greater good. Sharing data always results in the greater good, right?

Data possessiveness has become the modern tragedy of the commons, a phrase coined to describe the overgrazing that would occur when everyone shared a common pasture (like the Boston commons).

In this modern-day tragedy, there are two outcomes. First comes technology bloat, and with technology bloat comes lots of little not-playing-well-with-others data sets and an insufficient data strategy. Maybe they can import the data in -- but not out.

In case you are unfamiliar with the term, technology bloat was coined by my former student and now consultant Ben Becker (beckerstrategies.com) to describe the common situation of multiple overlapping software solutions. In environments where data silos and turf battles over applications exist, technology bloat is a huge challenge for IT: Multiple systems to support when one would do, budget-crushing agreements when rationalization would be less expensive, and so on and so on.

That’s why I found it interesting that Michael Koehler, Teradata’s CEO, emphasized integration as the key watchword for 2015. Integration clearly is a play that works well for Teradata, especially with its full suite of solutions. But when marketing is spending more than IT on IT and doesn’t know how, there’s a tall challenge.

Another causal factor of technology bloat is how marketing budgets for and spends funds on IT. The budget to acquire may not even be an IT budget but may come out of monies allocated to a particular program or profit center. The campaigns budget, for example, might be used to buy a campaign management tool. As long as revenue targets are hit, all is well from a budgetary perspective, at least as far as marketing management is concerned. Of course, never mind that it’s the third campaign management tool that the organization has purchased.

Similarly, there's the revenue ownership problem. In spite of attribution modeling that can weight the effectiveness of each element in the marketing mix and apportion revenue accordingly, each profit center is unwilling to share revenue or customers. The result is customers who delete and ignore every marketing message from their former partners who now over-market because they won’t/can’t share data. In my department alone, I know of at least three different CRM systems.

Moreover, marketers just want to do marketing, and especially the cool marketing. I get that. It's fun to see marketing strategy actually lead to revenue, whether you're in B2C and actual sales are immediately triggered or B2B and the work is mostly above the funnel.

I suspect, though, that the problem is greater in B2B. When we did the study in retailing earlier this year, we were far less likely to identify data ownership as a bottleneck. Retailers are more mature than B2B in the whole data thing anyway, but there are also fewer marketing areas. B2B companies tend to be organized by product or vertical market and each operates as a separate business unit. Marketing departments or teams proliferate, and that leads to technology bloat etc.

While the simple answer might be that IT should be making more decisions, I don’t see that as realistic. And if Koehler is correct that we’re in for a period of integration, then I suspect that will mean consolidation of tools and applications into suites. What’s interesting to me is that Teradata seems to be the lone voice, even among full-suite providers, crying out to end technology bloat.

However, I agree that a period of integration is coming. When the CIO can demonstrate to the CMO how integration can improve revenue through better data and marketing strategies while reducing costs (and in that order), most CMOs will make that move.

Data possessiveness will fade as the benefits of sharing integrated data become ubiquitous and irresistible. Data dramas will cease, and marketers will more enthusiastically participate in IT conferences.

It’s called teamwork – something Tony Romo totally understands.


Dr. Jeff R. Tanner is Professor of Marketing and the Executive Director of Baylor University’s Innovative Business Collaboratory. He regularly speaks at conferences such as CRM Evolution, Teradata Partners, Retail Technology, INFORMS, and others. Author or co-author of 15 books, including his newest, Analytics and Dynamic Customer Strategy, he is an active consultant to organizations such as Lawrence Livermore National Laboratory, Pearson-Prentice Hall, and Cabela’s.

 

It happens every few years and it’s happening again. A new technology comes along and a significant segment of the IT and business community want to toss out everything we’ve learned over the past 60 years and start fresh. We “discover” that we’ve been wasting time applying unnecessary rigor and bureaucracy to our projects. No longer should we have to wait three to six months or longer to deliver technical solutions to the business. We can turn these things around in three to six days or even less.

In the mid-1990s, I was part of a team that developed a “pilot” object-oriented, client-server (remember when these were the hot buzzwords?) application to replenish raw materials for a manufacturing function. We were upending the traditional mainframe world by delivering a solution quickly and iteratively with a small team. When the end users started using the application in real life, it was clear they were going to rely on it to do their jobs every day. Wait, was this a pilot or…? I would come into work in the morning, walk into a special room that housed the application and database servers, check the logs, note any errors, make whatever fixes needed to be made, re-run jobs, and so on.

It wasn’t long before this work began to interfere with my next project, and the end users became frustrated when I wasn’t available to fix problems quickly. It took us a while and several conversations with operations to determine that “production” didn’t just mean “the mainframe”. “Production” meant that people were relying on the solution on a regular basis to do their jobs. So we backtracked and started talking about what kind of availability guarantees we could make, how backup and recovery should work, how we could transition monitoring and maintenance to operations, and so on. In other words, we realized what we needed was a traditional IT project that just happened to leverage newer technologies.

This same scenario is happening today with Hadoop and related tools. When I visit client organizations, a frightening number will have at least one serious person saying something like, “I really don’t think ‘data warehousing’ makes sense any more. It takes too long. We should put all our data in Hadoop and let our end users access whatever they want.” It is indeed a great idea to establish an environment that enables exploration and quick-turnaround analysis against raw data and production data. But to position this approach as a core data and analytics strategy is nothing short of professional malpractice.

The problem is that people are confusing experimentation with IT projects. There is a place for both, and there always has been. Experimentation (or discovery, research, ad-hoc analysis, or whatever term you wish to use) should have lightweight processes and data management practices – it requires prioritization of analysis activity, security and privacy policies and implementation, some understanding of available data, and so on, but it should not be overburdened with the typical rigor required of projects that are building solutions destined for production. Once a prototype is ready to be used on a regular basis for important business functions, that solution should be built through a rigorous IT project leveraging an appropriate – dare I say it – solution development life cycle (SDLC), along with a comprehensive enterprise architecture plan including, yes, a data warehouse that provides integrated, shared, and trusted production data.

An experimental prototype should never be “promoted” to a production environment. That’s what a project is for. Experimentation can be accomplished with Hadoop, relational technology, Microsoft Office, and many other technologies. These same technologies can also be used for production solutions. So, it’s not that “things are done differently and more quickly in Hadoop”. Instead, it’s more appropriate to say that experimentation is different than an IT project, regardless of technology.

Yes, we should do everything we can to reduce unnecessary paperwork and to speed up delivery using proper objective setting, scoping, and agile development techniques. But that is different than abandoning rigor altogether. In fact, using newer technologies in IT projects requires more attention to detail, not less, because we have to take the maturity of the technology into consideration. Can it meet the service level needs of a particular solution? This needs to be asked and examined formally within the project.

Attempting to build production solutions using ad-hoc, experimental data preparation and analysis techniques is like building a modern skyscraper with a grass hut mentality. It just doesn’t make any sense.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries. 

Take a Giant Step with Teradata QueryGrid

Posted on: April 29th, 2014 by Dan Graham No Comments

 

Teradata 15.0 has gotten tremendous interest from customers and the press because it enables SQL access to native JSON data. This heralds the end of the belief that data warehouses can’t handle unstructured data. But there’s an equally momentous innovation in this release called Teradata QueryGrid.

What is Teradata QueryGrid?
In Teradata's Unified Data Architecture (UDA), there are three primary platforms: the data warehouse, the discovery platform, and the data platform. In the usual UDA diagram, huge gray arrows represent data flowing between these systems. A year or two ago, those arrows were extract files moved in batch mode.

Teradata QueryGrid is both a vision and a technology. The vision --simply said-- is that a business person connected to the Teradata Database or Aster Database can submit a single SQL query that joins data together from a second system for analysis. There’s no need to plead with the programmers to extract data and load it into another machine. The business person doesn’t have to care where the data is – they can simply combine relational tables in Teradata with tables or flat files found in Hadoop on demand. Imagine a data scientist working on an Aster discovery problem and needing data from Hadoop. By simply adjusting the queries she is already using, Hadoop data is fetched and combined with tables in the Aster Database. That should be a huge “WOW” all by itself but let’s look further.
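
To make the vision concrete, here is a minimal sketch of what such a cross-system query could look like. The table names and the foreign-server reference (sales_hist@hadoop1) are illustrative assumptions, and the exact syntax for referencing remote objects depends on the Teradata or Aster release in use.

    -- Illustrative sketch only: "customer" is a Teradata table, while
    -- "sales_hist" is assumed to be a Hive table reachable through a
    -- QueryGrid foreign server named "hadoop1" (a made-up name).
    SELECT c.customer_id,
           c.segment,
           SUM(s.sale_amt) AS total_spend
    FROM   customer c
    JOIN   sales_hist@hadoop1 s        -- remote Hadoop data, fetched on demand
      ON   c.customer_id = s.customer_id
    GROUP BY c.customer_id, c.segment;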

You might be saying “That’s not new. We’ve had data virtualization queries for many years.” Teradata QueryGrid is indeed a form of data virtualization. But Teradata QueryGrid doesn’t suffer from the normal limitations of data virtualization such as slow performance, clogged networks, and security concerns.

Today, the vision is translated into reality as connections between the Teradata Database and Hadoop, and between the Aster Database and Hadoop. Teradata QueryGrid also connects the Teradata data warehouse to Oracle databases. In the near future, it will extend to all combinations of UDA servers such as Teradata to Aster, Aster to Aster, Teradata to Teradata, and so on.

Seven League Boots for SQL
With QueryGrid, you can add a clause in a SQL statement that says "Call up Hadoop, pass Hive a SQL request, receive the Hive results, and join them to the data warehouse tables." Running a single SQL statement spanning Hadoop and Teradata is amazing in itself – a giant step forward. Notice too that all the database security, advanced SQL functions, and system management in the Teradata or Aster system support these queries. The only effort required is for the database administrator to set up a "view" that connects the systems. It's self-service for the business user after that. Score: complexity zero, business users one.
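
As a rough sketch of that one-time DBA step, the view below hides the remote reference behind an ordinary object name. The server, view, table, and column names are all hypothetical, and the foreign-server DDL itself is omitted because it differs across QueryGrid releases.

    -- Assumes the DBA has already defined a QueryGrid foreign server
    -- named "hadoop1" (DDL omitted; it is release-specific).
    CREATE VIEW v_sales_hist AS
      SELECT sale_id, customer_id, sale_amt, sale_dt
      FROM   sales_hist@hadoop1;       -- remote Hive table behind the view

    -- After that, any user with any BI tool can simply run:
    SELECT * FROM v_sales_hist WHERE sale_dt >= DATE '2014-01-01';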

Parallel Performance, Performance, Performance
Historically, data virtualization tools lack the ability to move data between systems in parallel. Such tools send a request to a remote database and the data comes back serially through an Ethernet wire. Teradata QueryGrid is built to connect to remote systems in parallel and exchange data through many network connections simultaneously. Wanna move a terabyte per minute? With the right configurations it can be done. Parallel processing by both systems makes this incredibly fast. I know of no data virtualization system that does this today.

Inevitably, the Hadoop cluster will have a different number of servers than the Teradata or Aster MPP systems. The Teradata and Aster systems start the parallel data exchange by matching up units of parallelism between the two systems. That is, all the Teradata parallel workers (called AMPs) connect to a buddy Hadoop worker node for maximum throughput. Anytime the configuration changes, the worker match-up changes. This is non-trivial, rocket-science-class technology. Trust me – you don't want to build this yourself, and the worst situation would be to expose it to the business users. But Teradata QueryGrid does it all for you, completely invisibly to the user.

Put Data in the Data Lake FAST
Imagine that complex predictive analytics using R® or SAS® are run inside the Teradata data warehouse as part of a merger and acquisition project. In this case, we want to pass this data to the Hadoop Data Lake, where it is combined with temporary data from the company being acquired. With moderately simple SQL stuffed in a database view, the answers calculated by the Teradata Database can be sent to Hadoop to help finish up some reports. Bi-directional data exchange is another breakthrough in Teradata QueryGrid, new in release 15.0. The common thread in all these innovations is that the data moves from the memory of one system to the memory of the other. No extracts, no landing the data on disk until the final processing step – and sometimes not even then.
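
As a hedged sketch of that outbound direction, reusing the made-up foreign server from the earlier examples and invented table names, the export might look something like this; the exact syntax for writing to a remote table varies by QueryGrid release.

    -- Illustrative only: push warehouse-computed scores out to a Hive table
    -- exposed through the assumed foreign server "hadoop1".
    INSERT INTO acquisition_scores@hadoop1
    SELECT account_id,
           churn_score,
           CURRENT_DATE
    FROM   scored_accounts             -- results produced in-database
    WHERE  churn_score >= 0.8;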

Push-down Processing
What we don’t want to do is transfer terabytes of data from Hadoop and throw away 90% of it since it’s not relevant. To minimize data movement, Teradata QueryGrid sends the remote system SQL filters that eliminate records and columns that aren’t needed. An example constraint could be “We only want records for single women age 30-40 with an average account balance over $5000. Oh, and only send us the account number, account type, and address.” This way, the Hadoop system discards unnecessary data so it doesn’t flood the network with data that will be thrown away. After all the processing is done in Hadoop, data is joined in the data warehouse, summarized, and delivered to the user’s favorite business intelligence tool.
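
Expressed as SQL against made-up table and column names, the example constraint above might look like the following; the point is that the predicates and the short select list are shipped to Hive, so the filtering happens on the Hadoop side before anything crosses the network.

    -- Only rows that survive the filters, and only the three requested
    -- columns, travel back from Hadoop to the data warehouse.
    SELECT acct_nbr,
           acct_type,
           address
    FROM   accounts@hadoop1            -- assumed remote Hive table
    WHERE  marital_status = 'Single'
      AND  gender = 'F'
      AND  age BETWEEN 30 AND 40
      AND  avg_balance > 5000;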

Teradata QueryGrid delivers some important benefits:
• It's easy to use: any user with any BI tool can do it
• Low DBA labor: it's mostly setting up views and testing them once
• High performance: reducing hours to minutes means more accuracy and faster turnaround for demanding users
• Cross-system data on demand: no getting stuck in the programmers' work queue
• Teradata/Aster strengths: security, workload management, and system management carry over
• Minimal data movement: improves performance and reduces network use
• Processing moves to the data

Big data is now taking giant steps through your analytic architecture --frictionless, invisible, and in parallel. Nice boots!

 

In the Star Trek movies, “the Borg” refers to an alien race that conquers all planets, absorbing the people, technology, and resources into the Borg collective. Even Captain Picard becomes a Borg and chants “We are the Borg. You will be assimilated. Resistance is futile.”

It strikes me that the relational database has behaved similarly since its birth. Over the last thirty years, Teradata and other RDBMS vendors have innovated and modernized, constantly revitalizing what it means to be an RDBMS. But some innovations come from start-up companies that are later assimilated into the RDBMS. And some innovations are reactions to competition. Regardless, many innovations eventually end up in the code base of multiple RDBMS vendor products --with proper respect to patents of course. Here are some examples of cool technologies assimilated into Teradata Database:

• MOLAP cubes storm the market in the late 1990s, with Essbase setting the pace and Cognos inventing desktop cubes. MicroStrategy and Teradata team up to push ROLAP SQL down into the database for parallel speed. Hyperion Essbase and Teradata also did hybrid OLAP integration together. Essbase gets acquired, MOLAP cubes fall out of fashion, and in-database ROLAP goes on to provide the best of both worlds as CPUs get faster.

• Early in the 2000s, a startup called Sunopsis shows a distinct advantage of running ELT transformations in-database to get parallel performance with Teradata. ELT takes off in the industry like a rocket. Teradata Labs also collaborates with Informatica to push-down PowerCenter transformation logic into SQL for amazing extract, load, and transform speed. Sunopsis gets acquired. More ETL vendors adopt ELT techniques. Happy DBAs and operations managers meet their nightly batch performance goals. More startups disappear.

• XML and XQuery become the rage in the press -- until almost every RDBMS adds a data type for XML, plus shred and unshred operators. XML-only database startups are marginalized.

• The uptick of predictive analytics in the market drives collaboration between Teradata and SAS back in 2007. SAS Procs are pushed down into the database to run massively parallel, opening up tremendous performance benefits for SAS users. This leads many RDBMS vendors to copy the technique; SAS is in the limelight, and eventually even Hadoop programmers want to run SAS in parallel. Later we see R, Fuzzy Logix, and others run in-database too. Sounds like the proverbial win-win to me.

• In-memory technology from QlikView and TIBCO Spotfire excites the market with order-of-magnitude performance gains. Several RDBMS vendors then adopt in-memory concepts. But in-memory has limitations on memory size and cost vis-à-vis terabytes of data. Consequently, Teradata introduces Teradata Intelligent Memory, which automatically caches hot data in memory while managing many terabytes of hot and cold data on disk. Two to three percent of the hottest data, ranked by data temperature (that is, how often users touch it), delivers superfast response time. Cool! Or is it hot?

• After reading the Google research paper on MapReduce, a startup called AsterData invents SQL-MapReduce (SQL-MR) to add flexible processing to a flexible database engine. This cool innovation leads Teradata to acquire AsterData. Within a year, Aster strikes a nerve across the industry – MapReduce is in-database! This month, Aster earns numerous #1 scores in Ovum's "Decision Matrix: Selecting an Analytic Database, 2013-14" (January 2014). The race is on for MapReduce in-database!

• The NoSQL community grabs headlines with their unique designs and reliance on JSON data and key-value pairs. MongoDB is hot, using JSON data while CouchBase and Cassandra leverage key-value stores. Teradata promptly decides to add JSON data (unstructured data) to the database and goes the extra mile to put JSONPath syntax into SQL. Teradata also adds the name-value-pair SQL operator (NVP) to extract JSON or key-value store data from weblogs. Schema-on-read technology gets assimilated into the Teradata Database. Java programmers are pleased. Customers make plans. More wins.
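
To give a flavor of what that assimilation looks like in SQL, here is a small hedged sketch of querying JSON and name-value-pair data in-database. The table and column names are invented, and the JSONExtractValue method and NVP function are used as illustrations of the JSONPath and name-value-pair capabilities described above; consult the documentation for exact signatures in a given release.

    -- Illustrative sketch: extract a field from a JSON column via a JSONPath
    -- expression, and pull one value out of a name-value-pair weblog string.
    SELECT order_json.JSONExtractValue('$.customer.name') AS customer_name,
           NVP(weblog_line, 'referrer', '&', '=', 1)       AS referrer
    FROM   web_orders;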

--------------------------------------------------------------------------------------------------------

“One trend to watch going forward, in addition to the rise of multi-model NoSQL databases, is the integration of NoSQL concepts into relational databases. One of the methods used in the past by relational database vendors to restrict the adoption of new databases to handle new data formats has been to embrace those formats within the relational database. Two prime examples would be support for XML and object-oriented programming.”
- Matt Aslett, The 451 Group, Next-Generation Operational Databases 2012-2016, Sep 17, 2013

--------------------------------------------------------------------------------------------------------

I’ve had conversations with other industry analysts and they’ve confirmed Matt’s opinion: RDBMS vendors will respond to market trends, innovations, and competitive threats by integrating those technologies into their offering. Unlike the Borg, a lot of these assimilations by RDBMS are friendly collaborations (MicroStrategy, Informatica, SAS, Fuzzy Logix, Revolution R, etc.). Others are just the recognition of new data types that need to be in the database (JSON, XML, BLOBs, geospatial, etc.).

Why is it good to have all these innovations inside the major RDBMS’s? Everyone is having fun right now with their science projects because hype is very high for this startup or that startup or this shiny new thing. But when it comes time to deploy production analytic applications to hundreds or thousands of users, all the “ities” become critical all of a sudden – “ities” that the new kids don’t have and the RDBMS does. “ities” like reliability, recoverability, security, and availability. Companies like Google can bury shiny new 1.oh-my-god quality software in an army of brilliant computer scientists. But Main Street and Wall Street companies cannot.

More important, many people are doing new multi-structured data projects in isolation -- such as weblog analysis, sensor data, graph analysis, or social text analysis. Soon enough they discover the highest value comes from combining that data with all the rest of the data that the organization has collected on customers, inventories, campaigns, financials, etc. Great, I found a new segment of buyer preferences. What does that mean to campaigns, sales, and inventory? Integrating new big data into an RDBMS is a huge win going forward – much better than keeping the different data sets isolated in the basement.

Like this year's new BMW or Lexus, RDBMS's modernize; they define modern. But relational database systems don't grow old; they don't rust or wear out. RDBMS's evolve to stay current and constantly introduce new technology.

We are the RDBMS! Technology will be assimilated. Resistance is futile.