
Optimization in Data Modeling 1 – Primary Index Selection

Posted on: July 14th, 2015 by Guest Blogger

 

In my last blog I spoke about the decisions that must be made when transforming an Industry Data Model (iDM) from Logical Data Model (LDM) to an implementable Physical Data Model (PDM). However, being able to generate DDL (Data Definition Language) that will run on a Teradata platform is not enough – you also want it to perform well. While it is possible to generate DDL almost immediately from a Teradata iDM, each customer’s needs mandate that existing structures be reviewed against data and access demographics, so that optimal performance can be achieved.

Having detailed data and access path demographics during PDM design is critical to achieving great performance immediately, otherwise it’s simply guesswork. Alas, these are almost never available at the beginning of an installation, but that doesn’t mean you can’t make “excellent guesses.”

The single most influential factor in achieving PDM performance is proper Primary Index (PI) selection for warehouse tables. Data modelers focus on entity/table Primary Keys (PK), since the PK is what defines uniqueness at the row level. Because of this, many physical modelers default to implementing the PK as a Unique Primary Index (UPI) on each table. But one of the keys to Teradata’s great performance is that it uses the PI to physically distribute a table’s data across the entire platform to optimize parallelism. Each processor gets a piece of the table based on the PI, so rows from different tables with the same PI value are co-resident and do not need to be moved when the two tables are joined.
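
To make the co-residency point concrete, here is a minimal Python sketch (an illustrative stand-in, not Teradata’s actual hashing algorithm) that distributes rows from two tables across a handful of simulated AMPs by hashing the PI value. Rows that share a PI value always land on the same AMP, so a join on that value needs no data movement:

```python
import hashlib

AMPS = 4  # simulated AMPs; a real Teradata system has many more

def amp_for(pi_value):
    """Hash a Primary Index value to an AMP (illustrative stand-in for the real hash)."""
    return int(hashlib.md5(str(pi_value).encode()).hexdigest(), 16) % AMPS

# Two tables that both use PARTY_ID as a (non-unique) Primary Index.
party  = [{"party_id": p, "name": f"Party {p}"} for p in range(1, 9)]
orders = [{"order_id": o, "party_id": (o % 8) + 1} for o in range(20)]

placement = {a: {"party": [], "orders": []} for a in range(AMPS)}
for row in party:
    placement[amp_for(row["party_id"])]["party"].append(row)
for row in orders:
    placement[amp_for(row["party_id"])]["orders"].append(row)

for amp, rows in placement.items():
    print(f"AMP {amp}: {len(rows['party'])} party rows, {len(rows['orders'])} order rows")

# Every order row landed on the same AMP as the party row it joins to, so a join
# on party_id needs no data movement.  If the two tables used different PIs, at
# least one of them would have to be rehashed and redistributed before the join.
```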

In a Third Normal Form (3NF) model, no two entities (outside of super/subtypes and rare exceptions) will have the same PK. If PKs are chosen as PIs, it stands to reason that no two tables will share a PI, and every table join will require data from at least one table to be moved before the join can be completed – not a solid performance decision, to say the least.

The iDMs have preselected PIs largely based on identifiers common across subject areas (e.g. Party Id), so that all information regarding that identifier will be co-resident and joins will be AMP-local. These non-unique PIs (NUPIs) are a great starting point for your PDM, but again they need to be evaluated against customer data and access plans to ensure that both good performance and reasonably even data distribution are achieved.

Even data distribution across the Teradata platform is important, since skewed data contributes both to poor performance and to space-allocation problems (run out of space on one AMP, run out of space on all). However, it can be overemphasized to the detriment of performance.

Say, for example, a table has a PI of PRODUCT_ID, and a disproportionate number of rows for several Products causes skewed distribution. Altering the PI to the table PK instead will provide perfectly even distribution, but remember: when joining to that table, if all elements of the PK are not available, then the rows of the table will need to be redistributed, most likely by PRODUCT_ID.

This puts them back under the AMP where they were in the skewed scenario. This time instead of a “rest state” skew the rows will skew during redistribution, and this will happen every time the table is joined to – not a solid performance decision. Optimum performance can therefore be achieved with sub-optimum distribution.
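
As a rough illustration of the trade-off (made-up data, not from any customer system), the sketch below distributes the same order rows two ways: by a skewed PRODUCT_ID and by a unique ORDER_ID. The unique key spreads the rows evenly at rest, but a join that supplies only PRODUCT_ID forces the rows to be rehashed by PRODUCT_ID anyway, recreating the skew at query time:

```python
import hashlib
from collections import Counter

AMPS = 4

def amp_for(value):
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16) % AMPS

# Made-up order rows: product 1 is far more popular than the rest.
orders = [{"order_id": i, "product_id": 1 if i % 10 < 7 else (i % 4) + 2}
          for i in range(1000)]

by_product = Counter(amp_for(r["product_id"]) for r in orders)  # skewed at rest
by_order   = Counter(amp_for(r["order_id"]) for r in orders)    # even at rest

print("PI = PRODUCT_ID:", dict(by_product))
print("PI = ORDER_ID  :", dict(by_order))

# With PI = ORDER_ID the rows sit evenly, but a join that arrives with only
# product_id must redistribute the rows by product_id, landing them on the same
# few AMPs as the skewed layout, and paying that cost on every join.
redistributed = Counter(amp_for(r["product_id"]) for r in orders)
print("Redistributed  :", dict(redistributed))
```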

iDM tables relating two common identifiers will usually have one of the IDs pre-selected as a NUPI. In some installations the access demographics will show that the other ID may be the better choice. If so, change it! Or the demographics may leave you with no clear choice, in which case picking one is almost assuredly better than changing the PI to a composite index consisting of both IDs, as this will only result in a table that is no longer co-resident with any table indexed by either of the IDs alone.

There are many other factors that contribute to achieving optimal performance of your physical model, but they all pale in comparison to a well-chosen PI. In my next blog we’ll look at some more of these and discuss when and how best to implement them.


Jake Kurdsjuk is Product Manager for the Teradata Communications Industry Data Model, purchased by more than one hundred Communications Service Providers worldwide. Jake has been with Teradata since 2001 and has 25 years of experience working with Teradata within the Communications Industry, as a programmer, DBA, Data Architect and Modeler.

 

It is well-known that there are two extreme alternatives for storing database tables on any storage media: storing it row-by-row (as done by traditional “row-store” technology) or storing it column-by-column (as done by recently popular “column-store” implementations). Row-stores store the entire first row of the table, followed by the entire second row of the table, etc. Column-stores store the entire first column of the table, followed by the entire second column of the table, etc. There have been huge amounts of research literature and commercial whitepapers that discuss the various advantages of these alternative approaches, along with various proposals for hybrid solutions (which I discussed in more detail in my previous post).

Despite the many conflicting arguments in favor of these different approaches, there is little question that column-stores compress data much better than row-stores. The reason is fairly intuitive: in a column-store, entire columns are stored contiguously --- in other words, a series of values from the same attribute domain are stored consecutively. In a row-store, values from different attribute domains are interspersed, thereby reducing the self-similarity of the data. In general, the more self-similarity (lower entropy) you have in a dataset, the more compressible it is. Hence, column-stores are more compressible than row-stores.

In general, compression rates are very sensitive to the particular dataset that is being compressed. Therefore it is impossible to make any kind of guarantees about how much a particular database system/compression algorithm will compress an arbitrary dataset. However, as a general rule of thumb, it is reasonable to expect around 8X compression if a column-store is used on many kinds of datasets. 8X compression means that the compressed dataset is 1/8th the original size, and scan-based queries over the dataset can thus proceed approximately 8 times as fast. This stellar compression and resulting performance improvements are a major contributor to the recent popularity of column-stores.

It is precisely this renowned compression of column-stores which makes the compression rate of RainStor (a recent Teradata acquisition) so impressive in comparison. RainStor claims a factor of 5 times more compression than what column-stores are able to achieve on the same datasets, and 40X compression overall.

Although the reason why column-stores compress data better than row-stores is fairly intuitive, the reason why RainStor can compress data better than column-stores is less intuitive. Therefore, we will now explain this in more detail.

Take for example the following table, which is a subset of a table describing orders from a particular retail enterprise that sells bicycles and related parts. (A real table would have many more rows and columns, but we keep this example simple so that it is easier to understand what is going on).

Record   Order date    Ship date     Product      Price
1        03/22/2015    03/23/2015    “bicycle”    300
2        03/22/2015    03/24/2015    “lock”       18
3        03/22/2015    03/24/2015    “tire”       70
4        03/22/2015    03/23/2015    “lock”       18
5        03/22/2015    03/24/2015    “bicycle”    250
6        03/22/2015    03/23/2015    “bicycle”    280
7        03/22/2015    03/23/2015    “tire”       70
8        03/22/2015    03/23/2015    “lock”       18
9        03/22/2015    03/24/2015    “bicycle”    280
10       03/23/2015    03/24/2015    “lock”       18
11       03/23/2015    03/25/2015    “bicycle”    300
12       03/23/2015    03/24/2015    “bicycle”    280
13       03/23/2015    03/24/2015    “tire”       70
14       03/23/2015    03/25/2015    “bicycle”    250
15       03/23/2015    03/25/2015    “bicycle”    280

 

The table contains 15 records and shows four attributes --- the order and ship dates of a particular purchase, the product that was purchased, and the purchase price. Note that there is a relationship between some of these columns --- in particular, the ship date is usually 1 or 2 days after the order date, and the prices of the various products are usually consistent across orders, though there may be slight variations in price depending on what coupons the customer used to make the purchase.

A column-store would likely use “run-length encoding” to compress the order date column. Since records are sorted by order date, this would compress the column to near its minimum --- it can be compressed as (03/22/2015, 9); (03/23/2015, 6) --- which indicates that 03/22/2015 is repeated 9 straight times, followed by 03/23/2015, which is repeated 6 times. The ship date column, although not sorted, is still very compressible, as each value can be expressed using a small number of bits in terms of how much larger (or smaller) it is than the previous value in the column. However, the other two columns --- product and price --- would likely be compressed using a variant of dictionary compression, where each value is mapped to the minimal number of bits needed to represent it. For large datasets, where there are many unique values for price (or even for product), the number of bits needed to represent a dictionary entry is non-trivial, and the same dictionary entry is repeated in the compressed dataset for every repeated value in the original dataset.
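
A minimal Python sketch of the two column-store techniques just described, applied to the example table (toy encodings for illustration, not any particular product’s on-disk format):

```python
order_dates = ["03/22/2015"] * 9 + ["03/23/2015"] * 6
products = ["bicycle", "lock", "tire", "lock", "bicycle", "bicycle", "tire",
            "lock", "bicycle", "lock", "bicycle", "bicycle", "tire", "bicycle",
            "bicycle"]

def run_length_encode(column):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]

def dictionary_encode(column):
    """Map each distinct value to a small integer code; store one code per row."""
    codes, encoded = {}, []
    for value in column:
        encoded.append(codes.setdefault(value, len(codes)))
    return codes, encoded

print(run_length_encode(order_dates))
# [('03/22/2015', 9), ('03/23/2015', 6)], the same compression described above

dictionary, encoded_products = dictionary_encode(products)
print(dictionary)         # {'bicycle': 0, 'lock': 1, 'tire': 2}
print(encoded_products)   # one code per row; the code repeats for every repeated value
```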

In contrast, in RainStor, every unique value in the dataset is stored once (and only once), and every record is represented as a binary tree, where a breadth-first traversal of the tree enables the reconstruction of the original record. For example, the table shown above is compressed in RainStor using the forest of binary trees shown below. There are 15 binary trees (each of the 15 roots of these trees is shown using the green circles at the top of the figure), corresponding to the 15 records in the original dataset.

Forest of Binary Trees Compression

For example, the binary tree corresponding to record 1 is shown on the left side of the figure. The root points to two children --- the internal nodes “A” and “E”. In turn, node “A” points to 03/22/2015 (corresponding to the order date of record 1) and to 03/23/2015 (corresponding to the ship date of record 1). Node “E” points to “bicycle” (corresponding to the product of record 1) and “300” (corresponding to the price of record 1).

Note that records 4, 6, and 7 also have an order date of 03/22/2015 and a ship date of 03/23/2015. Therefore, the roots of the binary trees corresponding to those records also point to internal node “A”. Similarly, note that record 11 also is associated with the purchase of a bicycle for $300. Therefore, the root for record 11 also points to internal node “E”.

These shared internal nodes are what makes RainStor’s compression algorithm fundamentally different from any algorithm that a column-store is capable of performing. Column-stores are forced to create dictionaries and search for patterns only within individual columns. In contrast, RainStor’s compression algorithm finds patterns across different columns --- identifying the relationship between ship date and order date and the relationship between product and price, and leveraging these relationships to share branches in the trees that are formed, thereby eliminating redundant information. RainStor thus has fundamentally more room to search for patterns in the dataset and compress data by referencing these patterns via the (compressed) location of the root of the shared branch.
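
The sketch below is a simplified rendering of that idea in Python (illustrative data structures, not RainStor’s actual format): each record becomes a root pointing at two shared internal nodes, one deduplicating the (order date, ship date) combination and one deduplicating the (product, price) combination, so repeated combinations are stored only once while every record remains reconstructable:

```python
records = [
    ("03/22/2015", "03/23/2015", "bicycle", 300),
    ("03/22/2015", "03/24/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "tire", 70),
    ("03/22/2015", "03/23/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "bicycle", 250),
    ("03/22/2015", "03/23/2015", "bicycle", 280),
    ("03/22/2015", "03/23/2015", "tire", 70),
    ("03/22/2015", "03/23/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "bicycle", 280),
    ("03/23/2015", "03/24/2015", "lock", 18),
    ("03/23/2015", "03/25/2015", "bicycle", 300),
    ("03/23/2015", "03/24/2015", "bicycle", 280),
    ("03/23/2015", "03/24/2015", "tire", 70),
    ("03/23/2015", "03/25/2015", "bicycle", 250),
    ("03/23/2015", "03/25/2015", "bicycle", 280),
]

date_nodes, item_nodes, roots = {}, {}, []
for order_date, ship_date, product, price in records:
    # Internal node shared by every record with this (order date, ship date) pair.
    d = date_nodes.setdefault((order_date, ship_date), len(date_nodes))
    # Internal node shared by every record with this (product, price) pair.
    i = item_nodes.setdefault((product, price), len(item_nodes))
    roots.append((d, i))  # one root per record, pointing at its two children

print(len(records), "records ->", len(date_nodes), "date nodes and",
      len(item_nodes), "product/price nodes")
# Records 1, 4, 6 and 7 share the same date node ("A" in the figure);
# records 1 and 11 share the same product/price node ("E" in the figure).
print(roots[0], roots[3], roots[5], roots[6], roots[10])

def reconstruct(record_number):
    """Rebuild an original record from its root, as a breadth-first traversal would."""
    d, i = roots[record_number - 1]
    dates = next(k for k, v in date_nodes.items() if v == d)
    item = next(k for k, v in item_nodes.items() if v == i)
    return dates + item

print(reconstruct(1))  # ('03/22/2015', '03/23/2015', 'bicycle', 300)
```

Nine shared internal nodes (four date pairs and five product/price pairs) plus the distinct leaf values carry all 15 records; that cross-column sharing is exactly what a per-column dictionary cannot express.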

For a traditional archiving solution, compression rate is arguably the most important feature (right up there with immutability). Indeed, RainStor’s compression algorithm enables it to be used for archival use-cases, and RainStor provides all of the additional features you would expect from an archiving solution: encryption, LDAP/AD/PAM/Kerberos/PCI authentication and security, audit trails and logging, retention rules, expiry policies, and integrated implementation of existing compliance standards (e.g. SEC 17a-4).

However, what brings RainStor to the next level in the archival solutions market is that it is an “active” archive, meaning that the data that is managed by RainStor can be queried at high performance. RainStor provides a mature SQL stack for native querying of compressed RainStor data, including ANSI SQL 1992 and 2003 parsers, and a full MPP query execution engine. For enterprises with Hadoop clusters, RainStor is fully integrated with the Cloudera and Hortonworks distributions of Hadoop --- RainStor compressed data files can be partitioned over an HDFS cluster, and queried in parallel with HiveQL (or MapReduce or Pig). Furthermore, RainStor integrates with YARN for resource management, with HCatalog for metadata management, and with Ambari for system monitoring and management.

The reason why most archival solutions are not “active” is that the compression algorithms used to reduce the data size before archival are so heavy-weight that significant processing resources must be invested in decompressing the data before it can be queried. Therefore, it is preferable to leave the data archived in compressed form, and only decompress it at times of significant need. In general, a user should expect significant query performance reductions relative to querying uncompressed data, in order to account for the additional decompression time.

The beauty of RainStor’s compression algorithm is that even though it gets compression ratios comparable to other archival products, its compression algorithm is not so heavy-weight that the data must be decompressed prior to querying it. In particular, the binary tree structures shown above are actually fairly straightforward to perform query operations on directly, without requiring decompression prior to access. For example, a count distinct or a group-by operation can be performed via a scan of the leaves of the binary trees. Furthermore, selections can be performed via a reverse traversal of the binary trees from the leaves that match the selection predicate. In general, since there is a one-to-one mapping of records in the uncompressed dataset to the binary trees in RainStor’s compressed files, all query operations can be expressed in terms of operations on these binary trees. Therefore, RainStor queries can benefit from the I/O improvement of scanning in less data (due to the smaller size of the compressed files on disk/memory) without paying the decompression cost to fully decompress these compressed files after they are read from storage. This leads to RainStor’s claims of 2X-100X performance improvement on most queries --- an industry-leading claim in the archival market.
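
Continuing the simplified representation from the earlier sketch (again an illustration, not RainStor’s actual structures), the point that queries can run without decompression looks roughly like this: a count distinct touches only the shared leaves, and a selection walks back from the matching leaves to the records that reference them:

```python
# Toy "compressed" form: shared (product, price) nodes plus, for each of the 15
# records, the index of the node it points to; no per-record copies of the values.
item_nodes = [("bicycle", 300), ("lock", 18), ("tire", 70),
              ("bicycle", 250), ("bicycle", 280)]
record_item = [0, 1, 2, 1, 3, 4, 2, 1, 4, 1, 0, 4, 2, 3, 4]  # records 1..15

# COUNT(DISTINCT product): scan the shared leaves only, never the 15 records.
distinct_products = {product for product, _ in item_nodes}
print(len(distinct_products))  # 3

# SELECT ... WHERE product = 'lock': find the matching nodes, then walk back to
# the records that reference them; still no decompression of the stored data.
lock_nodes = {n for n, (product, _) in enumerate(item_nodes) if product == "lock"}
matching_records = [r + 1 for r, node in enumerate(record_item) if node in lock_nodes]
print(matching_records)  # records 2, 4, 8 and 10
```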

In short, RainStor’s strong claims around compression and performance are backed up by the technology that is used under the covers. Its compression algorithm is able to identify and remove redundancy both within and across columns. Furthermore, the resulting data structures produced by the algorithm are amenable to direct operation on the compressed data. This allows the compressed files to be queried at high performance, and positions RainStor as a leading active-archive solution.

_________________________________________________________________________


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

Real-Time SAP® Analytics: a look back and ahead

Posted on: August 18th, 2014 by Patrick Teunissen

 

On April 8, I hosted a webinar and my guest was Neil Raden, an independent data warehouse analyst. The topic of the webinar was: “Accessing of SAP ERP data for business analytics purposes” – which was built upon Neil’s findings in his recent white paper about the complexities of the integration of SAP data into the enterprise data warehouse. The attendance and participation in the webinar clearly showed that there is a lot of interest and expertise in this space. As I think back about the questions we received, both Neil and I were surprised by the number of questions that were related to “real-time analytics on SAP.”

Something has drastically changed in the SAP community!

Note: The topic of real time analytics is not new! I won’t forget Neil’s reaction when the questions came up. It was like he was in a time warp back to the early 2000’s when he first wrote about that topic. Interestingly, Neil’s work is still very relevant today.

This made me wonder why this is so prominent in the SAP space now? What has changed in the SAP community? What has changed in the needs of the business?

My hypothesis is that when Neil originally wrote his paper (in 2003), R/3 was SAP (or SAP was R/3, whichever order you prefer) and integration with other applications or databases was not something that SAP had on the radar yet. This began to change when SAP BW became more popular and gained even more traction with the release of SAP’s suite of tools and modules (CRM, SRM, BPC, MDM, etc.) -- although these solutions still clearly had the true SAP ‘Made in Germany’ DNA. Then came SAP’s planning tool APO, Netweaver XI (later PI) and the 2007 acquisition of Business Objects (including BODS), which all accelerated SAP’s application integration techniques.

With Netweaver XI/PI and Business Objects Data Services, it became possible to integrate SAP R/3 in real time, making use of advanced messaging techniques like IDocs, RFCs, and BAPIs. These techniques all work very well for transaction system integration (EAI); however, they do not have what it takes to provide real-time data feeds to the integrated data warehouse. At best a hybrid approach is possible. Back in 2000 my team worked on such a hybrid project at Hunter Douglas (Luxaflex). They combined classical ABAP-driven batch loads for managerial reports with real-time capabilities (BAPI calls) for their more operational reporting needs. That was state of the art in those days!

Finally, in 2010 SAP acquired Sybase and added a best-of-breed data replication tool to the portfolio. With this integration technique, changed data is captured directly from the database, taking the load off the R/3 application servers. This offers huge advantages, so it makes sense that this is now the recommended technique for loading data into the SAP HANA appliance.

“What has changed is that SAP has put the need for real-time data integration with R/3 on the (road) map!”

The main feature of our upcoming release of Teradata Analytics for SAP Solutions version 2.2 is a new data replication technique. As if designed to prove my case: 10 years ago I was in the middle of a project for a large multinational company. One of my lead engineers, Arno Luijten, came to me with a proposal to try out a data replication tool to address the latencies introduced by the extraction of large volumes of changed data from SAP. We didn’t get very far at the time, because neither the technology nor the business expectations were ready for it. Fast forward to 2014 and we’re re-engaged with this same customer. Luckily, this time the business needs and the technology capabilities are ready to deliver!

In the coming months my team and I would like to take you on our SAP analytics journey.

In my next posts we will dive into the definition (and relativity) of real-time analytics and discuss the technical complexities of dealing with SAP including the pool and cluster tables. So, I hope I got you hooked for the rest of the series!

Garbage In-Memory, Expensive Garbage

Posted on: July 7th, 2014 by Patrick Teunissen

 

A first anniversary is always special, and in May I marked my first with Teradata. In my previous lives I celebrated almost ten years with Shell and seventeen years creating my own businesses focused on data warehousing and business intelligence solutions for SAP. With my last business, “NewFrontiers,” I leveraged all twenty-seven years of ERP experience to develop a shrink-wrapped solution to enable SAP analytics.

In all that time, through my first anniversary with Teradata, the logical design of SAP has stayed the same. To be clear, when I say SAP, I mean R/3, or ‘R/2 with a mouse’ if you’re old enough to remember. Today R/3 is also known as the SAP Business Suite, ERP or whatever. Anyway, when I talk about SAP I mean the application that made the company rightfully world famous and that is used for transaction processing by almost all large multinational businesses.

My core responsibility at Teradata is the engineering of the analytical solution for SAP. My first order of business was focusing my team on delivering an end-to-end business analytic product suite to analyze ERP data that is optimized for Teradata. Since completing our first release, my attention turned to adding new features to help companies take their SAP analytics to the next level. To this end, my team is just putting the finishing touches on a near real-time capability based on data replication technology. This will definitely be the topic of upcoming blogs.

Over the past year, the integration and optimization process has greatly expanded my understanding of the differentiated Teradata capabilities. The one capability that draws the attention of people like me, the ‘SAP guys and girls’, is Teradata Intelligent Memory. In-memory computing has become a popular topic in the SAP community, and the computer’s main memory is an important part of Teradata’s Intelligent Memory. However, Intelligent Memory is more than “In-Memory” -- because with Intelligent Memory, the database addresses the fact that not all memory is created equal and delivers a solution that uses the “right memory for the right purpose”. In this solution, the most frequently used data – the hottest -- is stored In-Memory; the warm data is processed from a solid state drive (SSD), and colder, less frequently accessed data from a hard disc drive (HDD). This solution allows your business to make decisions on all of your SAP and non-SAP data while coupling in-memory performance with spinning disc economics.
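
Here is a minimal Python sketch of the “right memory for the right purpose” idea: a toy temperature-based placement policy with made-up access counts and arbitrary tier fractions, illustrating the concept rather than Teradata Intelligent Memory’s actual algorithm:

```python
from collections import Counter

# Made-up access log: each entry is the id of a data block touched by a query.
access_log = ["blk-1", "blk-1", "blk-2", "blk-1", "blk-3", "blk-2", "blk-1",
              "blk-4", "blk-1", "blk-2", "blk-5", "blk-1", "blk-2", "blk-3"]
all_blocks = [f"blk-{i}" for i in range(1, 11)]  # blocks 6-10 are never touched

def place_by_temperature(blocks, log, hot_fraction=0.2, warm_fraction=0.3):
    """Rank blocks by access frequency and assign them to memory, SSD, or HDD."""
    counts = Counter(log)
    ranked = sorted(blocks, key=lambda b: counts[b], reverse=True)
    n_hot = max(1, int(len(ranked) * hot_fraction))
    n_warm = max(1, int(len(ranked) * warm_fraction))
    return {
        "memory (hot)": ranked[:n_hot],
        "SSD (warm)":   ranked[n_hot:n_hot + n_warm],
        "HDD (cold)":   ranked[n_hot + n_warm:],
    }

for tier, blocks in place_by_temperature(all_blocks, access_log).items():
    print(f"{tier:12}: {blocks}")
# Only the most frequently touched blocks occupy expensive memory; rarely used
# history, like the years of old SAP data discussed below, stays on cheap disk.
```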

This concept of using the “right memory for the right purpose” is very compelling for our Teradata Analytics for SAP solutions. Often when I explain what Teradata Analytics for SAP Solutions does, I draw a line between DATA and CONTEXT. Computers need DATA like cars need fuel and the CONTEXT is where you drive the car. Most people do not go the same place every time but they do go to some places more frequently than others (e.g. work, freeways, coffee shops) and under more time pressure (e.g. traffic).

In this analogy, most organizations almost always start building an “SAP data warehouse” by loading all DATA kept in the production database of the ERP system. We call that process the initial load. In the Teradata world we often have to do this multiple times, because building an integrated data warehouse usually involves sourcing from multiple SAP ERPs. Typically, these ERPs vary in age, history, version, governance, MDM, etc. Archival is a non-trivial process in the SAP world, and the majority of the SAP systems I have seen are carrying many years of old data. Loading all this SAP data In-Memory is an expensive and reckless thing to do.

Teradata Intelligent Memory provides CONTEXT by storing the hot SAP data In-Memory, guaranteeing lightning-fast response times. It then automatically moves the less frequently accessed data to lower-cost, lower-performance storage across the SSD and HDD media spectrum. The resulting combination of Teradata Analytics for SAP coupled with Teradata’s Intelligent Memory delivers in-memory performance with very high memory hit rates at a fraction of the cost of ‘In-Memory’ solutions. And in this business, costs are a huge priority.

The title of this blog is a variation on the good old “Garbage In, Garbage Out” (GIGO) phrase. In-Memory is a great feature, but not all data needs to go there! Make use of it in an intelligent way and don’t use it as a garbage dump, because for that it is too expensive.

Patrick Teunissen is the Engineering Director at Teradata responsible for the Research & Development of the Teradata Analytics for SAP® Solutions at Teradata Labs in the Netherlands. He is the founder of NewFrontiers which was acquired by Teradata in May 2013.

Endnotes:
1 Needless to say I am referring to SAP’s HANA database developments.

2 Data that is older than 2 years can be classified as old. Transactions, like sales and costs, are often compared with a budget/plan and the previous year, sometimes with the year before that, but hardly ever with data older than that.

MongoDB and Teradata QueryGrid – Even Better Together

Posted on: June 19th, 2014 by Dan Graham

 

It wasn’t so long ago that NoSQL products were considered competitors with relational databases (RDBMS). Well, for some workloads they still are. But Teradata is an analytic RDBMS, which is quite different from, and complementary to, MongoDB. Hence, we are teaming up for the benefit of mutual customers.

The collaboration of MongoDB with Teradata represents a virtuous cycle, a symbiotic exchange of value. This virtuous cycle starts when data is exported from MongoDB to Teradata’s Data Warehouse where it is analyzed and enriched, then sent back to MongoDB to be exploited further. Let me give an example.

An eCommerce retailer builds a website to sell clothing, toys, etc. They use MongoDB because of the flexibility to manage constantly changing web pages, product offers, and marketing campaigns. This front office application exports JSON data to the back-office data warehouse throughout the business day. Automated processes analyze the data and enrich it, calculating next best offers, buyer propensities, consumer profitability scores, inventory depletions, dynamic discounts, and fraud detection. Managers and data scientists also sift through sales results looking for trends and opportunities using dashboards, predictive analytics, visualization, and OLAP. Throughout the day, the data warehouse sends analysis results back to MongoDB where they are used to enhance the visitor experience and improve sales. Then we do it again. It’s a cycle with positive benefits for the front and back office.

Teradata Data Warehouses have been used in this scenario many times with telecommunications, banks, retailers, and other companies. But several things are different working with MongoDB in this scenario. First, MongoDB uses JSON data. This is crucial to frequently changing data formats where new fields are added on a daily basis. Historically, RDBMS’s did not support semi-structured JSON data. Furthermore, the process of changing a database schema to support frequently changing JSON formats took weeks to get through governance committees.

Nowadays, the Teradata Data Warehouse ingests native JSON and accesses it through simple SQL commands. Furthermore, once a field in a table is defined as JSON, the frequently changing JSON structures flow right into the data warehouse without spending weeks in governance committees. Cool! This is a necessary big step forward for the data warehouse. Teradata Data Warehouses can ingest and analyze JSON data easily using any BI tool or ETL tool our customers prefer.
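
As a rough illustration of why schema-on-read JSON matters, here is a small sketch in plain Python with made-up documents (not Teradata’s JSON type or its SQL syntax): a new field can appear at any time, existing queries that never reference it keep working, and a new query can pick it up immediately:

```python
import json

# Made-up export from the web application; note that the third document carries
# a brand-new "coupon" field that nobody declared in advance.
raw_events = [
    '{"customer_id": 17, "action": "view", "product": "bicycle"}',
    '{"customer_id": 17, "action": "purchase", "product": "lock", "price": 18}',
    '{"customer_id": 42, "action": "purchase", "product": "tire", "price": 70, "coupon": "SPRING5"}',
]
events = [json.loads(e) for e in raw_events]

# An existing "query" (total purchase revenue) never mentions "coupon", so the
# new field flows through without a schema change or a governance cycle.
revenue = sum(e.get("price", 0) for e in events if e["action"] == "purchase")
print(revenue)  # 88

# A new query can start using the new field the moment it appears.
print([e["customer_id"] for e in events if "coupon" in e])  # [42]
```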

Another difference is that MongoDB is a scale-out system, growing to tens or hundreds of server nodes in a cluster. Hmmm. Teradata systems are also scale-out systems. So how would you exchange data between Teradata Data Warehouse server nodes and MongoDB server nodes? The simple answer is to export JSON to flat files and import them to the other system. Mutual customers are already doing this. Can we do better than import/export? Can we add an interactive dynamic data exchange? Yes, and this is the near term goal of our partnership --connecting Teradata QueryGrid to MongoDB clusters.

Teradata QueryGrid and MongoDB

Teradata QueryGrid is a capability in the data warehouse that allows a business user to issue requests via popular business intelligence tools such as SAS®, Tableau®, or MicroStrategy®. The user issues a query which runs inside the Teradata Data Warehouse. This query reaches across the network to the MongoDB cluster. JSON data is brought back, joined to relational tables, sorted, summarized, analyzed, and displayed to the business user. All of this is done exceptionally fast and completely invisible to the business user. It’s easy! We like easy.
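
Conceptually (and only conceptually: the sketch below is plain Python over in-memory lists, not the QueryGrid API or a live MongoDB connection), the pattern is a single query that pulls matching documents from the document store and joins them to relational rows before returning one combined answer:

```python
# Stand-ins for the two systems: relational customer rows in the warehouse and
# JSON-style session documents in the document store.
warehouse_customers = [
    {"customer_id": 17, "segment": "commuter", "lifetime_value": 1450},
    {"customer_id": 42, "segment": "racer", "lifetime_value": 5200},
]
mongo_sessions = [
    {"customer_id": 17, "pages_viewed": 12, "cart": ["lock"]},
    {"customer_id": 42, "pages_viewed": 3, "cart": ["tire", "bicycle"]},
]

# "Reach across" to the remote cluster (here just a list), bring the documents
# back, and join them to the relational rows on customer_id.
sessions_by_customer = {doc["customer_id"]: doc for doc in mongo_sessions}
joined = [
    {**customer, **sessions_by_customer.get(customer["customer_id"], {})}
    for customer in warehouse_customers
]
for row in joined:
    print(row)
# The business user sees a single result set; which system each column came
# from is invisible to them, which is the point of the federation.
```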

QueryGrid can also be bi-directional, putting the results of an analysis back into the MongoDB server nodes. The two companies are working on hooking up Teradata QueryGrid right now and we expect to have the solution early in 2015.

The business benefit of connecting Teradata QueryGrid to MongoDB is that data can be exchanged in near real time. That is, a business user can run a query that exchanges data with MongoDB in seconds (or a few minutes if the data volume is huge). This means new promotions and pricing can be deployed from the data warehouse to MongoDB with a few mouse clicks. It means Marketing people can analyze consumer behavior on the retail website throughout the day, making adjustments to increase sales minutes later. And of course, applications with mobile phones, sensors, banking, telecommunications, healthcare and others will get value from this partnership too.

So why does the leading NoSQL vendor partner with the best in class analytic RDBMS? Because they are highly complementary solutions that together provide a virtuous cycle of value to each other. MongoDB and Teradata are already working together well in some sites. And soon we will do even better.

Come visit our Booth at MongoDB World and attend the session “The Top 5 Things to Know About Integrating MongoDB into Your Data Warehouse” Riverside Suite, 3:10 p.m., June 24. You can read more about the partnership between Teradata and MongoDB in this news release issued earlier today. Also, check out the MongoDB blog.

PS: The MongoDB people have been outstanding to work with on all levels. Kudos to Edouard, Max, Sandeep, Rebecca, and others. Great people!

 

It happens every few years and it’s happening again. A new technology comes along and a significant segment of the IT and business community want to toss out everything we’ve learned over the past 60 years and start fresh. We “discover” that we’ve been wasting time applying unnecessary rigor and bureaucracy to our projects. No longer should we have to wait three to six months or longer to deliver technical solutions to the business. We can turn these things around in three to six days or even less.

In the mid 1990’s, I was part of a team that developed a “pilot” object-oriented, client-server (remember when these were the hot buzzwords?) application to replenish raw materials for a manufacturing function. We were upending the traditional mainframe world by delivering a solution quickly and iteratively with a small team. When the end users started using the application in real life, it was clear they were going to rely on it to do their jobs every day. Wait, was this a pilot or…? I would come into work in the morning, walk into a special room that housed the application and database servers, check the logs, note any errors, make whatever fixes needed to be made, re-run jobs, and so on.

It wasn’t long before this work began to interfere with my next project, and the end users became frustrated when I wasn’t available to fix problems quickly. It took us a while and several conversations with operations to determine that “production” didn’t just mean “the mainframe”. “Production” meant that people were relying on the solution on a regular basis to do their jobs. So we backtracked and started talking about what kind of availability guarantees we could make, how backup and recovery should work, how we could transition monitoring and maintenance to operations, and so on. In other words, we realized what we needed was a traditional IT project that just happened to leverage newer technologies.

This same scenario is happening today with Hadoop and related tools. When I visit client organizations, a frightening number will have at least one serious person saying something like, “I really don’t think ‘data warehousing’ makes sense any more. It takes too long. We should put all our data in Hadoop and let our end users access whatever they want.” It is indeed a great idea to establish an environment that enables exploration and quick-turnaround analysis against raw data and production data. But to position this approach as a core data and analytics strategy is nothing short of professional malpractice.

The problem is that people are confusing experimentation with IT projects. There is a place for both, and there always has been. Experimentation (or discovery, research, ad-hoc analysis, or whatever term you wish to use) should have lightweight processes and data management practices – it requires prioritization of analysis activity, security and privacy policies and implementation, some understanding of available data, and so on, but it should not be overburdened with the typical rigor required of projects that are building solutions destined for production. Once a prototype is ready to be used on a regular basis for important business functions, that solution should be built through a rigorous IT project leveraging an appropriate – dare I say it – solution development life cycle (SDLC), along with a comprehensive enterprise architecture plan including, yes, a data warehouse that provides integrated, shared, and trusted production data.

An experimental prototype should never be “promoted” to a production environment. That’s what a project is for. Experimentation can be accomplished with Hadoop, relational technology, Microsoft Office, and many other technologies. These same technologies can also be used for production solutions. So, it’s not that “things are done differently and more quickly in Hadoop”. Instead, it’s more appropriate to say that experimentation is different than an IT project, regardless of technology.

Yes, we should do everything we can to reduce unnecessary paperwork and to speed up delivery using proper objective setting, scoping, and agile development techniques. But that is different than abandoning rigor altogether. In fact, using newer technologies in IT projects requires more attention to detail, not less, because we have to take the maturity of the technology into consideration. Can it meet the service level needs of a particular solution? This needs to be asked and examined formally within the project.

Attempting to build production solutions using ad-hoc, experimental data preparation and analysis techniques is like building a modern skyscraper with a grass hut mentality. It just doesn’t make any sense.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries. 

How $illy is Cost per Terabyte?

Posted on: May 16th, 2014 by Dan Graham

 

Without question, the best price per terabyte anywhere in the technology industry is the home PC. You can get a Dell® PC at about $400 and it comes with a terabyte disk drive. WOW! I found one PC for $319 per TB! Teradata, Oracle, IBM, and all the other vendors are headed for the scrap heap of history with those kinds of prices. I’m sending out my resume in the morning. . . How silly is that? Yet when comparing massively parallel database computers – the culmination of 50 years of data processing innovation-- many organizations overemphasize $/TB and disregard total value. They hammer the vendors to lower the price, lower the price, until – you guessed it – the vendors hit the right price by also lowering the value. This reached a crescendo over the last few years following the worldwide recession. Saving money became much more important than producing business value. I get it – a corporation runs on cost containment and revenue generation. As it turns out, a data warehouse is a vital tool enabling both business objectives – especially in hard economic times.

I understand why CFOs and procurement people obsess on dollars per terabyte. They can’t understand all the technical geek-speak but they do know that hollering about cost per terabyte makes vendors and CIOs scramble. OK, that seems worthwhile but there is a flaw in this thinking when $/TB is the first and foremost buying criteria.

By analogy, would you buy a car based on price alone? No. Even if you are strapped for money, you search for features and value in the collection of cars that are affordable. Price is one decision point, not THE decision maker. I always buy a little beyond my means to get the highest quality. Purchase price is a point in time angst but I have to live with that car for years. It’s never failed me and I am always satisfied years later.

$/TB as Proxy for All the Value
System price is crucial at the beginning of a purchasing process to select candidates, and again at the end when real money is being exchanged. In between, there is often an assumption that candidate systems can all do the same job. Well, no two vendor systems are identical, especially massively parallel data warehouses. Indeed, they vary dramatically. But let’s assume for a moment that two vendor products are 80% equivalent in the workloads they can do and the labor it takes to manage them.

What is always lost in these comparisons is the actual performance of the queries as measured at the business user’s desk. Massively parallel databases are highly differentiated. Some are quite slow when compared to others. Some are lightning fast on table scans then choke when complex joins are needed. Some can only handle a dozen users at a time. Many flounder running mixed workloads. Some are good enough at simple queries on simple database designs, but collapse when complex queries are required. If you are just starting out, simple queries may be OK. But to become an analytic competitor, really complex queries are inevitably de rigueur. Plus, any successful analytic database project will see major expansions of user demands and query complexity over the first 3-5 years, then incremental after that. Or is it the other way around --top quality analytic databases encourage users to ask more complex questions? Hmmm.

Performance Performance Performance
The primary purpose of databases has always been performance, performance, performance. Number two is high availability since performance is uninteresting when the system is offline. Over-emphasizing cost per terabyte drives out the value of performance. But if the buyer wants vendors to optimize for cost per terabyte, query performance and software features will be reduced to meet that goal.

This means having employees do extra work since the system is no longer doing it. This means user productivity and accuracy is reduced as dozens of data warehouse users take extra minutes to do what could have been done in seconds. It means not rerunning an analysis four times to improve accuracy because it takes too long. It means users interact less with the data and get fewer brilliant insights because it takes too long to try out ideas. And it means not getting that rush report to the executives by 11AM when they ask for it at 10:40. All of this angst is hard to measure but the business user surely feels it.

The better metric has always been price/performance. Let me suggest an even more rounded (wink) view of buying criteria and priority:

[Chart: buying criteria and their relative priority percentages]

No, today is not the day to delve deeply into the percentages on this chart. But suffice it to say they are derived from analyst house research and other sources I’ve witnessed over the years. And yes, they vary a few percentage points for every organization. Instead of price, TCO is dramatically more important to the CIO and CFO “who has to live with this car for years.” Performance is vital to the business user – cut this back and you might ask “why pretend to have an analytic database since users will avoid running queries?” Features and functions are something the programmers and DBAs love and should not be overlooked.

Teradata – the Low Price Leader?
Changes in supplier costs and price pressures from the recent recession are producing bargains for data warehouse buyers. Take a look at Teradata list prices from 2Q2014.

[Table: Teradata platform list prices, 2Q2014]

Each Teradata platform described above  includes Teradata quality hardware, the Teradata Database, utilities, and storage using uncompressed data. These are list prices so let the negotiations begin! With $3.8K per terabyte, anyone can afford Teradata quality now.

Obviously you noticed the $34K/terabyte systems. Need I say that these are the most robust, highest performing systems in the data warehouse market? Both Gartner’s Magic Quadrant and Forrester’s Data Warehouse Wave assessments rate Teradata the top data warehouse vendor as of 1Q14. These systems support large user populations, millions of concurrent queries per day, integrated data, sub-second response time on many queries, row level security, and dozens of applications per system. The Active Enterprise Data Warehouse is the top of the line with solid state disks, the fastest configuration, capacity on demand, and many other upscale capabilities. The Integrated Big Data Platform is plenty fast but not in the same class as the Active Enterprise Data Warehouse. There are a dozen great use cases for this cost conscious machine but 500 users with enormously complex queries won’t work on smaller configurations. But it quickly pays for itself.

Chant: Dollars per Terabyte, Dollars per Terabyte ...
The primary value proposition on the lips of the NoSQL and Hadoop vendors is always “cost per terabyte.” This is common with new products in new markets – we’ve heard it before from multiple startup MPP vendors. It’s impossible to charge top dollar for release 1.0 or 2.0 since they are still fairly incomplete. So when you have little in the way of differentiated value, dollars per terabyte is the chant. But is five-year-old open source software really equivalent to 30 years of R&D investment in relational database performance? Not.

I looked at InformationWeek’s article on “10 Hadoop Hardware Leaders” (4/24/2014) which includes the Dell R720XD servers as a leader in Hadoop hardware. Pricing out an R720XD on the Dell website, I found a server with 128GB of memory and twelve 1.2TB disks comes in at $15,276. That’s $1060 per terabyte. Cool. However, Hadoop needs two replicas of all data to provide basic high availability. That means you need to buy three nodes. This makes the cost per terabyte $3182. Then you add some free software and lots of do-it-yourself labor. Seems to me that puts it in the same price band as the Integrated Big Data Platform. But the software on that machine is the same Teradata Database running on the Active Enterprise Data Warehouse. Sounds like a bargain to me!
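
The arithmetic behind those numbers, for anyone who wants to rerun it with their own quotes (a quick sketch using the prices as given above):

```python
server_price = 15_276        # Dell R720XD configuration quoted above, in USD
disks = 12
disk_tb = 1.2                # terabytes per disk
raw_tb = disks * disk_tb     # 14.4 TB of raw storage per node

price_per_tb_single = server_price / raw_tb
print(f"${price_per_tb_single:,.0f} per TB for a single copy")   # roughly the $1,060 figure above

replicas = 3                 # Hadoop keeps the original block plus two replicas
price_per_usable_tb = server_price * replicas / raw_tb
print(f"${price_per_usable_tb:,.0f} per TB once replication is paid for")  # the $3,182 figure above
```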

Conclusion
Over-reliance on $/TB does bad things to your business users’ productivity. Startups always make this a gut-wrenching issue for customers to solve, but as their products mature, that noise fades into the background. I recommend a well-rounded assessment of any vendor product that serves many business users and needs.

Ok, so now, I’m hooking up 50 terabytes of storage to my whiz bang 3.6Ghz Intel® home office Dell PC. I’m anxious to know how long it will take to scan and sort 20 terabytes. I’ll let you know tomorrow, or the next day, or whenever it finishes.

Dan Graham is responsible for strategy, go-to-market success, and competitive differentiation for the Active Data Warehouse platform and Extreme Performance Appliances at Teradata.

 

Every self-respecting data management professional knows that “business alignment” is critical to the success of a data and analytics program. But what does business alignment really mean? How do you know if your program is aligned to the business?

Before describing what business alignment is, let me first list what it is not:
• Interviewing end users to understand their needs for data and analytics
• Recruiting a highly placed and influential executive sponsor
• Documenting a high return on investment
• Gaining agreement on the data strategy from multiple business areas
• Establishing a business-led data governance program
• Establishing a process to prioritize data requests and issues

It’s not that the items on this list are bad ideas. It’s just that they are missing a key ingredient that, in my experience with dozens of clients, makes all the difference. None of these items are even the best first step in developing a data strategy.

So what’s wrong with the list? Let me illustrate with an example. I was working with a team developing a data strategy for a large manufacturing company. We were interviewing a couple of high level managers in marketing, and it went something like this:

Me: What are some of the major business initiatives that you’re expected to deliver this year and next year? Do you have some thoughts on the data and analytics that will be needed within those initiatives?

Marketing manager: Sure, well, we have this targeted marketing initiative that we think will be a big winner. When a customer contacts us for warranty information, we think we can cross-sell products from another business unit… here’s a spreadsheet… we’ve calculated that this will bring back $14 million in additional revenue every year. We’re so excited that you’re doing the data warehouse initiative… We’ve been proposing this marketing idea for the last four years and haven’t been able to get it approved, and now we can finally get it done!

Me: I didn’t ask what you think the business initiatives should be; I asked you what they already are! (Ok, I really didn’t say it that way, but I wanted to.)

Why couldn’t they get the project approved? Who knows? Maybe the ROI was questionable. Maybe the idea wasn’t consistent with the company strategy and image. All that matters is that it was not approved, and hence makes for a lousy value proposition for a data and analytics program.

There is nothing wrong with proposing exciting, new “art of the possible” ways that data can bring value to the business. But an interesting proposal and an approved initiative are not the same thing. The difference is crucial, and data management leaders who don’t understand this difference are unlikely to be seen as trusted strategic advisors within their companies.

So what does it mean to be business aligned? Business alignment means being able to clearly state how deployment of data, analytics, and data management capabilities will directly support planned and approved (meaning funded) business initiatives.

So, the first step toward developing a successful data strategy is not to ask the end users what data they want. Instead, the first step is to simply find the top business initiatives. They are usually not hard to find. Very often, there are posters all over the place about these initiatives. There are a number of people in the organization you can check with to find top initiatives - the CIO, PMO leads, IT business liaisons, and contacts in the strategic planning department are examples of good places to start.

Then, you should examine the initiatives and determine the data and analytics that will be needed to make each initiative successful, especially looking for the same data needed by multiple projects across multiple initiatives. Core, enterprise data is usually needed by a diverse set of initiatives in slightly different form. For example, let’s say you work for a retailer and you identify approved projects for pricing optimization, labor planning, and marketing attribution. You can make a case that you will deploy the sales and product data these applications need, in the condition needed, in the time frame needed.

Proceeding further, you can propose and champion a series of projects that deliver the data needed by various initiatives. By doing this, along with establishing architecture and design principles of scalability and extensibility, you harness the energy of high-priority projects (instead of running away from it) to make your business case, add value by supporting the value of pre-vetted initiatives, and also build a foundation of integrated and trusted data step by step, project by project. Once this plan is established and in motion, you can accurately state that your program is absolutely needed by the business and you are also deploying data the right way – and you can also say that your program is officially business aligned.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries.

 

In the Star Trek movies, “the Borg” refers to an alien race that conquers all planets, absorbing the people, technology, and resources into the Borg collective. Even Captain Picard becomes a Borg and chants “We are the Borg. You will be assimilated. Resistance is futile.”

It strikes me that the relational database has behaved similarly since its birth. Over the last thirty years, Teradata and other RDBMS vendors have innovated and modernized, constantly revitalizing what it means to be an RDBMS. But some innovations come from start-up companies that are later assimilated into the RDBMS. And some innovations are reactions to competition. Regardless, many innovations eventually end up in the code base of multiple RDBMS vendor products --with proper respect to patents of course. Here are some examples of cool technologies assimilated into Teradata Database:

• MOLAP cubes storm the market in the late 1990s with Essbase setting the pace and Cognos inventing desktop cubes. MicroStrategy and Teradata team up to build push-down ROLAP SQL into the database for parallel speed. Hyperion Essbase and Teradata also did Hybrid OLAP integration together. Essbase gets acquired, MOLAP cubes fall out of fashion, and in-database ROLAP goes on to provide the best of both worlds as CPUs get faster.

• Early in the 2000s, a startup called Sunopsis shows a distinct advantage of running ELT transformations in-database to get parallel performance with Teradata. ELT takes off in the industry like a rocket. Teradata Labs also collaborates with Informatica to push-down PowerCenter transformation logic into SQL for amazing extract, load, and transform speed. Sunopsis gets acquired. More ETL vendors adopt ELT techniques. Happy DBAs and operations managers meet their nightly batch performance goals. More startups disappear.

• XML and XQuery become the rage in the press -- until most every RDBMS adds a data type for XML -- plus shred and unshred operators. XML-only database startups are marginalized.

• The uptick of predictive analytics in the market drives collaboration between Teradata and SAS back in 2007. SAS Procs are pushed down into the database to run massively parallel, opening up tremendous performance benefits for SAS users. This leads many RDBMS vendors to copy the technique; SAS is in the limelight, and eventually even Hadoop programmers want to run SAS in parallel. Later we see “R,” Fuzzy Logix, and others run in-database too. Sounds like the proverbial win-win to me.

• In-memory technology from QlikView and TIBCO Spotfire excites the market with order-of-magnitude performance gains. Several RDBMS vendors then adopt in-memory concepts. But in-memory has limitations on memory size and cost vis-à-vis terabytes of data. Consequently, Teradata introduces Teradata Intelligent Memory, which caches hot data automatically in memory while managing many terabytes of hot and cold data on disk. Two to three percent of the hottest data is managed by data temperature (aka popularity with users), delivering superfast response time. Cool! Or is it hot?

• After reading the Google research paper on MapReduce, a startup called “AsterData” invents SQL-MapReduce (SQL-MR) to add flexible processing to a flexible database engine. This cool innovation causes Teradata to acquire AsterData. Within a year, Aster strikes a nerve across the industry – MapReduce is in-database! This month, Aster earns numerous #1 scores in Ovum’s “Decision Matrix: Selecting an Analytic Database 2013-14” Jan 2014. The race is on for MapReduce in-database!

• The NoSQL community grabs headlines with their unique designs and reliance on JSON data and key-value pairs. MongoDB is hot, using JSON data, while Couchbase and Cassandra leverage key-value stores. Teradata promptly decides to add JSON data (semi-structured data) to the database and goes the extra mile to put JSONPath syntax into SQL. Teradata also adds the name-value-pair SQL operator (NVP) to extract JSON or key-value data from weblogs. Schema-on-read technology gets assimilated into the Teradata Database. Java programmers are pleased. Customers make plans. More wins.

--------------------------------------------------------------------------------------------------------

“One trend to watch going forward, in addition to the rise of multi-model NoSQL databases, is the integration of NoSQL concepts into relational databases. One of the methods used in the past by relational database vendors to restrict the adoption of new databases to handle new data formats has been to embrace those formats within the relational database. Two prime examples would be support for XML and object-oriented programming.”
- Matt Aslett, The 451 Group, Next-Generation Operational Databases 2012-2016, Sep 17, 2013

--------------------------------------------------------------------------------------------------------

I’ve had conversations with other industry analysts and they’ve confirmed Matt’s opinion: RDBMS vendors will respond to market trends, innovations, and competitive threats by integrating those technologies into their offering. Unlike the Borg, a lot of these assimilations by RDBMS are friendly collaborations (MicroStrategy, Informatica, SAS, Fuzzy Logix, Revolution R, etc.). Others are just the recognition of new data types that need to be in the database (JSON, XML, BLOBs, geospatial, etc.).

Why is it good to have all these innovations inside the major RDBMS’s? Everyone is having fun right now with their science projects because hype is very high for this startup or that startup or this shiny new thing. But when it comes time to deploy production analytic applications to hundreds or thousands of users, all the “ities” become critical all of a sudden – “ities” that the new kids don’t have and the RDBMS does. “ities” like reliability, recoverability, security, and availability. Companies like Google can bury shiny new 1.oh-my-god quality software in an army of brilliant computer scientists. But Main Street and Wall Street companies cannot.

More important, many people are doing new multi-structured data projects in isolation -- such as weblog analysis, sensor data, graph analysis, or social text analysis. Soon enough they discover the highest value comes from combining that data with all the rest of the data that the organization has collected on customers, inventories, campaigns, financials, etc. Great, I found a new segment of buyer preferences. What does that mean to campaigns, sales, and inventory? Integrating new big data into an RDBMS is a huge win going forward – much better than keeping the different data sets isolated in the basement.

Like this year’s new BMW or Lexus, RDBMS’s modernize, they define modern. But relational database systems don’t grow old, they don’t rust or wear out. RDBMS’s evolve to stay current and constantly introduce new technology.

We are the RDBMS! Technology will be assimilated. Resistance is futile.

Evaluating and Planning for the Real Costs of Big Data

Posted on: January 16th, 2014 by Dan Graham

 

In a blog I posted in early December, I talked about the total cost of big data. That post, and today’s follow-up post, stem from a webinar that I moderated between Richard Winter, President of Wintercorp, specializing in massive databases, and Bob Page, VP of Products at Hortonworks. During the webinar we discussed how to successfully calibrate and calculate the total cost of data and walked through important lessons related to the costs around running workloads on various platforms including Hadoop. If you haven’t listened to the webinar yet, I recommend you do so.

From the discussion we had during that session and from resulting conversations I have had since, I wanted to address some of the key takeaways we discussed about how to be successful when tackling such a large challenge within your organization. Here are a few key points to consider:

1. Start Small: As Bob Page said, “It’s very easy to dream big and go overboard with these projects, but the key to success is starting small.” Have your first project be a straightforward proof of concept. There are undoubtedly going to be challenges when you are starting your first big data project, but if you can start at a smaller level and build your knowledge and capabilities, your odds of success for the larger projects improve. Don’t make your first venture out of the gate an attempt at a gargantuan project or a huge amount of data. When you have some positive results, you will also have the confidence and sanction to build bigger solutions.

2. Address the Entire Scope of Costs: Rather than focusing only on upfront purchase costs, a total cost of data evaluation must incorporate all possible costs, reflecting an estimate of owning and using the data over time for analytic purposes. The framework that Richard developed allows you to do exactly that: it estimates the total cost of a big data initiative. During the webinar, Richard discussed the five components of system costs:

  • the hardware acquisition costs
  • the software acquisition costs
  • what you pay for support
  • what you pay for upgrades
  • and what you pay for environmental/infrastructure costs – power and cooling.

According to Richard, we need to estimate the CAPEX and OPEX over five years.  Based on his extensive experience, he also recommends a moderate annual growth assumption of 26 percent in the system capacity. In my experience, most data warehouses double in size every 3 years so Richard is being conservative. Thus the business goal coupled with the CAPEX and OPEX thresholds year by year helps keep the team focused.  For many technical people, the TCOD planning seems like a burden, but it’s actually a career saver. If you are able to control the scope at a relatively low level and can leverage a tool - such as Richard’s framework – you have a higher chance of being successful.
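
As a toy illustration of that estimate, the sketch below compounds capacity at 26 percent a year and sums CAPEX on newly added capacity plus OPEX on everything installed. The unit costs are made up purely to show the shape of the calculation; the real TCOD framework tracks the roughly 60 separate costs mentioned below:

```python
def five_year_tco(capex_per_tb, opex_per_tb_per_year, start_tb,
                  growth=0.26, years=5):
    """Sum capital and operating costs as capacity grows at a fixed annual rate."""
    total, capacity, installed_tb = 0.0, start_tb, 0.0
    for _ in range(years):
        added_tb = capacity - installed_tb        # buy only the newly needed capacity
        total += added_tb * capex_per_tb          # CAPEX for this year's growth
        total += capacity * opex_per_tb_per_year  # OPEX on everything installed
        installed_tb = capacity
        capacity *= 1 + growth                    # the 26 percent growth assumption
    return total

# Entirely made-up unit costs, just to exercise the shape of the estimate.
print(f"${five_year_tco(capex_per_tb=3_800, opex_per_tb_per_year=1_000, start_tb=100):,.0f}")
```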

3. Comparison Shop: Executives want to know the total cost of carrying out a large project, whether it is on a data warehouse or Hadoop. Having the ability to compare overall costs between the two systems is important to the overall internal success of the project, and to the success of future projects being evaluated as well. Before you can compare anything, it is important to identify a real workload that your business and the executive team can consider funding. A real workload focuses the comparisons, as opposed to generalizations and guesses. At some point a big data platform selection will generate two analyses you need to work through: 1) what is this workload costing? and 2) which platform can technically accomplish the goals more easily? Lastly, in a perfect world, the business users should also be able to showcase the business value of the workload.

4. Align Your Stakeholders: Many believe that 60 percent of the work in a project should be in the planning and 40 percent in the execution. In order to evaluate your big data project appropriately, you must incorporate as many variables as possible. It’s the surprises, and the stakeholders who weren’t aligned, that cause a lot of the big cost overruns. Knowing your assets and stakeholders is key to succeeding, which is why we recommend using the TCOD framework to get stakeholders to weigh in and achieve alignment on the overall plan. Next, leverage the results as a project plan that you can use toward achieving ROI. By leveraging a framework such as the one that Richard discusses during the webinar, with each assumption, each formula and each of the costs exposed (in Richard’s framework there are 60 different costs outlined!), you can identify much more easily where the costs differ and – more importantly – why. The TCOD framework can bring stakeholders into the decision-making process, forming a committed team instead of bystanders and skeptics.

5. Focus on Data Management: One of the things that both of our esteemed webinar guests pointed out is the importance of the number of people and applications accessing big data simultaneously. Data is typically the life-blood of the organization. This includes accessing live information about what is happening now, as well as accurate reporting at the end of the day, month, and quarter. There is a wide spectrum of use cases and each is being used across a wide variety of data types. If you haven’t actually built a 100-terabyte database or distributed file system before, be ready for some painful “character building” surprises. Be ready again at 500TB, at a petabyte, and 5 petabytes. Big data volumes are like the difference between a short weekend hike and making it past base camp on Mount Everest.  Your data management skills will be tested.

During the webinar, our experts all agreed: there is a peaceful coexistence that can happen between Hadoop and the data warehouse. They should be applied to the right workloads and share data as often as possible. When a workload is defined, it becomes clear that some data belongs in the data warehouse while other types of data may be more appropriate in Hadoop. Once you have put your data into its enterprise residence, each will feed their various applications.

In conclusion, being able to leverage a framework, such as the TCOD one that was discussed during the webinar, really lends itself to having a solid plan when approaching your big data challenges and to ultimately solving them.

Here are some additional resources for further information:

Total Cost of Data Webinar

Big Data—What Does It Really Cost? (white paper)

The Real Cost of Big Data (Spreadsheet)

TCOD presentation slides (PDF)