
Optimization in Data Modeling 1 – Primary Index Selection

Posted on: July 14th, 2015 by Guest Blogger

 

In my last blog I spoke about the decisions that must be made when transforming an Industry Data Model (iDM) from Logical Data Model (LDM) to an implementable Physical Data Model (PDM). However, being able to generate DDL (Data Definition Language) that will run on a Teradata platform is not enough – you also want it to perform well. While it is possible to generate DDL almost immediately from a Teradata iDM, each customer’s needs mandate that existing structures be reviewed against data and access demographics, so that optimal performance can be achieved.

Having detailed data and access path demographics during PDM design is critical to achieving great performance immediately; otherwise it’s simply guesswork. Alas, these are almost never available at the beginning of an installation, but that doesn’t mean you can’t make “excellent guesses.”

The single most influential factor in achieving PDM performance is proper Primary Index (PI) selection for warehouse tables. Data modelers focus on entity/table Primary Keys (PKs), since the PK is what defines uniqueness at the row level. Because of this, many physical modelers implement the PK as a Unique Primary Index (UPI) on each table by default. But one of the keys to Teradata’s great performance is that it uses the PI to physically distribute a table’s data across the entire platform to optimize parallelism. Each processor gets a piece of the table based on the PI, so rows from different tables with the same PI value are co-resident and do not need to be moved when the two tables are joined.
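To make the distribution mechanics concrete, here is a minimal Python sketch of hash distribution (the AMP count and hash function are illustrative assumptions, not Teradata’s proprietary row hash): because the AMP is a pure function of the PI value, rows from two tables that share a PI value land on the same AMP and can be joined without data movement.

```python
import hashlib
from collections import defaultdict

NUM_AMPS = 8  # illustrative; a production system has far more

def amp_for(pi_value):
    """Map a PI value to an AMP the way a hash-distributed system does.
    (Illustrative hash only -- not Teradata's proprietary row hash.)"""
    return int(hashlib.md5(str(pi_value).encode()).hexdigest(), 16) % NUM_AMPS

# Two hypothetical tables that both use Party_Id as their PI:
customers = [{"party_id": p, "segment": "retail"} for p in range(1, 21)]
orders    = [{"party_id": (o % 20) + 1, "order_id": 1000 + o} for o in range(60)]

# Rows from BOTH tables with the same Party_Id hash to the same AMP,
# so a join on Party_Id is AMP-local -- no rows need to move.
placement = defaultdict(lambda: {"customers": 0, "orders": 0})
for c in customers:
    placement[amp_for(c["party_id"])]["customers"] += 1
for o in orders:
    placement[amp_for(o["party_id"])]["orders"] += 1

for amp in sorted(placement):
    print(f"AMP {amp}: {placement[amp]}")
```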

In a Third Normal Form (3NF) model, no two entities (outside of super/subtypes and rare exceptions) have the same PK. If the PK is chosen as the PI, it follows that no two tables share a PI, and every join will require data from at least one table to be moved before the join can be completed – not a solid performance decision, to say the least.

The iDMs have preselected PIs based largely on identifiers common across subject areas (e.g., Party Id), so that all information for that identifier is co-resident and joins are AMP-local. These non-unique PIs (NUPIs) are a great starting point for your PDM, but they again need to be evaluated against customer data and access plans to ensure that both performance and reasonably even data distribution are achieved.

Even data distribution across the Teradata platform is important, since skewed data contributes both to poor performance and to space allocation problems (run out of space on one AMP and you effectively run out of space on all). However, it can be overemphasized to the detriment of performance.

Say, for example, a table has a PI of PRODUCT_ID, and a disproportionate number of rows for several products causes skewed distribution. Altering the PI to the table’s PK instead will provide perfectly even distribution. But remember: when joining to that table, if all elements of the PK are not available, the rows of the table will need to be redistributed, most likely by PRODUCT_ID.

That puts them back on the same AMPs they occupied in the skewed scenario. This time, instead of a “rest state” skew, the rows skew during redistribution, and that happens every time the table is joined to – not a solid performance decision. Optimum performance can therefore be achieved with sub-optimum distribution.
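That trade-off can be quantified before committing to a PI. The sketch below compares how evenly two candidate PIs spread the same hypothetical rows across AMPs (the data, hash, and skew formula are illustrative assumptions, not a Teradata utility):

```python
import hashlib
import random
from collections import Counter

NUM_AMPS = 8
random.seed(42)

def amp_for(value):
    # Simplified stand-in for the platform's row hash.
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16) % NUM_AMPS

# Hypothetical order-line rows: a handful of products dominate the volume.
rows = [{"order_id": i,
         "product_id": random.choices(["P1", "P2", "P3", "P4", "P5"],
                                       weights=[60, 20, 10, 5, 5])[0]}
        for i in range(10_000)]

def skew_report(pi_columns):
    per_amp = Counter(amp_for(tuple(r[c] for c in pi_columns)) for r in rows)
    counts = [per_amp.get(a, 0) for a in range(NUM_AMPS)]
    avg = sum(counts) / NUM_AMPS
    skew = (max(counts) - avg) / avg * 100  # % above average on the busiest AMP
    print(f"PI on {pi_columns}: rows per AMP {counts}, skew {skew:.0f}%")

skew_report(["product_id"])   # uneven: "rest state" skew
skew_report(["order_id"])     # near-perfect distribution, but joins on
                              # product_id will redistribute the rows anyway
```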

iDM tables relating two common identifiers will usually have one of the IDs preselected as the NUPI. In some installations the access demographics will show that the other ID is the better choice. If so, change it! Or it may leave you with no clear choice, in which case picking one is almost assuredly better than changing the PI to a composite index consisting of both IDs, as that only results in a table that is no longer co-resident with any table indexed by either ID alone.

There are many other factors that contribute to achieving optimal performance of your physical model, but they all pale in comparison to a well-chosen PI. In my next blog we’ll look at some more of these and discuss when and how best to implement them.


Jake Kurdsjuk is Product Manager for the Teradata Communications Industry Data Model, purchased by more than one hundred Communications Service Providers worldwide. Jake has been with Teradata since 2001 and has 25 years of experience working with Teradata within the Communications Industry, as a programmer, DBA, Data Architect and Modeler.

Why We Love Presto

Posted on: June 24th, 2015 by Daniel Abadi

 

Concurrent with acquiring Hadoop companies Hadapt and Revelytix last year, Teradata opened the Teradata Center for Hadoop in Boston. Teradata recently announced that a major new initiative of this Hadoop development center will include open-source contributions to a distributed SQL query engine called Presto. Presto was originally developed at Facebook, and is designed to run high performance, interactive queries against Big Data wherever it may live --- Hadoop, Cassandra, or traditional relational database systems.

Those who will be part of this initiative and contributing code to Presto include a subset of the Hadapt team that joined Teradata last year. In the following, we will dive deeper into the thinking behind this new initiative from the perspective of the Hadapt team. It is important to note upfront that Teradata’s interest in Presto, and the people contributing to the Presto codebase, extend beyond the Hadapt team that joined Teradata last year. Nonetheless, it is worthwhile to understand the technical reasoning behind Teradata’s embrace of Presto, even if it presents a localized view of the overall initiative.

Around seven years ago, Ashish Thusoo and his team at Facebook built the first SQL layer over Hadoop as part of a project called Hive. At its essence, Hive was a query translation layer over Hadoop: it received queries in a SQL-like language called Hive-QL, and transformed them into a set of MapReduce jobs over data stored in HDFS on a Hadoop cluster. Hive was truly the first project of its kind. However, since its focus was on query translation into the existing MapReduce query execution engine of Hadoop, it achieved tremendous scalability, but poor efficiency and performance, and ultimately led to a series of subsequent SQL-on-Hadoop solutions that claimed 100X speed-ups over Hive.

Hadapt was the first such SQL-on-Hadoop solution that claimed a 100X speed-up over Hive on certain types of queries. Hadapt was spun out of the HadoopDB research project from my team at Yale and was founded by a group of Yale graduates. The basic idea was to develop a hybrid system that is able to achieve the fault-tolerant scalability of the Hive MapReduce query execution engine while leveraging techniques from the parallel database system community to achieve high performance query processing.

The intention of HadoopDB/Hadapt was never to build its own query execution layer. The first version of Hadapt used a combination of PostgreSQL and MapReduce for distributed query execution. In particular, the query operators that could be run locally, without reliance on data located on other nodes in the cluster, were run using PostgreSQL’s query operator set (although Hadapt was written such that PostgreSQL could be replaced by any performant single-node database system). Meanwhile, query operators that required data exchange between multiple nodes in the cluster were run using Hadoop’s MapReduce engine.

Although Hadapt was 100X faster than Hive for long, complicated queries involving hundreds of nodes, its reliance on Hadoop MapReduce for parts of query execution precluded sub-second response time for small, simple queries. Therefore, in 2012, Hadapt started to build a secondary query execution engine called “IQ” intended for smaller queries. The idea was that all queries would be fed through a query-analyzer layer before execution. If the query was predicted to be long and complex, it would be fed through Hadapt’s original fault-tolerant MapReduce-based engine. However, if the query was predicted to complete in a few seconds or less, it would be fed to the IQ execution engine.
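The routing idea is straightforward to express in code. The following is an assumed sketch of the general shape of such a query analyzer – the threshold, cost model, and engine names are hypothetical, since Hadapt’s actual implementation is not public:

```python
# Hypothetical sketch of a "route by predicted runtime" dispatcher.

INTERACTIVE_THRESHOLD_SECS = 5  # assumed cutoff, purely illustrative

def estimate_runtime_secs(query_plan):
    """Toy cost model: pretend runtime scales with scanned bytes and join count."""
    scan_cost = query_plan["scanned_gb"] * 0.5
    join_cost = query_plan["num_joins"] * 2.0
    return scan_cost + join_cost

def route(query_plan):
    if estimate_runtime_secs(query_plan) <= INTERACTIVE_THRESHOLD_SECS:
        return "interactive-engine"      # short queries: IQ-style engine
    return "fault-tolerant-mapreduce"    # long queries keep mid-query fault tolerance

print(route({"scanned_gb": 2, "num_joins": 1}))    # -> interactive-engine
print(route({"scanned_gb": 400, "num_joins": 5}))  # -> fault-tolerant-mapreduce
```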

In 2013 Hadapt integrated IQ with Apache Tez in order to avoid redundant programming effort, since the primary goals of IQ and Tez were aligned. In particular, Tez was designed as an alternative to MapReduce that can achieve interactive performance for general data processing applications. Indeed, Hadapt was able to achieve interactive performance on a much wider range of queries when leveraging Tez than it had been able to previously.

Figure 1: Intertwined Histories of SQL-on-Hadoop Technology

Unfortunately, Tez was not quite a perfect fit as a query execution engine for Hadapt’s needs. The largest issue was that before shipping data over the network during distributed operators, Tez first writes this data to local disk. The overhead of writing this data to disk (especially when the intermediate result set was large) precluded interactivity for a non-trivial subset of Hadapt’s query workload. A second problem was that the Hive query operators implemented over Tez use (by default) traditional Volcano-style row-by-row iteration. In other words, a single function invocation for a query operator processes just a single database record. This results in a large number of function calls to process a large dataset, and poor instruction cache locality, as the instructions associated with a particular operator are repeatedly reloaded into the instruction cache for each function invocation. Although Hive and Tez have started to alleviate this issue with the recent introduction of vectorized operators, Hadapt still found that query plans involving joins or SQL functions would fall back to row-by-row iteration.
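The cost of row-by-row iteration is easy to see in a toy example. The sketch below (illustrative only; real engines operate on columnar batches with far more machinery) counts function invocations for a simple filter under Volcano-style and vectorized execution:

```python
# Toy comparison of Volcano-style (one call per row) vs. vectorized
# (one call per batch) execution for a filter: price > 100.

rows = [{"product": f"p{i}", "price": i % 500} for i in range(100_000)]
calls = {"volcano": 0, "vectorized": 0}

def volcano_filter(row):
    calls["volcano"] += 1          # one function invocation per record
    return row["price"] > 100

def vectorized_filter(batch):
    calls["vectorized"] += 1       # one invocation per batch of records
    return [r for r in batch if r["price"] > 100]

result_a = [r for r in rows if volcano_filter(r)]

BATCH = 1024
result_b = []
for i in range(0, len(rows), BATCH):
    result_b.extend(vectorized_filter(rows[i:i + BATCH]))

assert len(result_a) == len(result_b)
print(calls)  # ~100,000 calls vs ~98 -- fewer calls, better cache locality
```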

The Hadapt team therefore decided to refocus its query execution strategy (for the interactive part of Hadapt’s engine) around Presto, which presented several advantages over Tez. First, Presto pipelines data between distributed query operators directly, without writing to local disk, significantly improving performance for network-intensive queries. Second, Presto query operators are vectorized by default, thereby improving CPU efficiency and instruction cache locality. Third, Presto dynamically compiles selective query operators to byte code, which lets the JVM optimize and generate native machine code. Fourth, it uses direct memory management, thereby avoiding Java object allocation, its attendant heap overhead, and garbage collection pauses. Overall, Presto is a very advanced piece of software, and very much in line with Hadapt’s goal of leveraging as many techniques from modern parallel database system architecture as possible.

The Teradata Center for Hadoop has thus fully embraced Presto as the core part of its technology strategy for the execution of interactive queries over Hadoop. Consequently, it made logical sense for Teradata to take its involvement in Presto to the next level. Furthermore, Hadoop is fundamentally an open source project, and in order to become a significant player in the Hadoop ecosystem, Teradata needs to contribute meaningful and important code to the open source community. Teradata’s recent acquisition of Think Big serves as further motivation for such contributions.

Therefore, Teradata has announced that it is committed to making open source contributions to Presto and has allocated substantial resources to doing so. Presto is already used by Silicon Valley stalwarts Facebook, Airbnb, Netflix, Dropbox, and Groupon. However, Presto’s enterprise adoption outside of Silicon Valley remains small. Part of the reason for this is that the ease-of-use and enterprise features typically associated with modern commercial database systems are not fully available with Presto. Missing are an out-of-the-box, simple-to-use installer; database monitoring and administration tools; and third-party integrations. Therefore, Teradata’s initial contributions will focus on these areas, with the goal of bridging the gap to getting Presto widely deployed in traditional enterprise applications. This will hopefully lead to more contributors and momentum for Presto.

For now, Teradata’s new commitments to open source contributions in the Hadoop ecosystem are focused on Presto. Teradata is only committing to contribute a small amount of Hadapt code to open source --- in particular those parts that will further the immediate goal of transforming Presto into an enterprise-ready, easy-to-deploy piece of software. However, Teradata plans to monitor Presto’s progress and the impact of Teradata contributions. Teradata may ultimately decide to contribute more parts of Hadapt to the Hadoop open source community. At this point it is too early to speculate how this will play out.

Nonetheless, Teradata’s commitment to Presto and its commitment to making meaningful contributions to an open source project is an exciting development. It will likely have a significant impact on enterprise-adoption of Presto. Hopefully, Presto will become a widely used open source parallel query execution engine --- not just within the Hadoop community, but due to the generality of its design and its storage layer agnosticism, for relational data stored anywhere.

==================================================================================

Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

 

I recently participated in a business analytics project for non-profits that, as the planning progressed, seemed like a perfect opportunity to implement an agile approach – except that the work was to be completed in two days! Still, all the developers would be co-located. We had three objectives that fit the profile of user stories: we would cleanse, analyze, and report on the data and, hopefully, discover some insights. And we would have the business stakeholders in the room with us the whole time. But doing all this in two days seemed like agile on steroids to me. It reminded me of an old Steven Wright joke: “I put instant coffee in the microwave and almost went back in time!”

So, if you put agile on steroids, can you go back in time? Well, maybe not, but we did accomplish a lot in those two days! The project was a DataDive, a collaboration between the non-profit DataKind and Teradata, held during the two days before the Teradata Partners 2014 conference.

I was a Data Ambassador, paired with another Data Ambassador, working with a non-governmental organization (NGO) to prepare for the DataDive and make sure we reached our goals. The NGO that DataKind assigned us to was iCouldBe, an organization that provides online mentoring to at-risk kids at over 250 schools in the U.S. Since I am not a data scientist or analyst, my role was gathering requirements from the business stakeholders at iCouldBe. I worked with them to prioritize the requirements and identify the expected business value. Sounds like the product owner role in Scrum -- right? My partner Data Ambassador worked with the head of IT at iCouldBe to identify the data we needed and to prepare it for the DataDive. This is similar to a Scrum project, where preparatory work must be completed to be ready for the first sprint.

DataKind wanted us to identify the tasks to accomplish each user story, so I immediately thought about using a task board for the actual DataDive. I created one ahead of time in Excel that identified the tasks for each user story as well as the development and handoff phases for each story. I didn’t realize it at the time, but I was creating a Kanban board (a portion of the board is shown in the picture) that allowed us to track workflow.

Once I got to the DataDive, I recreated the Kanban board using flip chart paper and sticky notes for the tasks, much the way it might be done for a real project. The user stories were listed in priority order from top to bottom. The tasks represented the metrics, dimensions, text and other analysis required to address the user stories. Some tasks supported multiple user stories, so we noted those and used that reuse to help prioritize. We placed these reusable tasks at the top of the board, in the swimlane with the highest-priority user story. (DataDive Kanban Board – Partial Workflow)

 

For example, the number of posts and words per post that mentors and mentees made in the online mentoring program was an important metric that iCouldBe wanted to calculate to help identify successful mentee completion of the program. Are mentees that write more posts and words per post more likely to complete the program? This question addresses the first user story. But number of posts and words per post can also be used to analyze the amount of engagement between mentors and their mentees and what areas of the curriculum need to be improved.
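As a rough sketch of how such an engagement metric might be computed (hypothetical records and field names, not iCouldBe’s actual data):

```python
from collections import defaultdict

# Hypothetical post records: (mentee_id, post_text, completed_program)
posts = [
    (1, "I met my mentor today and we talked about college", True),
    (1, "Working on my essay", True),
    (2, "hi", False),
    (3, "We discussed careers in engineering and what classes to take", True),
    (3, "Mentor suggested a summer program", True),
    (3, "Applied!", True),
]

stats = defaultdict(lambda: {"posts": 0, "words": 0, "completed": False})
for mentee_id, text, completed in posts:
    s = stats[mentee_id]
    s["posts"] += 1
    s["words"] += len(text.split())
    s["completed"] = completed

# Posts and average words per post, alongside program completion.
for mentee_id, s in sorted(stats.items()):
    avg_words = s["words"] / s["posts"]
    print(f"mentee {mentee_id}: {s['posts']} posts, "
          f"{avg_words:.1f} words/post, completed={s['completed']}")
```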

As the volunteers arrived, they chose tasks, focusing on the high priority tasks first, wrote their name on the sticky notes, and moved the note to the first development column, which was to review the available data.

At different times during the day, DataKind asked each team to review what they had done so far, and what they planned on doing next, similar to the daily standup in Scrum (and we actually did stand).

As the DataDive progressed to day two, only tasks for user stories 1 and 2 progressed across the board, but I reminded the team that some of the tasks we completed for the first two user stories also helped address the third user story. At the end of the DataDive, to better visually show this, I moved some of the sticky notes from user story 1 into the user story 3 swimlane. This way, we could show the business stakeholders from iCouldBe that, although we focused on the higher priority user stories 1 and 2, we had also partially addressed user story 3.

Although this project did not check all the boxes of a standard agile implementation, it was a great opportunity for me to put some agile practices in motion on a real project and learn from them. One of the most important aspects was the close collaboration between the developers and stakeholders. It was great to see how thrilled the stakeholders were with the work we had accomplished in just two days!

While I wish I could go back in time and do the DataDive all over again, as it was a great personal experience for me, instead I’ll look to the future and apply what I’ve learned from this project to my next agile project.

Elisia Getts is a Sr. Product Manager, Certified Scrum Master (CSM), and member of the Teradata Agile COE. She has been with Teradata for 15 years and has over 25 years of experience in IT as a product manager, business/IT consultant, programmer/analyst, and technical writer supporting industries such as travel and hospitality, transportation and logistics, and defense. She is the team’s expert on Scrum.

Your Big Data Initiative May Not Require Logical Modeling

Posted on: May 12th, 2015 by Guest Blogger

 

By: Don Tonner

Logical Modeling may not be required on your next big data initiative.  From experience, I know when building things from scratch that a model reduces development costs, improves quality, and gets me to market quicker.  So why would I say you may not require logical modeling?

Most data modelers are employed in forward engineering activities in which the ultimate goal is to create a database or an application used by companies to manage their businesses.  The process is generally:

  • Obtain an understanding of the business concepts that the database will serve.
  • Organize the business information into structured data components and constraints—a logical model.
  • Create data stores based on the logical model and let the data population and manipulation begin.

Forward engineering is the act of going from requirements to a finished product. For databases, that means starting with a detailed understanding of the information of the business, which is found largely in the minds and practices of the employees of the enterprise. This detailed understanding may be thought of as a conceptual model. Various methods have evolved to document this conceptual richness; one example is the Object Role Model.

The conceptual model (the detailed understanding of the enterprise; not to be confused with a high-level conceptual E/R diagram) is transformed into a logical data model, which organizes data into structures upon which relational algebra may be performed. The thinking here is very mathematical: data can be manipulated mathematically the same way we can manipulate anything else mathematically. Just as you might write an equation that expresses how much material a 3D printer needs to create a lamp, you can write an equation to show the difference between the employee populations of two corporate regions.
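As a trivial illustration of that last point (hypothetical data; the regions and counts are invented), the employee-population comparison amounts to a small expression over sets:

```python
# Hypothetical employee rows: (employee_id, region)
employees = [(1, "East"), (2, "East"), (3, "West"), (4, "East"), (5, "West")]

east = {e for e, r in employees if r == "East"}
west = {e for e, r in employees if r == "West"}

# The "equation": difference between the two regions' employee populations.
population_difference = len(east) - len(west)
print(population_difference)  # 1
```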

The image that most of us have of a data model is not equations, variables or valid operations, but is the visual representation of the structures that represent the variables. Below you can see structures as well as relationships which are a kind of constraint.

Data Structures and Relationships

Ultimately these structures and constraints will be converted into data stores, such as tables, columns, indexes and data types, which will be populated with data that may be constrained by some business rules.

Massively parallel data storage architectures are becoming increasingly popular as they address the challenges of storing and manipulating almost unimaginable amounts of data. The ability to ingest data quickly is critical as volumes increase. One approach is to receive the data without prior verification of its structure; HDFS files and JSON data types are examples of storage that does not require knowledge of the structure prior to loading.

OK, imagine a project where millions of readings from hundreds of sensors from scores of machines are collected every shift, possibly into a data lake. Engineers discover that certain analytics performed on the machine data can potentially alert us to conditions that may warrant operator intervention. Data scientists will create several analytic metrics based on hourly aggregates of the sensor data. What’s the modeler’s role in all this?

The models you are going to use on your big data initiative likely already exist.  All you have to do is find them.

One approach would be to reverse engineer a model of the structures of the big data, which can provide visual clues to the meaning of the data. Keep in mind that big data sources may have rapidly changing schemas, so reverse engineering may have to occur periodically on the same source to gather potential new attributes. Also remember that a database of any kind is an imperfect representation of the logical model, which is itself an imperfect representation of the business, so much interpretation is required to go from the reverse-engineered model to a business understanding of the data.
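A minimal sketch of what that reverse engineering might look like against schema-less data (hypothetical sensor JSON; a real effort would also profile value distributions, keys, and relationships):

```python
import json
from collections import defaultdict

# Hypothetical sensor readings landed in a data lake with no declared schema.
raw = [
    '{"machine_id": "M1", "sensor": "temp", "value": 81.2, "ts": "2015-05-12T08:00"}',
    '{"machine_id": "M1", "sensor": "rpm", "value": 1750, "ts": "2015-05-12T08:00"}',
    '{"machine_id": "M2", "sensor": "temp", "value": 79.9, "operator": "J. Smith"}',
]

# Infer candidate attributes and their observed types from a sample of records.
observed = defaultdict(set)
for line in raw:
    for key, val in json.loads(line).items():
        observed[key].add(type(val).__name__)

for attribute, types in sorted(observed.items()):
    print(f"{attribute}: {sorted(types)}")
# Rerunning this periodically surfaces new attributes (e.g. "operator")
# as the source schema drifts.
```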

You would also review the enterprise data model or the forward-engineered data warehouse model. After all, while the big data analytic can point out which engines are experiencing conditions that need attention, matching those engine analytics to that day’s workload, the operator’s experience level, and the time since the last maintenance greatly expands the value of the analytic.

So how do you combine data from disparate platforms? A logical modeler in a forward engineering environment ensures that all the common things have the same identifiers and data types, and this is built into the system. That same skill set needs to be leveraged if there is going to be any success performing cross-platform analytics. The identifiers of the same things on the different platforms need to be cross-validated in order to make apples-to-apples comparisons. If analytics are going to be captured and stored in the existing Equipment Scores section of the warehouse, the data will need to be transformed to the appropriate identifiers and data types. If the data is going to be joined on the fly via Teradata QueryGrid™, knowledge of these IDs and data types is essential for success and performance.
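A stripped-down sketch of that conformance step (hypothetical identifiers and formats): before joining hourly sensor aggregates to warehouse equipment data, the equipment identifiers are reconciled to a common form and type.

```python
# Hypothetical: the data lake stores equipment ids as zero-padded strings,
# the warehouse stores them as integers. Joins fail until they are conformed.
lake_hourly_scores = [{"equip_id": "00042", "hour": "08:00", "score": 0.91}]
warehouse_equipment = [{"equipment_id": 42, "last_maintenance": "2015-04-30"}]

def conform_id(raw_id):
    """Normalize either representation to one join key (an int, here)."""
    return int(str(raw_id).lstrip("0") or "0")

by_id = {conform_id(w["equipment_id"]): w for w in warehouse_equipment}
for score in lake_hourly_scores:
    match = by_id.get(conform_id(score["equip_id"]))
    print({**score, **(match or {})})   # enriched, apples-to-apples record
```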

There are many other modern modeling challenges; let me know what has your attention.

Don Tonner is a member of the Architecture and Modeling Solutions team, and has worked on several cool projects such as Teradata Mapping Manager, the unification modules, and Solution Modeling Building Blocks. He is currently creating an Industry Dimensions development kit and working out how models might be useful when combining information from disparate platforms. You can also reach him on Twitter, @BigDataDon.

Making SAP data relevant in the world of big data

Posted on: May 4th, 2015 by Patrick Teunissen

 

Part one of a series about an old “SAP” dog who learns a new trick

Reflecting on the key messages from Teradata Universe 2015 in April, it was impossible to escape the theme of deriving differentiated business value by leveraging the latest data sources and analytic techniques. I heard from several customers how they improved their business by combining their traditional warehouse data (or ‘SAP data’ for us old dogs) with other data from across the enterprise and applying advanced analytic techniques. A special interest group dedicated a whole morning to exploring the value of integrating ‘SAP data’ with ‘other data’. As I sat through these sessions, I found it funny that companies that run SAP ERP always speak about their data in terms of SAP data and other data. It made me wonder: what is ‘other data’, and what makes it so special?

In most cases, ‘other data’ is referred to as ‘Big Data’. The term is quite ubiquitous and was used to describe just about every data source. But it’s important to note that throughout the sessions I attended, none of the companies referred to their SAP data as Big Data. Big Data was a term reserved for the (relatively) new sources of data, like machine-generated data from the Internet of Things, call center details, POS data, and social media/web logs.

Although not “big”, customers told me they consider their SAP ERP applications to be complex fortresses of data. Compared with traditional data warehouses or big data stores, SAP data is very difficult to extract and integrate with their ‘other data’. Even SAP acknowledges this shortcoming, as evidenced by their recent programs to ‘Simplify’ their complex applications. But I’d argue that while SAP ERPs may be complex to run, the data that is processed in these applications is quite simple. SAP experts would agree that if you know where to look, the data is both simple and reliable.

Unfortunately these experts live in a world of their own, focused entirely on data that flows through SAP. But as evidenced by the customers at Teradata Universe, the lion’s share of new IT projects and business initiatives are focused on leveraging ‘big data’. This means the folks who know SAP are rarely involved in the IT projects involving ‘big data’, and vice versa, which explains the chasm between SAP and ‘other data’. The ‘Big Data’ folks don’t understand the valuable context that SAP brings, and the ‘SAP data’ folks don’t understand the new insights that analytics on the ‘other data’ can deliver.

However, the tide is turning, and the general consensus has evolved to agree that there is value in bringing SAP data together with big data. SAP ERP is used primarily for managing the transactional processes in the financial, logistics, manufacturing, and administration functions. This means it houses high-quality master data, attribute data, and detailed facts about the business. Combining this structured and reliable data with multi-structured big data can add valuable confidence and context to the analytics that matter most to businesses today!

Here’s a recent example in which a customer integrated the results of advanced text analytics with their SAP ERP data within their Teradata warehouse. The data science team was experimenting with a number of Aster machine learning and natural language processing techniques to find meaning and insight in field technician reports. Using one of Aster’s text analytic methods, Latent Dirichlet Allocation, they were able to identify common related word groups within the reports that flagged quality events such as “broken again” or “running as expected”. They also discovered unexpected insight about equipment suppliers and third-party service providers hidden in the field reports, such as “Supplier XYZ is causing problems” or “ABC is easy to work with”. They were then able to integrate all of these relatable word groups with context from the SAP ERP purchasing history data stored in the warehouse to provide additional insight and enrichment to their supplier scores.
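The work described above was done with Aster’s text analytics. Purely to illustrate the technique itself, here is a small Latent Dirichlet Allocation sketch on invented field reports using scikit-learn (not the Aster implementation; the data and topic count are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy field-technician reports (invented for illustration).
reports = [
    "pump broken again vendor xyz parts late",
    "running as expected after routine maintenance",
    "supplier xyz is causing problems with seals",
    "abc is easy to work with replacement arrived early",
    "broken again second failure this month",
    "running as expected no issues found",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(reports)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()  # assumes scikit-learn >= 1.0
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {topic_idx}: {top_terms}")
# The resulting word groups (e.g. "broken again", "supplier xyz") can then be
# joined back to purchasing history on supplier keys held in the warehouse.
```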

 

 

High Level Data Analytics Graph
(Healthcare Example)


Michael Porter, in an excellent article in the November 2014 issue of the Harvard Business Review[1], points out that smart connected products are broadening competitive boundaries to encompass related products that meet a broader underlying need. Porter elaborates that the boundary shift is not only from the functionality of discrete products to cross-functionality of product systems, but in many cases expanding to a system of systems such as a smart home or smart city.

So what does all this mean from a data perspective? In that same article, Porter mentions that companies seeking leadership need to invest in capturing, coordinating, and analyzing more extensive data across multiple products and systems (including external information). The key take-away is that the movement of gaining competitive advantage by searching for cross-functional or cross-system insights from data is only going to accelerate and not slow down. Exploiting cross-functional or cross-system centrality of data better than anyone else will continue to remain critical to achieving a sustainable competitive advantage.

Understandably, as technology changes, the mechanisms and architecture used to exploit this cross-system centrality of data will evolve. Current technology trends point to a need for a data & analytic-centric approach that leverages the right tool for the right job and orchestrates these technologies to mask complexity for the end users; while also managing complexity for IT in a hybrid environment. (See this article published in Teradata Magazine.)

As businesses embrace the data & analytic-centric approach, the following types of questions will need to be addressed: How can business and IT decide on when to combine which data and to what degree? What should be the degree of data integration (tight, loose, non-coupled)? Where should the data reside and what is the best data modeling approach (full, partial, need based)? What type of analytics should be applied on what data?

Of course, to properly address these questions, an architecture assessment is called for. But for the sake of going beyond the obvious, one exploratory data point in addressing such questions could be to measure and analyze the cross-functional/cross-system centrality of data.

By treating data and analytics as a network of interconnected nodes in Gephi[2], the connectedness between data and analytics can be measured and visualized for such exploration. We can examine a statistical metric called Degree Centrality[3] which is calculated based on how well an analytic node is connected.
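The article uses Gephi for this exploration; the same Degree Centrality metric can also be computed programmatically. Below is a minimal sketch using NetworkX, with an invented healthcare-flavored graph standing in for the real one behind the figure:

```python
import networkx as nx

# Hypothetical graph: analytics (A:*) connected to the data subject areas (D:*)
# they consume.
edges = [
    ("A:readmission_risk", "D:claims"),
    ("A:readmission_risk", "D:clinical"),
    ("A:readmission_risk", "D:pharmacy"),
    ("A:fraud_detection", "D:claims"),
    ("A:fraud_detection", "D:provider"),
    ("A:care_gap_alerts", "D:clinical"),
]

g = nx.Graph(edges)
centrality = nx.degree_centrality(g)

# Analytics ranked by how many subject areas they span -- a rough signal of
# which analytics depend most on cross-functional data.
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    if node.startswith("A:"):
        print(f"{node}: degree centrality {score:.2f}")
```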

The high-level sample data analytics graph demonstrates the cross-functional Degree Centrality of analytics from an industry-specific perspective (healthcare). It also underscores, from an industry perspective, the need for organizations to build an analytical ecosystem that can easily harness this cross-functional Degree Centrality of data analytics. (Learn more about Teradata’s Unified Data Architecture.)

In the second part of this blog post series we will walk through a zoomed-in view of the graph, analyze the Degree Centrality measurements for sample analytics, and draw some high-level data architecture implications.

[1] https://hbr.org/2014/11/how-smart-connected-products-are-transforming-competition

[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).


Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in the planning, development, and delivery of analytics. He has experience across multiple industries and is passionate about nurturing a culture of innovation based on clarity, context, and collaboration.

 

It is well-known that there are two extreme alternatives for storing database tables on any storage media: storing it row-by-row (as done by traditional “row-store” technology) or storing it column-by-column (as done by recently popular “column-store” implementations). Row-stores store the entire first row of the table, followed by the entire second row of the table, etc. Column-stores store the entire first column of the table, followed by the entire second column of the table, etc. There have been huge amounts of research literature and commercial whitepapers that discuss the various advantages of these alternative approaches, along with various proposals for hybrid solutions (which I discussed in more detail in my previous post).

Despite the many conflicting arguments in favor of these different approaches, there is little question that column-stores compress data much better than row-stores. The reason is fairly intuitive: in a column-store, entire columns are stored contiguously --- in other words, a series of values from the same attribute domain are stored consecutively. In a row-store, values from different attribute domains are interspersed, thereby reducing the self-similarity of the data. In general, the more self-similarity (lower entropy) you have in a dataset, the more compressible it is. Hence, column-stores are more compressible than row-stores.

In general, compression rates are very sensitive to the particular dataset that is being compressed. Therefore it is impossible to make any kind of guarantees about how much a particular database system/compression algorithm will compress an arbitrary dataset. However, as a general rule of thumb, it is reasonable to expect around 8X compression if a column-store is used on many kinds of datasets. 8X compression means that the compressed dataset is 1/8th the original size, and scan-based queries over the dataset can thus proceed approximately 8 times as fast. This stellar compression and resulting performance improvements are a major contributor to the recent popularity of column-stores.

It is precisely this renowned compression of column-stores which makes the compression rate of RainStor (a recent Teradata acquisition) so impressive in comparison. RainStor claims a factor of 5 times more compression than what column-stores are able to achieve on the same datasets, and 40X compression overall.

Although the reason why column-stores compress data better than row-stores is fairly intuitive, the reason why RainStor can compress data better than column-stores is less intuitive. Therefore, we will now explain this in more detail.

Take for example the following table, which is a subset of a table describing orders from a particular retail enterprise that sells bicycles and related parts. (A real table would have many more rows and columns, but we keep this example simple so that it is easier to understand what is going on).

Record Order date Ship date Product Price
1 03/22/2015 03/23/2015 “bicycle” 300
2 03/22/2015 03/24/2015 “lock” 18
3 03/22/2015 03/24/2015 “tire” 70
4 03/22/2015 03/23/2015 “lock” 18
5 03/22/2015 03/24/2015 “bicycle” 250
6 03/22/2015 03/23/2015 “bicycle” 280
7 03/22/2015 03/23/2015 “tire” 70
8 03/22/2015 03/23/2015 “lock” 18
9 03/22/2015 03/24/2015 “bicycle” 280
10 03/23/2015 03/24/2015 “lock” 18
11 03/23/2015 03/25/2015 “bicycle” 300
12 03/23/2015 03/24/2015 “bicycle” 280
13 03/23/2015 03/24/2015 “tire” 70
14 03/23/2015 03/25/2015 “bicycle” 250
15 03/23/2015 03/25/2015 “bicycle” 280

 

The table contains 15 records and shows four attributes --- the order and ship dates of each purchase, the product that was purchased, and the purchase price. Note that there are relationships between some of these columns --- in particular, the ship date is usually one or two days after the order date, and the price of a given product is usually consistent across orders, with slight variations depending on what coupons the customer used to make the purchase.

A column-store would likely use “run-length encoding” to compress the order date column. Since records are sorted by order date, this compresses the column to near its minimum --- it can be stored as (03/22/2015, 9); (03/23/2015, 6) --- indicating that 03/22/2015 is repeated 9 straight times, followed by 03/23/2015 repeated 6 times. The ship date column, although not sorted, is still very compressible, as each value can be expressed using a small number of bits in terms of how much larger (or smaller) it is than the previous value in the column. The other two columns --- product and price --- would likely be compressed using a variant of dictionary compression, where each value is mapped to the minimal number of bits needed to represent it. For large datasets, where there are many unique values for price (or even for product), the number of bits needed to represent a dictionary entry is non-trivial, and the same dictionary entry is repeated in the compressed dataset for every repeated value in the original dataset.
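To make the column-store baseline concrete, here is a small sketch of run-length and dictionary encoding applied to the example columns (illustrative only; production column-stores use bit-packed and far more engineered variants):

```python
from itertools import groupby

order_dates = ["03/22/2015"] * 9 + ["03/23/2015"] * 6
products = ["bicycle", "lock", "tire", "lock", "bicycle", "bicycle", "tire",
            "lock", "bicycle", "lock", "bicycle", "bicycle", "tire",
            "bicycle", "bicycle"]

# Run-length encoding: works extremely well on the sorted order-date column.
rle = [(value, sum(1 for _ in run)) for value, run in groupby(order_dates)]
print(rle)  # [('03/22/2015', 9), ('03/23/2015', 6)]

# Dictionary encoding: each product value maps to a small code, but the code
# is still repeated once per row -- the redundancy across rows remains.
dictionary = {v: i for i, v in enumerate(sorted(set(products)))}
encoded = [dictionary[p] for p in products]
print(dictionary)  # {'bicycle': 0, 'lock': 1, 'tire': 2}
print(encoded)     # one small code per row: 15 codes for 15 rows
```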

In contrast, in RainStor, every unique value in the dataset is stored once (and only once), and every record is represented as a binary tree, where a breadth-first traversal of the tree enables the reconstruction of the original record. For example, the table shown above is compressed in RainStor using the forest of binary trees shown below. There are 15 binary trees (the 15 roots of these trees are shown as the green circles at the top of the figure), corresponding to the 15 records in the original dataset.

Forest of Binary Trees Compression

For example, the binary tree corresponding to record 1 is shown on the left side of the figure. The root points to two children --- the internal nodes “A” and “E”. In turn, node “A” points to 03/22/2015 (corresponding to the order date of record 1) and to 03/23/2015 (corresponding to the ship date of record 1). Node “E” points to “bicycle” (corresponding to the product of record 1) and “300” (corresponding to the price of record 1).

Note that records 4, 6, and 7 also have an order date of 03/22/2015 and a ship date of 03/23/2015. Therefore, the roots of the binary trees corresponding to those records also point to internal node “A”. Similarly, note that record 11 also is associated with the purchase of a bicycle for $300. Therefore, the root for record 11 also points to internal node “E”.

These shared internal nodes are what makes RainStor’s compression algorithm fundamentally different from any algorithm that a column-store is capable of performing. Column-stores are forced to create dictionaries and search for patterns only within individual columns. In contrast, RainStor’s compression algorithm finds patterns across different columns --- identifying the relationship between ship date and order date and the relationship between product and price, and leveraging these relationships to share branches in the trees that are formed, thereby eliminating redundant information. RainStor thus has fundamentally more room to search for patterns in the dataset and compress data by referencing these patterns via the (compressed) location of the root of the shared branch.
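The node-sharing idea can be sketched in a few lines (a simplified illustration of the concept, not RainStor’s actual on-disk structures): each unique (order date, ship date) branch and each unique (product, price) branch is stored once, and every record reduces to a pair of references.

```python
records = [
    ("03/22/2015", "03/23/2015", "bicycle", 300),
    ("03/22/2015", "03/24/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "tire", 70),
    ("03/22/2015", "03/23/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "bicycle", 250),
    ("03/22/2015", "03/23/2015", "bicycle", 280),
    ("03/22/2015", "03/23/2015", "tire", 70),
    ("03/22/2015", "03/23/2015", "lock", 18),
    ("03/22/2015", "03/24/2015", "bicycle", 280),
    ("03/23/2015", "03/24/2015", "lock", 18),
    ("03/23/2015", "03/25/2015", "bicycle", 300),
    ("03/23/2015", "03/24/2015", "bicycle", 280),
    ("03/23/2015", "03/24/2015", "tire", 70),
    ("03/23/2015", "03/25/2015", "bicycle", 250),
    ("03/23/2015", "03/25/2015", "bicycle", 280),
]

# Store each unique (order_date, ship_date) node and (product, price) node
# once; each record's "root" is just a pair of references to shared nodes.
date_nodes, item_nodes, roots = {}, {}, []
for order_date, ship_date, product, price in records:
    d = date_nodes.setdefault((order_date, ship_date), len(date_nodes))
    i = item_nodes.setdefault((product, price), len(item_nodes))
    roots.append((d, i))

print(len(records), "records ->", len(date_nodes), "date nodes +",
      len(item_nodes), "item nodes")

# Some operations can run on the compressed form directly, e.g. distinct
# products via a scan of the shared leaves, without rebuilding the records:
print({product for (product, _price) in item_nodes})
```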

For a traditional archiving solution, compression rate is arguably the most important feature (right up there with immutability). Indeed, RainStor’s compression algorithm enables it to be used for archival use-cases, and RainStor provides all of the additional features you would expect from an archiving solution: encryption, LDAP/AD/PAM/Kerberos/PCI authentication and security, audit trails and logging, retention rules, expiry policies, and integrated implementation of existing compliance standards (e.g. SEC 17a-4).

However, what brings RainStor to the next level in the archival solutions market is that it is an “active” archive, meaning that the data that is managed by RainStor can be queried at high performance. RainStor provides a mature SQL stack for native querying of compressed RainStor data, including ANSI SQL 1992 and 2003 parsers, and a full MPP query execution engine. For enterprises with Hadoop clusters, RainStor is fully integrated with the Cloudera and Hortonworks distributions of Hadoop --- RainStor compressed data files can be partitioned over a HDFS cluster, and queried in parallel with HiveQL (or MapReduce or Pig). Furthermore, RainStor integrates with YARN for resource management, with HCatalog for metadata management, and with Ambari for system monitoring and management.

The reason why most archival solutions are not “active” is that the compression algorithms used to reduce the data size before archival are so heavy-weight that significant processing resources must be invested in decompressing the data before it can be queried. Therefore, it is preferable to leave the data archived in compressed form and only decompress it at times of significant need. In general, a user should expect significant query performance reductions relative to querying uncompressed data, to account for the additional decompression time.

The beauty of RainStor’s compression algorithm is that even though it gets compression ratios comparable to other archival products, its compression algorithm is not so heavy-weight that the data must be decompressed prior to querying it. In particular, the binary tree structures shown above are actually fairly straightforward to perform query operations on directly, without requiring decompression prior to access. For example, a count distinct or a group-by operation can be performed via a scan of the leaves of the binary trees. Furthermore, selections can be performed via a reverse traversal of the binary trees from the leaves that match the selection predicate. In general, since there is a one-to-one mapping of records in the uncompressed dataset to the binary trees in RainStor’s compressed files, all query operations can be expressed in terms of operations on these binary trees. Therefore, RainStor queries can benefit from the I/O improvement of scanning in less data (due to the smaller size of the compressed files on disk/memory) without paying the cost to fully decompress these files after they are read from storage. This leads to RainStor’s claims of 2X-100X performance improvement on most queries --- an industry-leading claim in the archival market.

In short, RainStor’s strong claims around compression and performance are backed up by the technology that is used under the covers. Its compression algorithm is able to identify and remove redundancy both within and across columns. Furthermore, the resulting data structures produced by the algorithm are amenable to direct operation on the compressed data. This allows the compressed files to be queried at high performance, and positions RainStor as a leading active-archive solution.

_________________________________________________________________________


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

 

Top 4 Big Data Applications

While there has been much talk about and glorification of big data and the revelations it can bring for competitive advantage, there hasn’t been as much talk about the specific issues and problems that organizations can now address with greater strength and insight.

The following discusses four of the most valuable big data applications or uses of big data to bring value to the enterprise and to give organizations advantages in their competitive marketplace.

1. Big Data Applications and Enhanced Cyber Security

Cyber security should be first on the to-do list of all enterprise IT and cyber security practitioners. When you discuss big data and security, it’s about the ability to gather massive amounts of data in order to discover insights that predict and help prevent cyber attacks. The opportunity for incredible results was always there, but there have now been huge leaps forward in technology. There are now tools and techniques that enable enterprises to stay ahead of the perpetrators, and it is the combination of big data analytics with specific security technologies that yields today’s strongest cyber defense posture.

2. Get The Most Advantageous View of Your Customers

Construct a fuller 360° view of your customers by adding more data sources – internal, external, proprietary, open source. Painting a fuller picture allows the organization to better understand customers and find advantageous means of communicating with them. Understand what, when, and why they buy; why they don’t; and what they might buy next time.

3. Improving The Data Warehouse To Improve Business Insights

Use big data applications to improve decision making. Data stored in many different systems can be brought together for greater access and better decision making. Fold in big data and leverage advanced data warehouse capabilities to increase operational efficiency – and enable new forms of analysis. Use new technologies like big data specific platforms to create the opportunity for analysis of disparate data types. More data and broader data sources yield insights for stronger competitive advantages.

4. Big Sensor Data And Big Advantages Using Big Data Applications

Consider what analyzing machine, sensor, and operational data can do for improving customer service and overall business results. The boom in and current pervasiveness of IT machine data, sensors, meters, GPS devices and much more requires analysis and combination with pertinent internal and external data sources. By employing relatively straightforward big data analytics, organizations can gain real-time visibility into operations, mechanical situations, customer experiences, transactions and behavior. NCR, for example, receives telematics data from devices around the globe to determine the health of its equipment. The benefit? NCR sends digital repair instructions remotely or sends technicians with the correct equipment, to the right device, at the right time. Downtime can be planned or even prevented.

The benefits of big data analytics and tailored big data applications are very real. These are just four of the top uses for the new wealth of data. Many organizations have found many advantages in their explorations with big data.

Learn more about Big Data Analytics.

 

PART FIVE: This is the last blog in my series about Near Real Time data acquisition from SAP. This final blog is co-written with Arno Luijten, who is one of Teradata’s lead engineers. He is instrumental in demystifying the secrets of the elusive SAP clustered and pooled tables.

There is a common misconception that the Pool and Cluster tables in SAP R/3 can only be deciphered by the SAP R/3 application server, giving them an almost mythical status. The phrase that is used all over the ‘help sites’ and forums is “A clustered and a pooled table cannot be read from outside SAP because certain data are clustered and pooled in one field”… which makes replicating these tables pretty pointless – right?

But what exactly are Pooled and Cluster tables in SAP R/3 anyway? We thought we would let SAP give us the answer and searched their public help pages (SAP Help Portal), but that yielded limited results. So we looked further (Googled “cluster tables”) and found the following explanation (Technopedia link):

“Cluster tables are special types of tables present in the SAP data dictionary. They are logical tables maintained as records of the normal SAP tables, which are commonly known as transparent tables. A key advantage of using cluster tables is that data is stored in a compressed format, reducing memory space and the landscape network load for retrieving information from these tables.”

Reading further on the same page, there are six major bullet points describing the features, of which five basically tell you that what we did cannot be done. Luckily, we didn’t let this faze us!

We agree: the purpose of SAP cluster tables is to save space, because of the huge volume of data contained in these tables and the potential negative impact this may have on the SAP R/3 application. We know this because the two most (in)famously large cluster tables are RFBLG and KOCLU, which contain the financial transactions and price conditions. SAP’s ABAP programmers often refer to BSEG (for financial transactions) and KONV (for price conditions).

From the database point of view, these tables do not exist but are contained in the physical tables named RFBLG and KOCLU. Typically these (ABAP) tables contain a lot of data. There are more tables set up in this way, but from a data warehousing point of view these two tables are probably the most relevant. Simply skipping these tables would not be a viable option for most clients.

Knowing the importance of the Pool and Cluster tables, the value of data replication, and the value of operational analytics, we forged ahead with a solution. The encoded data from the ABAP table is stored as a BLOB (Binary Large Object) in the actual cluster table. To decode the data in the BLOB, we wrote a C++ program as a Teradata User Defined Function (UDF), which we call the “Decoder”; it is installed directly within the Teradata database.

There can be a huge volume of data present in the cluster tables (hence the use of cluster logic), and as a result decoding can be a lot of work and can impact the performance of the SAP system. Here we have an extra advantage over SAP R/3, because the Decoder effectively allows us to bypass the ABAP layer and use the power of the Teradata Database. Our MPP capabilities allow decoding to be done massively faster than in the SAP application, so decoding the RFBLG/KOCLU tables in Teradata can save a lot of time.

Over the last few months I have written about data replication, starting with a brief SAP history; I have questioned real-time systems; and I have written about the benefits of data replication and how it is disruptive to analytics for SAP R/3.

In my last blog I looked at the technical complexities we have had to overcome to build a complete data replication solution into Teradata Analytics for SAP® Solutions. It has not been a trivial exercise - but the benefits are huge!

Our Data Replication capability enables operational reporting and managerial analytics from the same source; it increases flexibility, significantly reduces the burden on the SAP R/3 system(s), and of course, delivers SAP data in near-real time for analytics.