big data

 

High Level Data Analytics Graph
(Healthcare Example)

 <---- Click on image to view GRAPH ANIMATION

Michael Porter, in an excellent article in the November 2014 issue of the Harvard Business Review[1], points out that smart connected products are broadening competitive boundaries to encompass related products that meet a broader underlying need. Porter elaborates that the boundary shift is not only from the functionality of discrete products to cross-functionality of product systems, but in many cases expanding to a system of systems such as a smart home or smart city.

So what does all this mean from a data perspective? In that same article, Porter mentions that companies seeking leadership need to invest in capturing, coordinating, and analyzing more extensive data across multiple products and systems (including external information). The key take-away is that the movement of gaining competitive advantage by searching for cross-functional or cross-system insights from data is only going to accelerate and not slow down. Exploiting cross-functional or cross-system centrality of data better than anyone else will continue to remain critical to achieving a sustainable competitive advantage.

Understandably, as technology changes, the mechanisms and architecture used to exploit this cross-system centrality of data will evolve. Current technology trends point to a need for a data & analytic-centric approach that leverages the right tool for the right job and orchestrates these technologies to mask complexity for the end users; while also managing complexity for IT in a hybrid environment. (See this article published in Teradata Magazine.)

As businesses embrace the data & analytic-centric approach, the following types of questions will need to be addressed: How can business and IT decide on when to combine which data and to what degree? What should be the degree of data integration (tight, loose, non-coupled)? Where should the data reside and what is the best data modeling approach (full, partial, need based)? What type of analytics should be applied on what data?

Of course, to properly address these questions, an architecture assessment is called for. But for the sake of going beyond the obvious, one exploratory data point in addressing such questions could be to measure and analyze the cross-functional/cross-system centrality of data.

By treating data and analytics as a network of interconnected nodes in Gephi[2], the connectedness between data and analytics can be measured and visualized for such exploration. We can examine a statistical metric called Degree Centrality[3] which is calculated based on how well an analytic node is connected.

The high level sample data analytics graph demonstrates the cross-functional Degree Centrality of analytics from an Industry specific perspective (Healthcare). It also amplifies, from an industry perspective, the need for organizations to build an analytical ecosystem that can easily harness this cross-functional Degree Centrality of data analytics. (Learn more about Teradata’s Unified Data Architecture.)

In the second part of this blog post series we will walk through a zoomed-in view of the graph, analyze the Degree Centrality measurements for sample analytics, and draw some high-level data architecture implications.

[1] https://hbr.org/2014/11/how-smart-connected-products-are-transforming-competition

[2] Gephi is a tool to explore and understand graphs. It is a complementary tool to traditional statistics.

[3] Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has).

Ojustwin blog bio

Ojustwin Naik (MBA, JD) is a Director with 15 years of experience in planning, development, and delivery of Analytics. He has experience across multiple industries and is passionate at nurturing a culture of innovation based on clarity, context, and collaboration.

 

It is well-known that there are two extreme alternatives for storing database tables on any storage media: storing it row-by-row (as done by traditional “row-store” technology) or storing it column-by-column (as done by recently popular “column-store” implementations). Row-stores store the entire first row of the table, followed by the entire second row of the table, etc. Column-stores store the entire first column of the table, followed by the entire second column of the table, etc. There have been huge amounts of research literature and commercial whitepapers that discuss the various advantages of these alternative approaches, along with various proposals for hybrid solutions (which I discussed in more detail in my previous post).

abadi blog clamp image abadiDespite the many conflicting arguments in favor of these different approaches, there is little question that column-stores compress data much better than row-stores. The reason is fairly intuitive: in a column-store, entire columns are stored contiguously --- in other words, a series of values from the same attribute domain are stored consecutively. In a row-store, values from different attribute domains are interspersed, thereby reducing the self-similarity of the data. In general the more self-similarity (lower entropy) you have in a dataset, the more compressible it is. Hence, column-stores are more compressible than row-stores.

In general, compression rates are very sensitive to the particular dataset that is being compressed. Therefore it is impossible to make any kind of guarantees about how much a particular database system/compression algorithm will compress an arbitrary dataset. However, as a general rule of thumb, it is reasonable to expect around 8X compression if a column-store is used on many kinds of datasets. 8X compression means that the compressed dataset is 1/8th the original size, and scan-based queries over the dataset can thus proceed approximately 8 times as fast. This stellar compression and resulting performance improvements are a major contributor to the recent popularity of column-stores.

It is precisely this renowned compression of column-stores which makes the compression rate of RainStor (a recent Teradata acquisition) so impressive in comparison. RainStor claims a factor of 5 times more compression than what column-stores are able to achieve on the same datasets, and 40X compression overall.

Although the reason why column-stores compress data better than row-stores is fairly intuitive, the reason why RainStor can compress data better than column-stores is less intuitive. Therefore, we will now explain this in more detail.

Take for example the following table, which is a subset of a table describing orders from a particular retail enterprise that sells bicycles and related parts. (A real table would have many more rows and columns, but we keep this example simple so that it is easier to understand what is going on).

Record Order date Ship date Product Price
1 03/22/2015 03/23/2015 “bicycle” 300
2 03/22/2015 03/24/2015 “lock” 18
3 03/22/2015 03/24/2015 “tire” 70
4 03/22/2015 03/23/2015 “lock” 18
5 03/22/2015 03/24/2015 “bicycle” 250
6 03/22/2015 03/23/2015 “bicycle” 280
7 03/22/2015 03/23/2015 “tire” 70
8 03/22/2015 03/23/2015 “lock” 18
9 03/22/2015 03/24/2015 “bicycle” 280
10 03/23/2015 03/24/2015 “lock” 18
11 03/23/2015 03/25/2015 “bicycle” 300
12 03/23/2015 03/24/2015 “bicycle” 280
13 03/23/2015 03/24/2015 “tire” 70
14 03/23/2015 03/25/2015 “bicycle” 250
15 03/23/2015 03/25/2015 “bicycle” 280

 

The table contains 15 records and shows four attributes --- the order and ship dates of a particular product; the product that was purchased, and the purchase price. Note that there is a relationship between some of these columns --- in particular the ship date is usually 1 or 2 days after the order date, and that the price of various products are usually consistent across orders, but there may be slight variations in price depending on what coupons the customer used to make the purchase.

A column-store would likely use “run-length encoding” to compress the order date column. Since records are sorted by order date, this would compress the column to its near-minimum --- it can be compressed as (03/22/2015, 9); (03/23/2015, 6) --- which indicates that 03/22/2015 is repeated 9 straight times, followed by 03/23/2015 which is repeated 6 times. The ship date column, although not sorted, is still very compressible, as each value can be expressed using a small number of bits in terms of how much larger (or smaller) it is from the previous value in the column. However, the other two columns --- product and price --- would likely be compressed using a variant of dictionary compression, where each value is mapped to the minimal number of bits needed represent it. For large datasets, where there are many unique values for price (or even for product), the number of bits needed to represent a dictionary entry is non-trivial, and the same dictionary entry is repeated in the compressed dataset for every repeated value in the original dataset.

In contrast, in RainStor, every unique value in the dataset is stored once (and only once), and every record is represented as a binary tree, where a breadth-first traversal of the tree enables the reconstruction of the original record. For example, the table shown above is compressed in RainStor using the forest of binary trees shown below. There are 15 binary trees (each of the 15 roots of these trees are shown using the green circles at the top of the figure), corresponding to the 15 records in the original dataset.abadi forest trees blog

Forest of Binary Trees Compression

For example, the binary tree corresponding to record 1 is shown on the left side of the figure. The root points to two children --- the internal nodes “A” and “E”. In turn, node “A” points to 03/22/2015 (corresponding to the order date of record 1), and to 03/23/2015 (corresponding to the ship date of record 1). Node “E” points to “bicycle” (corresponding to the product of record 1) and “300” corresponding to the price of record 1).

Note that records 4, 6, and 7 also have an order date of 03/22/2015 and a ship date of 03/23/2015. Therefore, the roots of the binary trees corresponding to those records also point to internal node “A”. Similarly, note that record 11 also is associated with the purchase of a bicycle for $300. Therefore, the root for record 11 also points to internal node “E”.

These shared internal nodes are what makes RainStor’s compression algorithm fundamentally different from any algorithm that a column-store is capable of performing. Column-stores are forced to create dictionaries and search for patterns only within individual columns. In contrast, RainStor’s compression algorithm finds patterns across different columns --- identifying the relationship between ship date and order date and the relationship between product and price, and leveraging these relationships to share branches in the trees that are formed, thereby eliminating redundant information. RainStor thus has fundamentally more room to search for patterns in the dataset and compress data by referencing these patterns via the (compressed) location of the root of the shared branch.

For a traditional archiving solution, compression rate is arguably the most important feature (right up there with immutability). Indeed, RainStor’s compression algorithm enables it to be used for archival use-cases, and RainStor provides all of the additional features you would expect from an archiving solution: encryption, LDAP/AD/PAM/Kerberos/PCI authentication and security, audit trails and logging, retention rules, expiry policies, and integrated implementation of existing compliance standards (e.g. SEC 17a-4).

However, what brings RainStor to the next level in the archival solutions market is that it is an “active” archive, meaning that the data that is managed by RainStor can be queried at high performance. RainStor provides a mature SQL stack for native querying of compressed RainStor data, including ANSI SQL 1992 and 2003 parsers, and a full MPP query execution engine. For enterprises with Hadoop clusters, RainStor is fully integrated with the Cloudera and Hortonworks distributions of Hadoop --- RainStor compressed data files can be partitioned over a HDFS cluster, and queried in parallel with HiveQL (or MapReduce or Pig). Furthermore, RainStor integrates with YARN for resource management, with HCatalog for metadata management, and with Ambari for system monitoring and management.

The reason why most archival solutions are not “active” is that the compression algorithms used to reduce the data size before archival are so heavy-weight, that significant processing resources must be invested in decompressing the data before it can be queried. Therefore, it is preferable to leave the data archived in compressed form, and only decompress it at times of significant need. In general, a user should expect significant query performance reductions relative to querying uncompressed data, in order to account for the additional decompression time.

The beauty of RainStor’s compression algorithm is that even though it gets compression ratios comparable to other archival products, its compression algorithm is not so heavy-weight that the data must be decompressed prior to querying it. In particular, the binary tree structures shown above are actually fairly straightforward to perform query operations on directly, without requiring decompression prior to access. For example, a count distinct or a group-by operation can be performed via a scan of the leaves of the binary tees. Furthermore, selections can be performed via a reverse traversal of the binary trees from the leaves that match the selection predicate. In general, since there is a one-to-one mapping of records in the uncompressed dataset to the binary trees in RainStor’s compressed files, all query operations can be expressed in terms of operations on these binary trees. Therefore, RainStor queries can benefit from the I/O improvement of scanning in less data (due to the smaller size of the compressed files on disk/memory) without paying the decompression cost to fully decompress these compressed files after they are read from storage. This leads to RainStor’s claims of 2X-100X performance improvement on most queries --- an industry-leading claim in the archival market.

In short, RainStor’s strong claims around compression and performance are backed up by the technology that is used under the covers. Its compression algorithm is able to identify and remove redundancy both within and across columns. Furthermore, the resulting data structures produced by the algorithm are amenable to direct operation on the compressed data. This allows the compressed files to be queried at high performance, and positions RainStor as a leading active-archive solution.

_________________________________________________________________________

daniel abadi crop BLOG bio mgmt

Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and a M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

 

Top 4 Big Data Applications There has been much talk and glorification of big data and the revelations that this can bring a new day to competitive advantage, there hasn’t been as much talk about the specific issues and problems that organizations can now address with great strength and insight.

The following discusses four of the most valuable big data applications or uses of big data to bring value to the enterprise and to give organizations advantages in their competitive marketplace.

1. Big Data Applications and Enhanced Cyber Security

Cyber security should be first on the to do list of all enterprise IT and IT cyber security practitioners. When you discuss big data and security, it’s about the ability to gather massive amounts of data in order to discover insights that predict and help prevent cyber attacks. The opportunity for incredible results was always there, but now there have been huge leaps forward in technology. There are now tools and techniques that enable enterprises to stay ahead of the perpetrators. A combination of big data analytics with specific security technologies that yields today’s strongest cyber defense posture.

2. Get The Most Advantageous View of Your Customers

Construct a fuller 360o view of your customers by adding more data sources – internal, external, proprietary, open source. Paint a fuller picture, allowing the organization to better understand customers and find advantageous means of communicating with them. Understand what when and why they buy, why they don’t or what they might buy next time.

3. Improving The Data Warehouse To Improve Business Insights

Use big data applications to improve decision making. Data stored in many different systems can be brought together for greater access and better decision making. Fold in big data and leverage advanced data warehouse capabilities to increase operational efficiency – and enable new forms of analysis. Use new technologies like big data specific platforms to create the opportunity for analysis of disparate data types. More data and broader data sources yield insights for stronger competitive advantages.

4. Big Sensor Data And Big Advantages Using Big Data Applications

Consider the opportunities analyzing things like machine and sensor or operational data can do for improving customer service and overall business results. The boom in and current pervasiveness of IT machine data, sensors, meters, GPS devices and myriad more requires analysis and combination with pertinent internal and external data sources. By employing not so complicated big data analytics, organizations can gain real-time visibility into operations, mechanical situations, customer experiences, transactions and behavior. NCR now receives telematics data from devices around the globe to determine the health of the equipment. The benefit? NCR sends digital repair instructions remotely or sends technicians with the correct equipment, to the right device, at the right time. Downtime can be planned or even prevented.

The benefits of big data analytics and tailored big data applications are very real. These are just four of the top uses for the new wealth of data. Many organizations have found many advantages in their explorations with big data.

Learn more about Big Data Analytics.

 

PART FIVE: This is the last blog in my series about Near Real Time data acquisition from SAP. This final blog is co-written with Arno Luijten, who is one of Teradata’s lead engineers. He is instrumental in demystifying the secrets of the elusive SAP clustered and pooled tables.

There is a common misconception that the Pool and Cluster tables in SAP R/3 can only be deciphered by the SAP R/3 application server, giving them an almost mythical status. The phrase that is used all over the ‘help sites’ and forums is “A clustered and a pooled table cannot be read from outside SAP because certain data are clustered and pooled in one field”… which makes replicating these tables pretty pointless – right?

But what exactly are Pooled and Cluster tables in SAP R/3 anyway? We thought we would let SAP give us the answer and searched their public help pages (SAP Help Portal). But that yielded limited results, so we looked further -- (Googled cluster tables) and found the following explanation (Technopedia-link):

“Cluster tables are special types of tables present in the SAP data dictionary. They are logical tables maintained as records of the normal SAP tables, which are commonly known as transparent tables. A key advantage of using cluster tables is that data is stored in a compressed format, reducing memory space and the landscape network load for retrieving information from these tables.”

Reading further on the same page, there are six major bullet points describing the features - of which fiveof them basically tell you that what we did cannot be done. Luckily, we didn’t let this phase us!

We agree: the purpose of SAP cluster tables is to save space because of the huge volume of the data contained in these tables and the potential negative impact that this may have on the SAP R/3 application. We know this because the two most (in-) famously large Cluster tables are RFBLG and KOCLU which contain the financial transactions and price conditions. SAP’s ABAP programmers often refer to BSEG (for financial transactions) and KONV (for the price-conditions).

From the database point of view, these tables do not exist but are contained in the physical tables named RFBLG and KOCLU. Typically these (ABAP) tables contain a lot of data. There are more tables set up in this way, but from a data warehousing point of view these two tables are probably the most relevant. Simply skipping these tables would not be a viable option for most clients.

Knowing the importance of the Pool and Cluster table, the value of data replication, and the value of operational analytics, we forged ahead with a solution. The encoded data from the ABAP table is stored as a BLOB “Binary Large Object” in the actual cluster table. To decode the data in the BLOB we wrote a C++ program as a Teradata User Defined Function (UDF) which we call the “Decoder” and it is installed directly within the Teradata database.

There can be a huge volume of data present in the cluster tables (hence the usage of cluster logic) and as a result decoding can be a lot of work and can have an impact on the performance of the SAP system. Here we have an extra advantage over SAP R/3 because the Decoder effectively allows us to bypass the ABAP layer and use the power of the Teradata Database. Our MPP capabilities allow decoding to be done massively faster than the SAP application, so decoding the RFBLG/KOCLU tables in Teradata can save a lot of time.

Over the last few months I have written about data replication starting with a brief SAP history, I questioned real-time systems, and I have written about the benefits of data replication and how it is disruptive to analytics for SAP R/3.

In my last blog I looked at the technical complexities we have had to overcome to build a complete data replication solution into Teradata Analytics for SAP® Solutions. It has not been a trivial exercise - but the benefits are huge!

Our Data Replication capability enables operational reporting and managerial analytics from the same source; it increases flexibility, significantly reduces the burden on the SAP R/3 system(s), and of course, delivers SAP data in near-real time for analytics.

The Value of Big Data Unlocked

Posted on: March 17th, 2015 by Chris Twogood No Comments

 

The Value of Big Data UnlockedIt’s on every enterprise list of Things To Tackle in 2015. It’s every organization’s technological priority because it’s commonly considered important to future growth and competitive positioning. The value of Big data is big news —without a doubt.

Even though most business executives think realizing benefits from the value of big data is long overdue, look at the low participation figures:

  •  According to a recent IDG Enterprise survey, only 14 percent of respondents said that their enterprises had already deployed big data solutions3
  • Only 44 percent of enterprises report their organizations are in the planning or implementation stage of big data solutions1
  • A significant 85 percent of executives surveyed reported facing significant obstacles in dealing with big data, including security issues, a shortage of trained staff, and the need to develop new internal capabilities.

Arguably, all the hesitation links back to the complexity of the data and finding the solutions to manage it. If there’s so much upside, why aren’t more companies further along in their efforts to exploit their big data? Most organizations have not yet acquired the technology or expertise required to unravel the complexity much less leverage data to its full potential.

Today, in an effort to realize real world benefits (value from big data) semi-structured data such as social profiles and Twitter feeds join unstructured data like images and PDFs to add intelligence to an organization’s structured data from its traditional databases. To further exacerbate the complexity of the situation, big data is generated at high velocity and collected at frequent intervals, making the volume of the new data types nearly unmanageable.

Additionally, businesses need to unlock existing data from silos and gain a holistic view of all the new information so that they can make unique associations and ask important questions about customers and products. They need a technology solution that integrates their data stores, identifies behavior patterns, and draws meaningful associations and inferences. It’s important to understand that the value of big data will go beyond sophisticated reporting. It will advance from historical insight to being highly predictive, enabling managers to make the best decisions possible.

Teradata has created an advanced solution which deals with all the complexities and hurdles. It seamlessly integrates the variety, volume and velocity of the data in an integrated or unified data architecture. It bridges the hurdles found with difficult programming languages, extreme processing needs and customized data storage. Put simply, it provides a high-performance big data analytics system easily appreciated by both IT professionals and real-world business users.

Because Teradata’s Unified Data ArchitectureTM lets business users ingest and process data, it makes it faster to discover insights and act upon them.

With the majority of organizations just beginning to get their feet wet, there is still sizeable competitive advantage to be gained from unlocking insights from the almost limitless cache of data. Real world experiences reveal real world advantages:

Average Fortune 1000 companies can increase annual net income by $65.67 Million with an increase of just 10 percent in data accessibility2

  •  Top retailers have increased operating margins by 60 percent through monitoring customers’ in-store movements and combining that data with transaction records to determine optimal product placement, product mix and pricing3
  •  The U.S. healthcare industry stands to add $300 billion in revenues by leveraging big data4
  • Financial institutions are reducing customer churn by using data analytics to evaluate consumer and criminal behavior.

A solution like Teradata’s Unified Data Architecture – where users can ask any question at any time to unlock new and valuable business insights – is a painless catalyst to discovering new competitive advantages and profit. Every organization desires higher productivity, lower costs, and an expanded horizon of new opportunities. It’s a big advantage to be able open up discovery to users across the enterprise, not only the IT elite.

Learn more about Teradata’s Unified Data Architecture.

1. http://www.idgenterprise.com/press/big-data-initiatives-high-priority-for-enterprises-but-majority-will-face- implementation-challenges

2. http://www.forbes.com/sites/ciocentral/2012/07/09/will-big-data-actually-live-up-to-its-promise/2/
3. http://www.truaxis.com/blog/12764/big-profits-from-big-data/
4. http://www.information-management.com/news/big-data-ROI-Nucleus-automation-predictive-10022435-1.html

Hybrid Row-Column Stores: A General and Flexible Approach

Posted on: March 10th, 2015 by Daniel Abadi No Comments

 

During a recent meeting with a post-doc in my lab at Yale, he reminded me that this summer will mark the 10-year anniversary of the publication of C-Store in VLDB 2005. C-Store was by no means the first ever column-store database system (the column-store idea has been around since the 70s --- nearly as long as relational database systems), but it was quite possibly the first proposed architecture of a column-store designed for petabyte-scale data analysis. The C-Store paper has been extremely influential, with close to every major database vendor developing column-oriented extensions to their core database product in the past 10 years, with most of them citing C-Store (along with other influential systems) in their corresponding research white-papers about their column-oriented features.

Given my history with the C-Store project, I surprised a lot of people when some of my subsequent projects such as HadoopDB/Hadapt did not start with a column-oriented storage system from the beginning. For example, industry analyst Curt Monash repeatedly made fun of me on this topic (see, e.g. http://www.dbms2.com/2012/10/16/hadapt-version-2/).

In truth, my love and passion for column-stores has not diminished since 2005. I still believe that every analytical database system should have a column-oriented storage option. However, it is naïve to think that column-oriented storage is always the right solution. For some workloads --- especially those that scan most rows of a table but only a small subset of the columns, column-stores are clearly preferable. On the other hand, there any many workloads that contain very selective predicates and require access to the entire tuple for those rows which pass the predicate. For such workloads, row-stores are clearly preferable.

abadi new March 10 first graphicThere is thus general consensus in the database industry that a hybrid approach is needed. A database system should have both column-oriented and row-oriented storage options, and the optimal storage can be utilized depending on the expected workload.

Despite this consensus around the general idea of the need for a hybrid approach, there is a glaring lack of consensus about how to implement the hybrid approach. There have been many different proposals for how to build hybrid row/column-oriented database systems in the research and commercial literature. A sample of such proposals include:

(1) A fractured mirrors approach where the same data is replicated twice --- once in a column-oriented storage layer and once in a row-oriented storage layer. For any particular query, data is extracted from the optimal storage layer for that query, and processed by the execution engine.
(2) A column-oriented simulation within a row-store. Let’s say table X contains n columns. X gets replaced by n new tables, where each new table contains two columns --- (1) a row-identifier column and (2) the column values for one of the n columns in the original table. Queries are processed by joining together on the fly the particular set of these two-column tables that correspond to the columns accessed by that query.
(3) A “PAX” approach where each page/block of data contains data for all columns of a table, but data is stored column-by-column within the page/block.
(4) A column-oriented index approach where the base data is stored in a row-store, but column-oriented storage and execution can be achieved through the use of indexes.
(5) A table-oriented hybrid approach where a database administrator (DBA) is given a choice to store each table row-by-row or column-by-column, and the DBA makes a decision based on how they expect the tables to be used.
In the rest of this post, I will overview Teradata’s elegant hybrid row/column-store design and attempt to explain why I believe it is more flexible than the above-mentioned approaches.

The flexibility of Teradata’s approach is characterized by three main contributions.

1: Teradata views the row-store vs. column-store debate as two extremes in a more general storage option space.

The row-store extreme stores each row continuously on storage and the column-store extreme stores each column continuously on storage. In other words, row-stores maintain locality of horizontal access of a table, and column-stores maintain locality of vertical access of table. In general however, the optimal access-locality could be on a rectangular region of a table.

adadi second graphic March 10

Figure 1: Row and Column Stores (uncompressed)

To understand this idea, take the following example. Many workloads have frequent predicates on date attributes. By partitioning the rows of a table according to date (e.g. one partition per day, week, month, quarter, or year), those queries that contain predicates on date can be accelerated by eliminating all partitions corresponding to dates outside the range of the query, thereby efficiently utilizing I/O to read in data from the table from only those partitions that have data matching the requested data range.

However, different queries may analyze different table attributes for a given date range. For example, one query may examine the total revenue brought in per store in the last quarter, while another query may examine the most popular pairs of widgets bought together in each product category in the last quarter. The optimal storage layout for such queries would be to have store and revenue columns stored together in the same partition, and to have product and product category columns stored together in the same partition. Therefore we want both column-partitions (store and revenue in one partition and product and product category in a different partition) and row-partitions (by date).

This arbitrary partitioning of a table by both rows and columns results in a set of rectangular partitions, each partition containing a subset of rows and columns from the original table. This is far more flexible than a “pure” column-store that enforces that each column be stored in a different physical or virtual partition.

Note that allowing arbitrary rectangular partitioning of a table is a more general approach than a pure column-store or a pure row-store. A column-store is simply a special type of rectangular partitioning where each partition is a long, narrow rectangle around a single column of data. Row-oriented storage can also be simulated with special types of rectangles. Therefore, by supporting arbitrary rectangular partitioning, Teradata is able to support “pure” column-oriented storage, “pure” row-oriented storage, and many other types of storage between these two extremes.

2: Teradata can physically store each rectangular partition in “row-format” or “column-format.”

One oft-cited advantage of column-stores is that for columns containing fixed-width values, the entire column can be represented as a single array of values. The row id for any particular element in the array can be determined directly by the index of the element within the array. Accessing a column in an array-format can lead to significant performance benefits, including reducing I/O and leveraging the SIMD instruction set on modern CPUs, since expression or predicate evaluation can occur in parallel on multiple array elements at once.

Another oft-cited advantage of column-stores (especially within my own research --- see e.g. http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf ) is that column-stores compress data much better than row-stores because there is more self-similarity (lower entropy) of data within a column than across columns, since each value within a column is drawn from the same attribute domain. Furthermore, it is not uncommon to see the same value repeat multiple times consecutively within a column, in which case the column can be compressed using run-length encoding --- a particularly useful type of compression since it can both result in high compression ratios and also is trivial to operate on directly, without requiring decompression of the data.

Both of these advantages of column-stores are supported in Teradata when the column-format is used for storage within a partition. In particular, multiple values of a column (or a small group of columns) are stored continuously in an array within a Teradata data structure called a “container”. Each container comes with a header indicating the row identifier of the first value within the container, and the row identifiers of every other value in the container can be deduced by adding their relative position within the container to the row identifier of the first value. Each container is automatically compressed using the optimal column-oriented compression format for that data, including run-length encoding the data when possible.

abadi third graphic March 10

Figure 2: Column-format storage using containers.

However, one disadvantage of not physically storing the row identifier next to each value is that extraction of a value given a row identifier requires more work, since additional calculations must be performed to extract the correct value from the container. In some cases, these additional calculations involve just positional offsetting; however, in some cases, the compressed bits of the container have to be scanned in order to extract the correct value. Therefore Teradata also supports traditional row-format storage within each partition, where the row identifier is explicitly stored alongside any column values associated with that row. When partitions are stored using this “row format”, Teradata’s traditional mechanisms for quickly finding a row given a row identifier can be leveraged.

In general, when the rectangular partitioning scheme results in wide rectangles, row format storage is recommended, since the overhead of storing the row id with each row is amortized across the breadth of the row, and the benefits of array-oriented iteration through the data are minimal. But when the partitioning scheme results in narrow rectangles, column-format storage is recommended, in order to get the most out of column-oriented array iteration and compression. Either way --- having a choice between row format and column format for each partition further improves the flexibility of Teradata’s row/columnar hybrid scheme.

3: Teradata enables traditional primary indexing for quick row-access even when column-oriented partitioning is used.

Many column-stores do not support primary indexes due to the complexity involved in moving around records as a result of new inserts into the index. In contrast, Teradata Database 15.10 supports two types of primary indexing when a table has been partitioned to AMPs (logical servers) by the hash value of the primary index attribute. The first, called CPPI, maintains all row and column partitions on an AMP sorted by the hash value of the primary index attribute. These hash values are stored within the row identifier for the record, which enables each column partition to independently maintain the same sort order without explicitly communicating with each other. Since the data is sorted by the hash of the primary index attribute, finding particular records for a given value of the primary index attribute is extremely fast. The second, called CPPA, does not sort by the hash of the primary index attribute --- therefore the AMP that contains a particular record can be quickly identified given a value of the primary index attribute. However, further searching is necessary within the AMP to find the particular record. This searching is limited to the non-eliminated, nonempty column and row partitions. Finding a particular record given a row id for both CPPI and CPPA is extremely fast since, in either case, the records are in row id order.

Combined, these three features make Teradata’s hybrid solution to the row-store vs. column-store tradeoff extremely general and flexible. In fact, it’s possible to argue that there does not exist a more flexible hybrid solution from a major vendor on the market. Teradata has also developed significant flexibility inside its execution engine --- adapting to column-format vs. row-format input automatically, and using optimal query execution methods depending on the format-type that a particular query reads from.
====================================================================================

daniel abadi crop BLOG bio mgmt

Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and a M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

 

(Part 2 of a post illustrating how marketers are creatively leveraging big data to secure competitive advantages.)

Big data leveraged into insights have a strong likelihood to distinguish organizations from their competitors. Because of the infancy of this movement, few big data insights to date have been turned into marketing advantages - so early entrants into big data marketing have a distinct advantage. Consider the following big data marketing examples for a view into how other early adopter enterprises have sought advantage from big data:

1. Next Generation Customer Retargeting

As big data analytics become more sophisticated, marketers will find better ways to retarget customers. Imagine, for example, retargeting based on items that are viewed online but not clicked on. This and other tactics will provide more customizable methods than the retargeting currently being used.

2. Use Heat Map Technology to Track In-Store Customer Preferences

Use on premise camera systems with a heat map technology to view in-store customer traffic – just as websites use technology to register online activity. This offline traffic information can be contrasted with online data to tell retailers how products perform online versus offline in order to adjust marketing programs.

3. Leverage Geospatial Data to Communicate with Customers

Use geospatial data to prepare targeted offers AND drive online customers to store locations. Wireless carriers have increased revenue per user with targeted marketing campaigns and combined offline and online marketing efforts.

4. Analyze Social Media to Increase Revenue

Use social network analysis to identify and impact influential customers. Wireless carriers have found that by implementing social analysis they can increase the revenue that their top 10 percent of influential customers impact – from 35 percent to an impressive 80 percent.

5. Focus On Conversions

Marketers should talk in the language of conversions and place their focus there. “What is the source of leads that has the highest conversion?” “What type of content inspires the strongest brand advocates?” “Which channels host the highest rate of conversions.” Use big data to inform and drive all aspects of conversions.

Look at 6 Ways Big Data Marketing Helps Companies Be Competitive: Part 1 for more examples of leveraging big data marketing.

 

6 Ways Big Data Marketing Helps Companies Be Competitive: Part 1

Posted on: February 24th, 2015 by Chris Twogood No Comments

 

Big data – business changing data – is giving marketers new ways to be innovative and step ahead of competitors. A creative strategy or advertising campaign is only scratching the surface of mechanisms available today to drive revenue. Effective CMOs must appreciate the power of new and diverse data sources and demand marketing directors interpret and use statistical business and customer insights to create smart strategies and quality predictive analysis.

Understanding some of the more clever big data marketing examples helps to illustrate how marketers should be thinking analytically and creatively with non-traditional data. Consider the following big data marketing examples:

1. Measure Social Media Impact

Companies can measure the impact of social media with custom analytics solutions or social network analysis.

2. Identify Your Brand Evangelists

Identify alpha influencers and use these individuals in active marketing campaigns. Find alpha influencers not just through traditional transactions (recent purchases, customer service calls) but also through social media.

3. Translate Big Data Insights into Actionable Marketing Tactics

Translate big data insights into actionable marketing tactics with teams of different disciplines. The most successful are teams that work fast and are highly iterative – business, IT, and analytics specialists rapidly review real-world findings, recalibrate analyses, adjust assumptions, and then test outcomes.

4. Create Customer Buying Projections

Use historic behavioral data for a defined target as an indicator for behavior against a different category of product offering. For example, test payment history or upgrade likelihood for a utility service as indicators of behavior for an entertainment offering or emerging credit offering. Test into success.

5. Understand True Value of Different Marketing Channel

Combine sales data from traditional media and social-media sites to create a model that highlights the impact of traditional media versus activity reflected on social media (like call center interactions). Bad customer experiences are more powerful sales drivers than traditional media activity. Spending behind improving customer service can be more effective than funding advertising – to drive revenue.

6. Pinpoint Sales Opportunities by Zip Code

Rather than overloading sales reps with reams of data and complex models to interpret, create powerful sales tools with simple, visual interfaces that pinpoint new-customer potential by zip code. It’s a proven tactic for increased sales.

Look for Part 2 for more examples of clever examples of big data marketing. In the meantime, see other big data examples.

 

Data-Driven Design: Smart Modeling in the Fast Lane

Posted on: February 24th, 2015 by Guest Blogger 2 Comments

 

In this blog, I would like to discuss a different way of modeling data regardless of the method such as Third Normal Form or Dimensional or Analytical datasets. This new way of data modeling will cut down the development cycles by avoiding rework, be agile, and produce higher quality solutions. It’s a discipline that looks at requirements and data as input into the design.

A lot of organizations have struggled getting the data model correct, especially for application, which has a big impact on different phases of the system development lifecycle. Generally, we elicit requirements first where the IT team and business users together create a business requirements document (BRD).

Business users explain business rules and how source data should be transformed into something they can use and understand. We then create a data model using the BRD and produce a technical requirements documentation which is then used to develop the code. Sometimes it takes us over 9 months before we start looking at the source data. This delay in engaging data almost every time causes rework since the design was based only on requirements. The other extreme end of this is when a design is based only on data.

We have always either based the design solely on requirements or data but hardly ever using both methods. We should give the business users what they want and yet be mindful of the realities of data.

It has been almost impossible to employ both methods for different reasons such as traditional waterfall method where BDUF (Big Design Up Front) is introduced without ever looking at the data. Other reasons are we work with data but the data is either created for proof of concept or testing which is farther from the realities of production data. To do this correctly, we need JIT (Just in Time) or good enough requirements and then get into the data quickly and mold our design based on both the requirements and data.

The idea is to get into the data quickly and validate the business rules and assumptions made by business users. Data-driven design is about engaging the data early. It is more than data profiling, as data-driven design inspects and adapts in context of the target design. As we model our design, we immediately begin loading data into it, often by day one or two of the sprint. That is the key.

Early in the sprint, data-driven design marries the perspective of the source data to the perspective of the business requirements to identify gaps, transformation needs, quality issues, and opportunities to expand our design. End users generally know about the day to day business but are not aware of the data.

The data-driven design concept can be used whether an organization is practicing waterfall or agile methodology. It obviously fits very nicely with the agile methodologies and Scrum principles such as inspect and adapt. We inspect the data and adapt the design accordingly. Using DDD we can test the coverage and fit of the target schema, from the analytical user perspective. By encouraging the design and testing of target schema using real data in quick, iterative cycles, the development team can ensure that target schema designed for implementation have been thoroughly reviewed, tested and approved by end-users before project build begins.

Case Study: While working with a mega-retailer, in one of the projects I was decomposing business questions. We were working with promotions and discounts subject area and we had two metrics: Promotion Sales Amount and Commercial Sales Amount. Any item that was sold as part of a promotion is counted towards Promotion Sales and any item that is sold as regular is counted towards Commercial Sales. Please note that Discount Amount and Promotion Sales Amount are two very different metrics. While decomposing, the business user described that each line item within a transaction (header) would have the discount amount evenly proportioned.

Data driven design graphicFor example – Let’s say there is a promotion where if you buy 3 bottles of wine then you get 2 bottles free. In this case, according to the business user, there would be discount amount evenly proportioned across the 5 line items - thus indicating that these 5 line items are on promotion and we can count the sales of these 5 line items toward Promotion Sales Amount.

This wasn’t the case when the team validated this scenario against the data. We discovered that the discount amount was only present for the “get” items and not for the “buy” items. Using our example, discount amount was provided for the 2 free bottles (get) and not for 3 bottles (buy). This makes it hard to calculate Promotion Sales Amount for the 3 “buy” items since it wasn’t known if the customer just bought 3 items or 5 items unless we looked at all the records, which was in millions every day.

What if the customer bought 6 bottles of wine so ideally 5 lines are on promotion and the 6th line (diagram above) is commercial sales or regular sales? Looking at the source data there was no way of knowing which transaction lines are part of promotion and which aren’t.

After this discovery, we had to let the business users know about the inaccuracy for calculating Promotion Sales Amount. Proactively, we designed a new fact to accommodate for the reality of data. There were more complicated scenarios that the team discovered that the business user hadn’t thought of.

In the example above, we had the same item for “buy” and “get” which was wine. We found a scenario, where a customer bought a 6 pack of beer then got a glass free. This further adds to the complexity. After validating the business rules against source data, we had to request additional data for “buy” and “get” list to properly calculate Promotion Sales Amount.

Imagine finding out that you need additional source data to satisfy business requirements nine months into the project. Think about change request for data model, development, testing etc. With DDD, we found this out within days and adapted to the “data realities” within the same week. The team also discovered that the person at the POS system could either pick up a wine bottle and times it by 7 or he could “beep” each bottle one by one. This inconsistency makes a lot of difference such as one record versus 7 records in the source feed.

There were other discoveries we made along the way as we got into the data and designed the target schema while keeping the reality of the data in mind. We were also able to ensure that the source system has the right available grain that the business users required.

Grover Sachin bio pic blog small

Sachin Grover leads the Teradata Agile group within Teradata. He has been with Teradata for 5 years and has worked on development of Solution Modeling Building Blocks and helped define best practices for semantic data models on Teradata. He has over 10 years of experience in IT industry as a BI / DW architect, modeler, designer, analyst, developer and tester.