
 

PART FIVE: This is the last blog in my series about Near Real Time data acquisition from SAP. This final blog is co-written with Arno Luijten, who is one of Teradata’s lead engineers. He is instrumental in demystifying the secrets of the elusive SAP clustered and pooled tables.

There is a common misconception that the Pool and Cluster tables in SAP R/3 can only be deciphered by the SAP R/3 application server, giving them an almost mythical status. The phrase that is used all over the ‘help sites’ and forums is “A clustered and a pooled table cannot be read from outside SAP because certain data are clustered and pooled in one field”… which makes replicating these tables pretty pointless – right?

But what exactly are Pooled and Cluster tables in SAP R/3 anyway? We thought we would let SAP give us the answer and searched their public help pages (SAP Help Portal). That yielded limited results, so we looked further (a quick Google search for cluster tables) and found the following explanation (Techopedia link):

“Cluster tables are special types of tables present in the SAP data dictionary. They are logical tables maintained as records of the normal SAP tables, which are commonly known as transparent tables. A key advantage of using cluster tables is that data is stored in a compressed format, reducing memory space and the landscape network load for retrieving information from these tables.”

Reading further on the same page, there are six major bullet points describing the features, five of which basically tell you that what we did cannot be done. Luckily, we didn’t let this faze us!

We agree: the purpose of SAP cluster tables is to save space, given the huge volume of data these tables contain and the potential negative impact that volume may have on the SAP R/3 application. The two most (in)famously large cluster tables are RFBLG and KOCLU, which contain the financial transactions and price conditions; SAP’s ABAP programmers refer to them as BSEG (for financial transactions) and KONV (for the price conditions).

From the database point of view, BSEG and KONV do not exist as tables; their contents are held inside the physical tables named RFBLG and KOCLU. Typically these (ABAP) tables contain a lot of data. There are more tables set up in this way, but from a data warehousing point of view these two are probably the most relevant. Simply skipping these tables would not be a viable option for most clients.

Knowing the importance of the Pool and Cluster tables, the value of data replication, and the value of operational analytics, we forged ahead with a solution. The encoded data from the ABAP table is stored as a BLOB (Binary Large Object) in the actual cluster table. To decode the data in the BLOB we wrote a C++ program as a Teradata User Defined Function (UDF), which we call the “Decoder”, and it is installed directly within the Teradata database.
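The Decoder itself is proprietary C++ running inside the Teradata database, so it can’t be reproduced here. Purely to illustrate the shape of the problem, here is a minimal Python sketch that assumes a hypothetical cluster payload: a compressed byte stream of fixed-width fields described by a small layout table. The real SAP encoding, compression scheme and field layout are different and are derived from the ABAP Data Dictionary.

```python
import zlib  # stand-in only; SAP's cluster compression is proprietary

# Hypothetical fixed-width layout of the decoded records; the real layout
# comes from the ABAP Data Dictionary definition of the logical table (e.g. BSEG).
LAYOUT = [("BELNR", 10), ("BUZEI", 3), ("WRBTR", 15)]

def decode_cluster_blob(blob: bytes):
    """Illustration only: decompress a cluster BLOB and slice the payload
    into fixed-width fields, yielding one dict per logical row."""
    payload = zlib.decompress(blob)          # stand-in for SAP's scheme
    rec_len = sum(width for _, width in LAYOUT)
    for offset in range(0, len(payload) - rec_len + 1, rec_len):
        record = payload[offset:offset + rec_len]
        row, pos = {}, 0
        for name, width in LAYOUT:
            row[name] = record[pos:pos + width].decode("latin-1").strip()
            pos += width
        yield row
```

In the real solution the equivalent decoding logic runs as a C++ UDF inside the Teradata database, so the work is spread across Teradata’s parallel units instead of going through the ABAP layer.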

There can be a huge volume of data present in the cluster tables (hence the usage of cluster logic) and as a result decoding can be a lot of work and can have an impact on the performance of the SAP system. Here we have an extra advantage over SAP R/3 because the Decoder effectively allows us to bypass the ABAP layer and use the power of the Teradata Database. Our MPP capabilities allow decoding to be done massively faster than the SAP application, so decoding the RFBLG/KOCLU tables in Teradata can save a lot of time.

Over the last few months I have written about data replication, starting with a brief SAP history; I have questioned real-time systems; and I have covered the benefits of data replication and how it is disruptive to analytics for SAP R/3.

In my last blog I looked at the technical complexities we have had to overcome to build a complete data replication solution into Teradata Analytics for SAP® Solutions. It has not been a trivial exercise - but the benefits are huge!

Our Data Replication capability enables operational reporting and managerial analytics from the same source; it increases flexibility, significantly reduces the burden on the SAP R/3 system(s), and of course, delivers SAP data in near-real time for analytics.

The Value of Big Data Unlocked

Posted on: March 17th, 2015 by Chris Twogood

 

It’s on every enterprise list of Things To Tackle in 2015. It’s every organization’s technological priority because it’s commonly considered important to future growth and competitive positioning. The value of big data is big news, without a doubt.

Even though most business executives think realizing benefits from the value of big data is long overdue, look at the low participation figures:

  • According to a recent IDG Enterprise survey, only 14 percent of respondents said that their enterprises had already deployed big data solutions3
  • Only 44 percent of enterprises report their organizations are in the planning or implementation stage of big data solutions1
  • A full 85 percent of executives surveyed reported facing significant obstacles in dealing with big data, including security issues, a shortage of trained staff, and the need to develop new internal capabilities.

Arguably, all the hesitation links back to the complexity of the data and finding the solutions to manage it. If there’s so much upside, why aren’t more companies further along in their efforts to exploit their big data? Most organizations have not yet acquired the technology or expertise required to unravel the complexity much less leverage data to its full potential.

Today, in an effort to realize real-world benefits from big data, semi-structured data such as social profiles and Twitter feeds joins unstructured data like images and PDFs to add intelligence to an organization’s structured data from its traditional databases. To further exacerbate the complexity of the situation, big data is generated at high velocity and collected at frequent intervals, making the volume of the new data types nearly unmanageable.

Additionally, businesses need to unlock existing data from silos and gain a holistic view of all the new information so that they can make unique associations and ask important questions about customers and products. They need a technology solution that integrates their data stores, identifies behavior patterns, and draws meaningful associations and inferences. It’s important to understand that the value of big data will go beyond sophisticated reporting. It will advance from historical insight to being highly predictive, enabling managers to make the best decisions possible.

Teradata has created an advanced solution which deals with all these complexities and hurdles. It seamlessly handles the variety, volume and velocity of the data in a unified data architecture. It overcomes the hurdles of difficult programming languages, extreme processing needs and customized data storage. Put simply, it provides a high-performance big data analytics system easily appreciated by both IT professionals and real-world business users.

Because Teradata’s Unified Data Architecture™ lets business users ingest and process data, it makes it faster to discover insights and act upon them.

With the majority of organizations just beginning to get their feet wet, there is still sizeable competitive advantage to be gained from unlocking insights from the almost limitless cache of data. Real world experiences reveal real world advantages:

  • Average Fortune 1000 companies can increase annual net income by $65.67 million with an increase of just 10 percent in data accessibility2
  • Top retailers have increased operating margins by 60 percent through monitoring customers’ in-store movements and combining that data with transaction records to determine optimal product placement, product mix and pricing3
  • The U.S. healthcare industry stands to add $300 billion in revenues by leveraging big data4
  • Financial institutions are reducing customer churn by using data analytics to evaluate consumer and criminal behavior.

A solution like Teradata’s Unified Data Architecture – where users can ask any question at any time to unlock new and valuable business insights – is a painless catalyst to discovering new competitive advantages and profit. Every organization desires higher productivity, lower costs, and an expanded horizon of new opportunities. It’s a big advantage to be able to open up discovery to users across the enterprise, not only the IT elite.

Learn more about Teradata’s Unified Data Architecture.

1. http://www.idgenterprise.com/press/big-data-initiatives-high-priority-for-enterprises-but-majority-will-face-implementation-challenges
2. http://www.forbes.com/sites/ciocentral/2012/07/09/will-big-data-actually-live-up-to-its-promise/2/
3. http://www.truaxis.com/blog/12764/big-profits-from-big-data/
4. http://www.information-management.com/news/big-data-ROI-Nucleus-automation-predictive-10022435-1.html

Hybrid Row-Column Stores: A General and Flexible Approach

Posted on: March 10th, 2015 by Daniel Abadi

 

During a recent meeting with a post-doc in my lab at Yale, he reminded me that this summer will mark the 10-year anniversary of the publication of C-Store in VLDB 2005. C-Store was by no means the first ever column-store database system (the column-store idea has been around since the 70s --- nearly as long as relational database systems), but it was quite possibly the first proposed architecture of a column-store designed for petabyte-scale data analysis. The C-Store paper has been extremely influential, with close to every major database vendor developing column-oriented extensions to their core database product in the past 10 years, with most of them citing C-Store (along with other influential systems) in their corresponding research white-papers about their column-oriented features.

Given my history with the C-Store project, I surprised a lot of people when some of my subsequent projects such as HadoopDB/Hadapt did not start with a column-oriented storage system from the beginning. For example, industry analyst Curt Monash repeatedly made fun of me on this topic (see, e.g. http://www.dbms2.com/2012/10/16/hadapt-version-2/).

In truth, my love and passion for column-stores has not diminished since 2005. I still believe that every analytical database system should have a column-oriented storage option. However, it is naïve to think that column-oriented storage is always the right solution. For some workloads --- especially those that scan most rows of a table but only a small subset of the columns --- column-stores are clearly preferable. On the other hand, there are many workloads that contain very selective predicates and require access to the entire tuple for those rows which pass the predicate. For such workloads, row-stores are clearly preferable.

There is thus general consensus in the database industry that a hybrid approach is needed. A database system should have both column-oriented and row-oriented storage options, and the optimal storage can be utilized depending on the expected workload.

Despite this consensus around the general idea of the need for a hybrid approach, there is a glaring lack of consensus about how to implement the hybrid approach. There have been many different proposals for how to build hybrid row/column-oriented database systems in the research and commercial literature. A sample of such proposals include:

(1) A fractured mirrors approach where the same data is replicated twice --- once in a column-oriented storage layer and once in a row-oriented storage layer. For any particular query, data is extracted from the optimal storage layer for that query, and processed by the execution engine.
(2) A column-oriented simulation within a row-store (see the sketch after this list). Let’s say table X contains n columns. X gets replaced by n new tables, where each new table contains two columns --- (1) a row-identifier column and (2) the column values for one of the n columns in the original table. Queries are processed by joining together, on the fly, the particular set of these two-column tables that correspond to the columns accessed by that query.
(3) A “PAX” approach where each page/block of data contains data for all columns of a table, but data is stored column-by-column within the page/block.
(4) A column-oriented index approach where the base data is stored in a row-store, but column-oriented storage and execution can be achieved through the use of indexes.
(5) A table-oriented hybrid approach where a database administrator (DBA) is given a choice to store each table row-by-row or column-by-column, and the DBA makes a decision based on how they expect the tables to be used.
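As a toy illustration of approach (2), the following Python/pandas sketch (with hypothetical table and column names) splits a three-column table into narrow two-column tables and answers a query that touches only two columns by joining just the tables it needs:

```python
import pandas as pd

# Original "row-store" table X with a synthetic row identifier.
X = pd.DataFrame({"store": ["A", "B", "A"],
                  "revenue": [100, 250, 75],
                  "product": ["wine", "beer", "wine"]})
X.index.name = "row_id"

# Simulate a column-store: one (row_id, column) table per original column.
narrow_tables = {col: X[[col]].reset_index() for col in X.columns}

# A query on store and revenue joins only those two narrow tables on row_id,
# never touching the product column at all.
result = narrow_tables["store"].merge(narrow_tables["revenue"], on="row_id")
print(result)
```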
In the rest of this post, I will overview Teradata’s elegant hybrid row/column-store design and attempt to explain why I believe it is more flexible than the above-mentioned approaches.

The flexibility of Teradata’s approach is characterized by three main contributions.

1: Teradata views the row-store vs. column-store debate as two extremes in a more general storage option space.

The row-store extreme stores each row contiguously on storage and the column-store extreme stores each column contiguously on storage. In other words, row-stores maintain locality of horizontal access of a table, and column-stores maintain locality of vertical access of a table. In general, however, the optimal access locality could be on a rectangular region of a table.

Figure 1: Row and Column Stores (uncompressed)

To understand this idea, take the following example. Many workloads have frequent predicates on date attributes. By partitioning the rows of a table according to date (e.g. one partition per day, week, month, quarter, or year), those queries that contain predicates on date can be accelerated by eliminating all partitions corresponding to dates outside the range of the query, thereby efficiently utilizing I/O to read in data only from those partitions that have data matching the requested date range.

However, different queries may analyze different table attributes for a given date range. For example, one query may examine the total revenue brought in per store in the last quarter, while another query may examine the most popular pairs of widgets bought together in each product category in the last quarter. The optimal storage layout for such queries would be to have store and revenue columns stored together in the same partition, and to have product and product category columns stored together in the same partition. Therefore we want both column-partitions (store and revenue in one partition and product and product category in a different partition) and row-partitions (by date).

This arbitrary partitioning of a table by both rows and columns results in a set of rectangular partitions, each partition containing a subset of rows and columns from the original table. This is far more flexible than a “pure” column-store that enforces that each column be stored in a different physical or virtual partition.

Note that allowing arbitrary rectangular partitioning of a table is a more general approach than a pure column-store or a pure row-store. A column-store is simply a special type of rectangular partitioning where each partition is a long, narrow rectangle around a single column of data. Row-oriented storage can also be simulated with special types of rectangles. Therefore, by supporting arbitrary rectangular partitioning, Teradata is able to support “pure” column-oriented storage, “pure” row-oriented storage, and many other types of storage between these two extremes.
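To make the rectangle idea concrete, here is a small Python sketch with a hypothetical partition map: rows are bucketed by calendar quarter and columns are grouped by which attributes tend to be queried together, so every cell of the table belongs to one rectangular partition. This is only an illustration of the concept; in Teradata the row and column partitioning is declared on the table itself, not computed by application code.

```python
from datetime import date

# Hypothetical column groups: attributes that are queried together share a partition.
COLUMN_GROUPS = {
    "store": "store_revenue", "revenue": "store_revenue",
    "product": "product_category", "category": "product_category",
}

def quarter_bucket(d: date) -> str:
    """Row-partition key: one bucket per calendar quarter."""
    return f"{d.year}Q{(d.month - 1) // 3 + 1}"

def partition_of(row_date: date, column: str) -> tuple:
    """Each cell lands in a rectangle identified by (row bucket, column group)."""
    return (quarter_bucket(row_date), COLUMN_GROUPS[column])

# Two values from the same row can live in different rectangles of the table.
print(partition_of(date(2015, 3, 10), "revenue"))   # ('2015Q1', 'store_revenue')
print(partition_of(date(2015, 3, 10), "category"))  # ('2015Q1', 'product_category')
```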

2: Teradata can physically store each rectangular partition in “row-format” or “column-format.”

One oft-cited advantage of column-stores is that for columns containing fixed-width values, the entire column can be represented as a single array of values. The row id for any particular element in the array can be determined directly by the index of the element within the array. Accessing a column in an array-format can lead to significant performance benefits, including reducing I/O and leveraging the SIMD instruction set on modern CPUs, since expression or predicate evaluation can occur in parallel on multiple array elements at once.

Another oft-cited advantage of column-stores (especially within my own research --- see e.g. http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf ) is that column-stores compress data much better than row-stores because there is more self-similarity (lower entropy) of data within a column than across columns, since each value within a column is drawn from the same attribute domain. Furthermore, it is not uncommon to see the same value repeat multiple times consecutively within a column, in which case the column can be compressed using run-length encoding --- a particularly useful type of compression since it both results in high compression ratios and is trivial to operate on directly, without requiring decompression of the data.
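To see why run-length encoding is attractive for columns with consecutive repeats, here is a minimal Python sketch (not Teradata’s implementation) that compresses a column and then answers a count directly on the compressed form:

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

status = ["OPEN", "OPEN", "OPEN", "CLOSED", "CLOSED", "OPEN"]
encoded = rle_encode(status)
print(encoded)            # [('OPEN', 3), ('CLOSED', 2), ('OPEN', 1)]

# Operate directly on the compressed form: count CLOSED rows without
# expanding back to one entry per row.
closed_rows = sum(length for value, length in encoded if value == "CLOSED")
print(closed_rows)        # 2
```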

Both of these advantages of column-stores are supported in Teradata when the column-format is used for storage within a partition. In particular, multiple values of a column (or a small group of columns) are stored continuously in an array within a Teradata data structure called a “container”. Each container comes with a header indicating the row identifier of the first value within the container, and the row identifiers of every other value in the container can be deduced by adding their relative position within the container to the row identifier of the first value. Each container is automatically compressed using the optimal column-oriented compression format for that data, including run-length encoding the data when possible.
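A toy Python model of the container idea described above, assuming an uncompressed fixed-width column: only the first row identifier is kept in the header, and any other row id maps to an array position by simple positional offsetting.

```python
class Container:
    """Toy column-format container: the values of one column packed into an
    array, with only the first row identifier stored in the header."""

    def __init__(self, first_row_id, values):
        self.first_row_id = first_row_id
        self.values = list(values)

    def get(self, row_id):
        # Row id -> array position by offsetting from the header's first row id.
        return self.values[row_id - self.first_row_id]

revenue = Container(first_row_id=1001, values=[100, 250, 75, 310])
print(revenue.get(1003))  # 75
```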

Figure 2: Column-format storage using containers.

However, one disadvantage of not physically storing the row identifier next to each value is that extraction of a value given a row identifier requires more work, since additional calculations must be performed to extract the correct value from the container. In some cases, these additional calculations involve just positional offsetting; however, in some cases, the compressed bits of the container have to be scanned in order to extract the correct value. Therefore Teradata also supports traditional row-format storage within each partition, where the row identifier is explicitly stored alongside any column values associated with that row. When partitions are stored using this “row format”, Teradata’s traditional mechanisms for quickly finding a row given a row identifier can be leveraged.

In general, when the rectangular partitioning scheme results in wide rectangles, row format storage is recommended, since the overhead of storing the row id with each row is amortized across the breadth of the row, and the benefits of array-oriented iteration through the data are minimal. But when the partitioning scheme results in narrow rectangles, column-format storage is recommended, in order to get the most out of column-oriented array iteration and compression. Either way --- having a choice between row format and column format for each partition further improves the flexibility of Teradata’s row/columnar hybrid scheme.

3: Teradata enables traditional primary indexing for quick row-access even when column-oriented partitioning is used.

Many column-stores do not support primary indexes due to the complexity involved in moving around records as a result of new inserts into the index. In contrast, Teradata Database 15.10 supports two types of primary indexing when a table has been partitioned to AMPs (logical servers) by the hash value of the primary index attribute. The first, called CPPI, maintains all row and column partitions on an AMP sorted by the hash value of the primary index attribute. These hash values are stored within the row identifier for the record, which enables each column partition to independently maintain the same sort order without explicitly communicating with each other. Since the data is sorted by the hash of the primary index attribute, finding particular records for a given value of the primary index attribute is extremely fast. The second, called CPPA, does not sort by the hash of the primary index attribute --- therefore the AMP that contains a particular record can be quickly identified given a value of the primary index attribute. However, further searching is necessary within the AMP to find the particular record. This searching is limited to the non-eliminated, nonempty column and row partitions. Finding a particular record given a row id for both CPPI and CPPA is extremely fast since, in either case, the records are in row id order.
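As a rough Python sketch of the hash-distribution idea behind both options (the hash function and AMP count here are hypothetical stand-ins; Teradata uses its own row-hash internally): the hash of the primary index value determines the owning AMP, and under CPPI the rows on each AMP, in every column partition, are additionally kept in row-hash order, which is what makes a lookup by primary index value a quick probe.

```python
import zlib

N_AMPS = 4  # hypothetical number of AMPs (logical servers)

def row_hash(primary_index_value: str) -> int:
    """Stand-in hash; Teradata uses its own row-hash function."""
    return zlib.crc32(primary_index_value.encode())

def amp_for(primary_index_value: str) -> int:
    """The hash of the primary index value picks the owning AMP."""
    return row_hash(primary_index_value) % N_AMPS

orders = ["ORD-1001", "ORD-1002", "ORD-1003"]
# Under CPPI each AMP keeps its rows ordered by this same row hash, so a
# lookup by primary index value becomes a fast, targeted probe.
for key in sorted(orders, key=row_hash):
    print(key, "->", "AMP", amp_for(key))
```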

Combined, these three features make Teradata’s hybrid solution to the row-store vs. column-store tradeoff extremely general and flexible. In fact, it’s possible to argue that there does not exist a more flexible hybrid solution from a major vendor on the market. Teradata has also developed significant flexibility inside its execution engine --- adapting to column-format vs. row-format input automatically, and using optimal query execution methods depending on the format-type that a particular query reads from.
====================================================================================


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

 

(Part 2 of a post illustrating how marketers are creatively leveraging big data to secure competitive advantages.)

Big data leveraged into insights has a strong likelihood of distinguishing organizations from their competitors. Because of the infancy of this movement, few big data insights to date have been turned into marketing advantages - so early entrants into big data marketing have a distinct advantage. Consider the following big data marketing examples for a view into how other early adopter enterprises have sought advantage from big data:

1. Next Generation Customer Retargeting

As big data analytics become more sophisticated, marketers will find better ways to retarget customers. Imagine, for example, retargeting based on items that are viewed online but not clicked on. This and other tactics will provide more customizable methods than the retargeting currently being used.

2. Use Heat Map Technology to Track In-Store Customer Preferences

Use on-premises camera systems with heat map technology to view in-store customer traffic – just as websites use technology to register online activity. This offline traffic information can be contrasted with online data to tell retailers how products perform online versus offline in order to adjust marketing programs.

3. Leverage Geospatial Data to Communicate with Customers

Use geospatial data to prepare targeted offers AND drive online customers to store locations. Wireless carriers have increased revenue per user with targeted marketing campaigns and combined offline and online marketing efforts.

4. Analyze Social Media to Increase Revenue

Use social network analysis to identify and impact influential customers. Wireless carriers have found that by implementing social analysis they can increase the revenue that their top 10 percent of influential customers impact – from 35 percent to an impressive 80 percent.

5. Focus On Conversions

Marketers should talk in the language of conversions and place their focus there. “What is the source of leads that has the highest conversion?” “What type of content inspires the strongest brand advocates?” “Which channels host the highest rate of conversions?” Use big data to inform and drive all aspects of conversions.

Look at 6 Ways Big Data Marketing Helps Companies Be Competitive: Part 1 for more examples of leveraging big data marketing.

 

6 Ways Big Data Marketing Helps Companies Be Competitive: Part 1

Posted on: February 24th, 2015 by Chris Twogood

 

Big data – business changing data – is giving marketers new ways to be innovative and step ahead of competitors. A creative strategy or advertising campaign is only scratching the surface of mechanisms available today to drive revenue. Effective CMOs must appreciate the power of new and diverse data sources and demand marketing directors interpret and use statistical business and customer insights to create smart strategies and quality predictive analysis.

Understanding some of the more clever big data marketing examples helps to illustrate how marketers should be thinking analytically and creatively with non-traditional data. Consider the following big data marketing examples:

1. Measure Social Media Impact

Companies can measure the impact of social media with custom analytics solutions or social network analysis.

2. Identify Your Brand Evangelists

Identify alpha influencers and use these individuals in active marketing campaigns. Find alpha influencers not just through traditional transactions (recent purchases, customer service calls) but also through social media.

3. Translate Big Data Insights into Actionable Marketing Tactics

Translate big data insights into actionable marketing tactics with teams of different disciplines. The most successful are teams that work fast and are highly iterative – business, IT, and analytics specialists rapidly review real-world findings, recalibrate analyses, adjust assumptions, and then test outcomes.

4. Create Customer Buying Projections

Use historic behavioral data for a defined target as an indicator for behavior against a different category of product offering. For example, test payment history or upgrade likelihood for a utility service as indicators of behavior for an entertainment offering or emerging credit offering. Test into success.

5. Understand the True Value of Different Marketing Channels

Combine sales data from traditional media and social-media sites to create a model that highlights the impact of traditional media versus activity reflected on social media (like call center interactions). Bad customer experiences are more powerful sales drivers than traditional media activity. Spending on improving customer service can therefore be more effective at driving revenue than funding advertising.

6. Pinpoint Sales Opportunities by Zip Code

Rather than overloading sales reps with reams of data and complex models to interpret, create powerful sales tools with simple, visual interfaces that pinpoint new-customer potential by zip code. It’s a proven tactic for increased sales.

Look for Part 2 for more clever examples of big data marketing. In the meantime, see other big data examples.

 

Data-Driven Design: Smart Modeling in the Fast Lane

Posted on: February 24th, 2015 by Guest Blogger

 

In this blog, I would like to discuss a different way of modeling data regardless of the method, whether Third Normal Form, Dimensional, or analytical datasets. This new way of data modeling will cut down development cycles by avoiding rework, keep the work agile, and produce higher quality solutions. It’s a discipline that looks at both requirements and data as input into the design.

A lot of organizations have struggled to get the data model correct, especially for applications, which has a big impact on different phases of the system development lifecycle. Generally, we elicit requirements first, where the IT team and business users together create a business requirements document (BRD).

Business users explain business rules and how source data should be transformed into something they can use and understand. We then create a data model using the BRD and produce technical requirements documentation, which is then used to develop the code. Sometimes it takes us over 9 months before we start looking at the source data. This delay in engaging the data almost always causes rework, since the design was based only on requirements. The other extreme is when a design is based only on data.

We have always based the design solely on either requirements or data, but hardly ever on both. We should give the business users what they want and yet be mindful of the realities of data.

It has been almost impossible to employ both methods, for different reasons: one is the traditional waterfall method, where BDUF (Big Design Up Front) is produced without ever looking at the data. Another is that we do work with data, but that data is created for proof of concept or testing and is far from the realities of production data. To do this correctly, we need JIT (Just in Time), or good enough, requirements and then we need to get into the data quickly and mold our design based on both the requirements and the data.

The idea is to get into the data quickly and validate the business rules and assumptions made by business users. Data-driven design is about engaging the data early. It is more than data profiling, as data-driven design inspects and adapts in context of the target design. As we model our design, we immediately begin loading data into it, often by day one or two of the sprint. That is the key.

Early in the sprint, data-driven design marries the perspective of the source data to the perspective of the business requirements to identify gaps, transformation needs, quality issues, and opportunities to expand our design. End users generally know about the day to day business but are not aware of the data.

The data-driven design concept can be used whether an organization is practicing waterfall or agile methodology. It obviously fits very nicely with the agile methodologies and Scrum principles such as inspect and adapt. We inspect the data and adapt the design accordingly. Using DDD we can test the coverage and fit of the target schema, from the analytical user perspective. By encouraging the design and testing of target schema using real data in quick, iterative cycles, the development team can ensure that target schema designed for implementation have been thoroughly reviewed, tested and approved by end-users before project build begins.

Case Study: While working with a mega-retailer on one of my projects, I was decomposing business questions. We were working with the promotions and discounts subject area and we had two metrics: Promotion Sales Amount and Commercial Sales Amount. Any item that was sold as part of a promotion counts towards Promotion Sales, and any item that is sold as regular counts towards Commercial Sales. Please note that Discount Amount and Promotion Sales Amount are two very different metrics. While decomposing, the business user described that each line item within a transaction (header) would have the discount amount evenly proportioned.

For example, let’s say there is a promotion where if you buy 3 bottles of wine you get 2 bottles free. In this case, according to the business user, the discount amount would be evenly proportioned across the 5 line items - thus indicating that these 5 line items are on promotion and we can count the sales of these 5 line items toward Promotion Sales Amount.

This wasn’t the case when the team validated this scenario against the data. We discovered that the discount amount was only present for the “get” items and not for the “buy” items. Using our example, discount amount was provided for the 2 free bottles (get) and not for 3 bottles (buy). This makes it hard to calculate Promotion Sales Amount for the 3 “buy” items since it wasn’t known if the customer just bought 3 items or 5 items unless we looked at all the records, which was in millions every day.

What if the customer bought 6 bottles of wine so ideally 5 lines are on promotion and the 6th line (diagram above) is commercial sales or regular sales? Looking at the source data there was no way of knowing which transaction lines are part of promotion and which aren’t.
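A tiny Python sketch of the gap the team hit, using hypothetical line items for the “buy 3, get 2 free” wine promotion: under the rule as the business user described it, every promoted line carries a share of the discount, but in the actual feed only the “get” lines do, so a discount-based flag silently undercounts Promotion Sales Amount.

```python
# Hypothetical "buy 3, get 2 free" wine transaction, one dict per line item.
assumed_feed = [  # as described: the discount is prorated evenly over all 5 lines
    {"item": "wine", "sales_amt": 10.0, "discount_amt": 4.0} for _ in range(5)
]
actual_feed = (   # as found in the data: the discount sits only on the 2 free bottles
    [{"item": "wine", "sales_amt": 10.0, "discount_amt": 0.0} for _ in range(3)]
    + [{"item": "wine", "sales_amt": 10.0, "discount_amt": 10.0} for _ in range(2)]
)

def promotion_sales(lines):
    """Business rule as stated: any line carrying a discount is promotional."""
    return sum(line["sales_amt"] for line in lines if line["discount_amt"] > 0)

print(promotion_sales(assumed_feed))  # 50.0 -- all five lines counted
print(promotion_sales(actual_feed))   # 20.0 -- the three "buy" lines are missed
```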

After this discovery, we had to let the business users know about the inaccuracy for calculating Promotion Sales Amount. Proactively, we designed a new fact to accommodate for the reality of data. There were more complicated scenarios that the team discovered that the business user hadn’t thought of.

In the example above, we had the same item for “buy” and “get”, which was wine. We found a scenario where a customer bought a 6-pack of beer and got a glass free. This further adds to the complexity. After validating the business rules against the source data, we had to request additional data for the “buy” and “get” lists to properly calculate Promotion Sales Amount.

Imagine finding out that you need additional source data to satisfy business requirements nine months into the project. Think about the change requests for the data model, development, testing, etc. With DDD, we found this out within days and adapted to the “data realities” within the same week. The team also discovered that the person at the POS system could either pick up one wine bottle and multiply it by 7, or “beep” each bottle one by one. This inconsistency makes a lot of difference, such as one record versus 7 records in the source feed.

There were other discoveries we made along the way as we got into the data and designed the target schema while keeping the reality of the data in mind. We were also able to ensure that the source system has the right available grain that the business users required.


Sachin Grover leads the Teradata Agile group within Teradata. He has been with Teradata for 5 years and has worked on development of Solution Modeling Building Blocks and helped define best practices for semantic data models on Teradata. He has over 10 years of experience in the IT industry as a BI / DW architect, modeler, designer, analyst, developer and tester.

Selecting a Big Data Solution: 5 Questions to Ask

Posted on: February 18th, 2015 by Chris Twogood

 

For years now, certain enterprises such as big-box retailers, online pioneers and consumer credit innovators have been successfully leveraging big data – to the point where these early adopter organizations can outperform competitors 2-to-1. They gain insights across their world – from their view of customers, to customer interactions and their perspective of the category.

With such a disparity in performance between the big data literate and the big data phobic confirmed by the top consulting firms, how can there still be a lack of momentum in moving toward the big data light? Experts advise almost unanimously that big data must be the “next big move” among enterprises to stay competitive and have an edge in getting ahead.

The big data terrain is still foreign and intimidating. Assembled here are 5 things to consider as you approach implementing a big data solution. They have been tailored to give you an eye for identifying the most competitive costs, shortest time to value and biggest results. Familiarize yourself with these concepts. Make them your questions to ask providers.

1. How will this big data solution handle the rush of data today and tomorrow?

Big data will race toward you with a staggering velocity, in great variety and with extreme volume. With regards to high velocity, ensure your ability to implement real-time processing or ad hoc queries. Handling high volume is a matter of the right hardware and infrastructure. Accommodating variety is more complicated and requires subject matter expertise. Consider both acquisition of big data and big data processes for getting the data into usable shape. Experts can leverage variety into a big success, but it can also be an opportunity for big failure.

2. What is the total cost of the big data solution?

Total costs include the initial implementation charges for hardware and software, and the cost for maintenance and support for the second year. Add in necessary labor costs...for data scientists, IT resources and analysts. Consider the necessary manpower to achieve the desired ROI for year one and two.

3. Is the estimated time to value acceptable?

Extracting rapid value from big data is not easy today. Businesses are challenged to find, hire and retain big data analytic professionals who can handle the implementation and management.

Big data solutions should be easy to implement and reduce time to value. The Teradata Aster Discovery Platform handles multi-structured data stores and offers 100+ pre-built analytics to quickly build big data apps. Included are visual functions for big data analytics and discovery.

4. What direct and indirect benefits should you expect from a big data solution?

Your organization should expect insights into increasing prospect conversions, reducing churn, upselling, improving customer experiences, marketing efficiency – all resulting in tangible benefits like increased revenue, efficiency or loyalty. Work with the big data solution provider to set realistic objectives like a lift in net profit margin for Year 1, Year 2, etc.

Enterprises should also discuss and expect increases in IT and end user productivity. Organizations have documented (with independent research firms) that as many as 20% of employees (IT and business) see a direct benefit of increased productivity from insight that can be quickly generated and implemented.

5. Are next generation short cuts or implementation aids available?

In your initial review of big data solutions and providers, compare offerings to determine if options like pre-built functions or applications or industry knowledgeable professional services are readily available and affordable. Search for means of significantly reducing the time to value, the ongoing labor costs and the magnitude of your return on investment.

Considering these factors will help ensure the fast and enduring success of your big data initiative so you can quickly take control of your organization’s competitiveness – in the era offering the biggest competitive growth opportunity in the last decade.

Get more insights into big data solutions.

What is Big Data?

Posted on: February 12th, 2015 by Chris Twogood

 

What is Big Data? It’s not as simple as saying social media posts are big data or sensor data is big data. And it’s not sufficient to say big data is just a lot of data.

Beyond the idea of large volumes of data...or a greater scope of data...big data refers to data sets that exceed the boundaries and sizes of normal processing capabilities. They force “non-traditional” processing and require new forms of integration.

Without a new method of integration, more efficient processing and new analytics capabilities, it’s not possible to uncover the large hidden values from these large datasets that are diverse, complex, and of a massive scale.

So, understanding the answer to the question, "What is big data?" involves more than being able to identify a data type. Understanding the movement includes knowing the characteristics and origins of the data, its volume, its velocity and all the accommodations made to properly leverage it. Understanding the value of big data means being able to see how it can deliver an insightful, aggressive positioning for your organization.

As far as the characteristics of the data, consider three different formats:

1. Structured data (or traditional data) gets its name because it resides in a fixed field within a record or file. Structured data has the advantage of being easily entered, stored, queried and analyzed. Previously, because of costs and performance limitations, relational databases were the only way to effectively manage data. Anything that couldn't fit into a tightly organized structure couldn’t be used.

2. Unstructured data usually refers to information that is not stored in a relational database or is not organized in a pre-defined manner. Unstructured data files are typically text-heavy, but may contain data such as dates, numbers, and multimedia content. Examples include e-mail messages, call center transcripts, forums, blogs, videos and social network postings.

3. Semi-structured data is a cross between the two. With semi-structured data, tags or other types of markers are used to identify certain elements, but the data itself doesn’t have a rigid structure. A text document, for example, can include metadata with an author’s name and creation date, while the bulk of the document is unstructured text. Emails have author, recipient, and time fields added to the unstructured content data. Semi-structured data is information that doesn’t reside in a relational database but that does have some organizational properties that make it easier to analyze.
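As a small, hypothetical illustration of the distinction, the e-mail below is semi-structured: the author, recipient and timestamp are tagged fields that can be queried like structured data, while the body remains free text that still needs text analytics.

```python
import json

# Hypothetical e-mail: tagged header fields wrapped around an unstructured body.
email = json.loads("""
{
  "from": "jane@example.com",
  "to": "support@example.com",
  "sent": "2015-02-12T09:30:00",
  "body": "My order arrived late and the packaging was damaged..."
}
""")

# The tagged fields behave like structured data...
print(email["from"], email["sent"])
# ...while the body is free text that needs further analysis to yield anything.
print(len(email["body"].split()), "words of free text")
```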

In real world or practical terms, review these examples where organizations have leveraged value from unstructured data and used it to give them an advantage over competitors:

You receive an e-mail containing an offer for a turnkey personal computer. You were exploring computers on that manufacturer’s web site just a few hours prior.

As you shop for homes on the web, you are served typical commute times to and from work for the homes you review. Drive times are determined by GPS signals from millions of drivers.

As far as understanding the question “What is big data?” goes: today, companies derive value from diverse data sources using the latest in advanced analytic innovation. Big data analytics deduce previously inaccessible insights to inform decisions that can be more advantageous and tailored. These more enlightened actions may radically change how management views its business – and therefore can allow for new competitive strategies.

Get more Big Data Insights.

Lots of Big Data Talk, Little Big Data Action

Posted on: February 11th, 2015 by Manan Goel

 

 Apps Are One Solution To Big Data Complexity

Offering big data apps is a great way for the analytics industry to put its muscle where its mouth is. Organizations face great hurdles in trying to benefit from the opportunities of big data.  Extracting rapid value from big data remains challenging.

To ease companies into realizing bankable big data benefits, Teradata has developed a collection of big data apps – pre-built templates that act as time-saving short cuts to value. Limited skill sets and complexity make it challenging for analytic professionals to rapidly and consistently derive actionable insights that can be easily operationalized.  Teradata is taking the lead in offering advanced analytic apps powered by Teradata Aster AppCenter to give sophisticated results from big data analytics.

The big data apps from Teradata are industry tailored analytical templates that address business challenges specific to the individual category. Purpose-built apps for retail address path to purchase and shopping cart abandonment.  Apps for healthcare map the paths to surgery and drug prescription affinity. Financial apps tackle omni-channel customer experiences and fraud.  The industries covered include consumer financial, entertainment and gaming, healthcare, manufacturing, retail, communications, travel and hospitality.

Big data apps are pre-built templates that can be further configured with help from Teradata professional services to address specific customer needs or goals.  Organizations have found that specialized big data analytic skills like Python, R, Java and MapReduce take time and require highly specialized manpower. Conversely, apps deliver fast time to value with self-service analytics. The purpose-built apps can be quickly deployed and configured/customized with minimal effort to deliver swift analytic value.

For app distribution, consumption and custom app development, the AppCenter makes big data analytics secure, scalable and repeatable by providing common services to build, deploy and consume apps.

With the apps and related solutions like AppCenter from Teradata, analytic professionals spend less time preparing data and more time doing discovery and iteration to find new insights and value.

Get more big data insights now!

 

 

Teradata Aster AppCenter: Reduce the Chasm of Data Science

Posted on: February 11th, 2015 by John Thuma

 

Data scientists are doing amazing things with data and analytics.  The data surface area is exploding, with new data sources being invented and exploited almost daily.  The Internet of Things is being realized; it is not just theory, it is in practice.  Tools and technology are making it easier for data scientists to develop solutions that impact organizations.  Rapid-fire methods for predicting churn, providing a personalized next best offer or predicting part failures are just some of the new insights being developed across a variety of industries.

But challenges remain.  Data science has a language and technique all of its own.  Strange terms like Machine Learning, Naïve Bayes, and Support Vector Machines are creeping into our organizations.  These topics can be very difficult to understand if you are not trained or have not spent time learning to perfect them.

There is a chasm between business and data science.  Reducing this gap and operationalizing big data analytics is paramount to the success of all Data Science efforts.  We must democratize and enable anyone to participate in big data discovery.  The Teradata Aster AppCenter is a big step forward in bridging the gap between data science and the rest of us.  The Teradata Aster AppCenter  makes big data analytics consumable by the masses.

Over the past two years I have personally worked on projects with organizations spanning various vertical industries.  I have engaged with hundreds of people across retail, insurance, government, pharmaceuticals, manufacturing, and others.  The one question that they all ask is: “John, I have people that can develop solutions with Aster; how do I integrate these solutions into my organization?  How can other people use these insights?”  Great questions!

I didn’t have an easy answer, but now I do. The Teradata Aster AppCenter provides a simple to use point and click web interface for consuming big data insights.  It wraps all the complexity and great work that Data Scientists do and gives it a simple interface that anyone can use.  It allows business people to have a conversation with their data like never before.  Data Scientists love it because it gives them a tool to showcase their solutions and their hard work.

Just the other day I deployed my first application in The Teradata Aster AppCenter.  I had never built one before, nor did I have any training or a phone a friend option.  I also didn’t want to have training because I am a technology skeptic.  Technology has to be easy to use.  So I put it to the test and here is what I found.

The interface is intuitive and I had a simple application deployed in 20 minutes.  Another 20 minutes went by and I had three visualization options embedded in my App.   I then constructed a custom user interface that provides drop down menus as options to make the application more flexible and interactive.  In that hour I built an application that anyone can use and they don’t have to know how to write a single line of code or be a technical unicorn.  I was blown away by the simplicity and power.   I am now able to deploy Teradata Aster solutions in minutes and publish them out to the masses.  The Teradata Aster AppCenter reduces the chasm between Data Science and the rest of us.

In conclusion, The Teradata Aster AppCenter passed my tests.  Please, don’t take my word for it, try it out.  Also, we have an abundance of videos, training materials, and templates on the way to guide your experience.  I am really looking forward to seeing new solutions developed and watching the evolution of the platform.  The Teradata Aster AppCenter gives Data Science a voice and a platform for Next Generation Analytic consumption.