
Real-Time SAP® Analytics: a look back & ahead

Posted on: August 18th, 2014 by Patrick Teunissen 1 Comment

 

On April 8, I hosted a webinar with my guest Neil Raden, an independent data warehouse analyst. The topic was “Accessing SAP ERP data for business analytics purposes,” building on Neil’s findings in his recent white paper on the complexities of integrating SAP data into the enterprise data warehouse. The attendance and participation clearly showed that there is a lot of interest and expertise in this space. Thinking back on the questions we received, both Neil and I were surprised by how many were related to “real-time analytics on SAP.”

Something has drastically changed in the SAP community!

Note: The topic of real-time analytics is not new! I won’t forget Neil’s reaction when the questions came up. It was as if he were in a time warp back to the early 2000s, when he first wrote about the topic. Interestingly, Neil’s work is still very relevant today.

This made me wonder: why is this so prominent in the SAP space now? What has changed in the SAP community? What has changed in the needs of the business?

My hypothesis is that when Neil originally wrote his paper (in 2003), R/3 was SAP (or SAP was R/3, whichever order you prefer) and integration with other applications or databases was not yet on SAP’s radar. This began to change when SAP BW became more popular, and it gained even more traction with the release of SAP’s suite of tools and modules (CRM, SRM, BPC, MDM, etc.) -- although these solutions still clearly had the true SAP ‘Made in Germany’ DNA. Then came SAP’s planning tool APO, Netweaver XI (later PI) and the 2007 acquisition of Business Objects (including BODS), all of which accelerated SAP’s application integration techniques.

With Netweaver XI/PI and Business Objects Data Services, it became possible to integrate SAP R/3 in real time, making use of advanced messaging techniques like IDocs, RFCs, and BAPIs. These techniques all work very well for transaction system integration (EAI); however, they do not have what it takes to provide real-time data feeds to the integrated data warehouse. At best a hybrid approach is possible. Back in 2000 my team worked on such a hybrid project at Hunter Douglas (Luxaflex), combining classical ABAP-driven batch loads for managerial reports with real-time capabilities (BAPI calls) for more operational reporting needs. That was state-of-the-art in those days!

Finally, in 2010 SAP acquired Sybase and added a best-of-breed data replication software tool to the portfolio. With this integration technique, changed data is captured directly from the database, taking the load off the R/3 application servers. This offers huge advantages, so it makes sense that this is now the recommended technique for loading data into the SAP HANA appliance.

“What has changed is that SAP has put the need for real-time data integration with R/3 on the (road) map!”

The main feature of our upcoming release of Teradata Analytics for SAP Solutions version 2.2 is a new data replication technique. Almost as if to prove my case: 10 years ago I was in the middle of a project for a large multinational company when one of my lead engineers, Arno Luijten, came to me with a proposal to try out a data replication tool to address the latencies introduced by extracting large volumes of changed data from SAP. We didn’t get very far at the time, because neither the technology nor the business expectations were ready for it. Fast forward to 2014 and we’re re-engaged with this same customer. Luckily, this time the business needs and the technology capabilities are ready to deliver!

In the coming months my team and I would like to take you on our SAP analytics journey.

In my next blogs we will dive into the definition (and relativity) of real-time analytics and discuss the technical complexities of dealing with SAP, including its pool and cluster tables. I hope I’ve got you hooked for the rest of the series!

Garbage In-Memory, Expensive Garbage

Posted on: July 7th, 2014 by Patrick Teunissen 2 Comments

 

A first anniversary is always special, and in May I marked my first with Teradata. In my previous lives I celebrated almost ten years with Shell and spent seventeen years creating my own businesses focused on data warehousing and business intelligence solutions for SAP. With my last business, NewFrontiers, I leveraged all twenty-seven years of ERP experience to develop a shrink-wrapped solution to enable SAP analytics.

In all that time, right up through my first anniversary with Teradata, the logical design of SAP has stayed the same. To be clear, when I say SAP, I mean R/3 -- or ‘R/2 with a mouse’ if you’re old enough to remember. Today R/3 is also known as the SAP Business Suite, ERP or whatever. Anyway, when I talk about SAP I mean the application that made the company rightfully world famous and that is used for transaction processing by almost all large multinational businesses.

My core responsibility at Teradata is the engineering of the analytical solution for SAP. My first order of business was focusing my team on delivering an end-to-end business analytics product suite, optimized for Teradata, for analyzing ERP data. Since completing our first release, my attention has turned to adding new features that help companies take their SAP analytics to the next level. To this end, my team is just putting the finishing touches on a near real-time capability based on data replication technology. This will definitely be the topic of upcoming blogs.

Over the past year, the integration and optimization work has greatly expanded my understanding of Teradata’s differentiated capabilities. The one capability that draws the attention of people like me (SAP guys and girls) is Teradata Intelligent Memory. In-memory computing has become a popular topic in the SAP community, and the computer’s main memory is an important part of Teradata Intelligent Memory. However, Intelligent Memory is more than “In-Memory”: it recognizes that not all memory is created equal and delivers a solution that uses the “right memory for the right purpose.” The most frequently used data -- the hottest -- is stored in memory; warm data is processed from solid state drives (SSD), and colder, less frequently accessed data from hard disk drives (HDD). This allows your business to make decisions on all of your SAP and non-SAP data while coupling in-memory performance with spinning-disk economics.
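
To make the “right memory for the right purpose” idea concrete, here is a deliberately simplified sketch of temperature-based placement: rank data by how often it is touched and assign the hottest pieces to the fastest tier that still has room. The table names, sizes, access counts and tier budgets are all made up for illustration, and this is not Teradata’s actual placement algorithm -- Intelligent Memory does this automatically and continuously inside the database.

```python
from dataclasses import dataclass

# Hypothetical tier budgets as a fraction of total data volume; illustrative only.
TIER_BUDGETS = [("memory", 0.03), ("ssd", 0.20), ("hdd", 1.00)]

@dataclass
class Extent:
    table: str
    size_gb: float
    weekly_accesses: int   # a simple proxy for data "temperature"

def place_extents(extents, total_gb):
    """Assign each extent to the fastest tier that still has room,
    hottest data first -- a toy version of temperature-based placement."""
    remaining = {tier: frac * total_gb for tier, frac in TIER_BUDGETS}
    placement = {}
    for ext in sorted(extents, key=lambda e: e.weekly_accesses, reverse=True):
        for tier, _ in TIER_BUDGETS:
            if remaining[tier] >= ext.size_gb:
                remaining[tier] -= ext.size_gb
                placement[ext.table] = tier
                break
    return placement

extents = [
    Extent("sales_current_year", 60, 5000),    # hot: queried constantly
    Extent("sales_last_year", 300, 800),       # warm: compared against budget/plan
    Extent("sales_2009_archive", 2400, 3),     # cold: rarely touched
]
print(place_extents(extents, total_gb=3000))
# {'sales_current_year': 'memory', 'sales_last_year': 'ssd', 'sales_2009_archive': 'hdd'}
```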

This concept of using the “right memory for the right purpose” is very compelling for our Teradata Analytics for SAP Solutions. When I explain what Teradata Analytics for SAP Solutions does, I often draw a line between DATA and CONTEXT. Computers need DATA like cars need fuel, and the CONTEXT is where you drive the car. Most people do not go to the same place every time, but they do go to some places more frequently than others (e.g. work, freeways, coffee shops) and under more time pressure (e.g. traffic).

In this analogy, organizations almost always start building an “SAP data warehouse” by loading all the DATA kept in the production database of the ERP system. We call that process the initial load. In the Teradata world we often have to do this multiple times, because building an integrated data warehouse usually involves sourcing from multiple SAP ERPs. Typically, these ERPs vary in age, history, version, governance, MDM, etc. Archival is a non-trivial process in the SAP world, and the majority of the SAP systems I have seen are carrying many years of old data. Loading all of this SAP data In-Memory is an expensive and reckless thing to do.

Teradata Intelligent Memory provides CONTEXT by storing the hot SAP data in memory, guaranteeing lightning-fast response times. It then automatically moves less frequently accessed data to lower-cost, lower-performance storage across the SSD and HDD spectrum. The resulting combination of Teradata Analytics for SAP and Teradata Intelligent Memory delivers in-memory performance with very high memory hit rates at a fraction of the cost of ‘In-Memory’ solutions. And in this business, costs are a huge priority.

The title of this blog is a variation on the good old “Garbage In, Garbage Out” (GIGO) phrase. In-Memory is a great feature, but not all data needs to go there! Use it in an intelligent way and don’t use it as a garbage dump; for that, it is far too expensive.

Patrick Teunissen is the Engineering Director at Teradata responsible for the Research & Development of the Teradata Analytics for SAP® Solutions at Teradata Labs in the Netherlands. He is the founder of NewFrontiers which was acquired by Teradata in May 2013.

Endnotes:
1 Needless to say I am referring to SAP’s HANA database developments.

2 Data that is older than 2 years can be classified as old. Transactions, like sales and costs, are often compared with a budget/plan and the previous year, sometimes with the year before that, but hardly ever with data older than that.

MongoDB and Teradata QueryGrid – Even Better Together

Posted on: June 19th, 2014 by Dan Graham 3 Comments

 

It wasn’t so long ago that NoSQL products were considered competitors to relational databases (RDBMS). Well, for some workloads they still are. But Teradata is an analytic RDBMS, which is quite different from and complementary to MongoDB. Hence, we are teaming up for the benefit of mutual customers.

The collaboration of MongoDB with Teradata represents a virtuous cycle, a symbiotic exchange of value. This virtuous cycle starts when data is exported from MongoDB to Teradata’s Data Warehouse where it is analyzed and enriched, then sent back to MongoDB to be exploited further. Let me give an example.

An eCommerce retailer builds a website to sell clothing, toys, etc. They use MongoDB because of the flexibility to manage constantly changing web pages, product offers, and marketing campaigns. This front office application exports JSON data to the back-office data warehouse throughout the business day. Automated processes analyze the data and enrich it, calculating next best offers, buyer propensities, consumer profitability scores, inventory depletions, dynamic discounts, and fraud detection. Managers and data scientists also sift through sales results looking for trends and opportunities using dashboards, predictive analytics, visualization, and OLAP. Throughout the day, the data warehouse sends analysis results back to MongoDB where they are used to enhance the visitor experience and improve sales. Then we do it again. It’s a cycle with positive benefits for the front and back office.
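
To make the cycle concrete, here is a minimal sketch of the loop described above, written against MongoDB’s Python driver (pymongo). The connection details, database, collection and field names are invented, and the scoring function is a toy stand-in for the analysis that would really happen in the data warehouse; the point is the shape of the flow: export, enrich, write back.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # connection details are illustrative
orders = client["shop"]["orders"]
offers = client["shop"]["next_best_offers"]

# 1) Export: pull completed order documents out of the front-office store.
#    In practice this JSON would be loaded into the data warehouse.
todays_orders = list(orders.find({"status": "completed"}))

# 2) Enrich: stand-in for the warehouse analytics (propensity scores,
#    next best offers, fraud checks). Here just a toy spend-based score.
def score(customer_orders):
    spend = sum(o.get("total", 0) for o in customer_orders)
    return {"propensity": min(spend / 1000.0, 1.0),
            "offer": "toys_bundle" if spend > 500 else "free_shipping"}

by_customer = {}
for o in todays_orders:
    by_customer.setdefault(o["customer_id"], []).append(o)

# 3) Write back: push the enriched results into MongoDB so the website
#    can personalize the next visit -- closing the virtuous cycle.
for customer_id, cust_orders in by_customer.items():
    offers.replace_one({"_id": customer_id},
                       {"_id": customer_id, **score(cust_orders)},
                       upsert=True)
```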

Teradata Data Warehouses have been used in this scenario many times with telecommunications, banks, retailers, and other companies. But several things are different working with MongoDB in this scenario. First, MongoDB uses JSON data. This is crucial for frequently changing data formats where new fields are added on a daily basis. Historically, RDBMSs did not support semi-structured JSON data. Furthermore, the process of changing a database schema to support frequently changing JSON formats took weeks to get through governance committees.

Nowadays, the Teradata Data Warehouse ingests native JSON and accesses it through simple SQL commands. Furthermore, once a field in a table is defined as JSON, frequently changing JSON structures flow right into the data warehouse without spending weeks in governance committees. Cool! This is a big and necessary step forward for the data warehouse. Teradata Data Warehouses can ingest and analyze JSON data easily using any BI or ETL tool our customers prefer.
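
For a feel of what that looks like from an application, here is a sketch that declares a JSON column and pulls one attribute out of it with SQL. I’m assuming the teradatasql Python driver and made-up table, column and connection values; the JSON column type and the dot-notation JSONExtractValue method follow Teradata’s documented JSON support, but treat the exact syntax as illustrative and check the documentation for your release.

```python
import teradatasql  # Teradata's Python driver; connection values below are placeholders

ddl = """
CREATE TABLE web_events (
    event_id   INTEGER,
    event_ts   TIMESTAMP,
    payload    JSON(16000)      -- semi-structured event document, stored as-is
)
"""

# Dot-notation extraction of a field inside the JSON document; no schema change
# is needed when the documents gain new fields.
query = """
SELECT event_id,
       payload.JSONExtractValue('$.customer.segment') AS segment
FROM   web_events
WHERE  event_ts > CURRENT_TIMESTAMP - INTERVAL '1' DAY
"""

with teradatasql.connect(host="dw.example.com", user="etl", password="***") as con:
    with con.cursor() as cur:
        cur.execute(ddl)
        cur.execute(query)
        for row in cur.fetchall():
            print(row)
```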

Another difference is that MongoDB is a scale-out system, growing to tens or hundreds of server nodes in a cluster. Hmmm. Teradata systems are also scale-out systems. So how would you exchange data between Teradata Data Warehouse server nodes and MongoDB server nodes? The simple answer is to export JSON to flat files and import them to the other system. Mutual customers are already doing this. Can we do better than import/export? Can we add an interactive dynamic data exchange? Yes, and this is the near term goal of our partnership --connecting Teradata QueryGrid to MongoDB clusters.

Teradata QueryGrid and Mongo DB

Teradata QueryGrid is a capability in the data warehouse that allows a business user to issue requests via popular business intelligence tools such as SAS®, Tableau®, or MicroStrategy®. The user issues a query which runs inside the Teradata Data Warehouse. This query reaches across the network to the MongoDB cluster. JSON data is brought back, joined to relational tables, sorted, summarized, analyzed, and displayed to the business user. All of this is done exceptionally fast and completely invisible to the business user. It’s easy! We like easy.

QueryGrid can also be bi-directional, putting the results of an analysis back into the MongoDB server nodes. The two companies are working on hooking up Teradata QueryGrid right now and we expect to have the solution early in 2015.

The business benefit of connecting Teradata QueryGrid to MongoDB is that data can be exchanged in near real time. That is, a business user can run a query that exchanges data with MongoDB in seconds (or a few minutes if the data volume is huge). This means new promotions and pricing can be deployed from the data warehouse to MongoDB with a few mouse clicks. It means marketing people can analyze consumer behavior on the retail website throughout the day, making adjustments that increase sales minutes later. And of course, applications involving mobile phones, sensors, banking, telecommunications, healthcare and more will get value from this partnership too.

So why does the leading NoSQL vendor partner with the best in class analytic RDBMS? Because they are highly complementary solutions that together provide a virtuous cycle of value to each other. MongoDB and Teradata are already working together well in some sites. And soon we will do even better.

Come visit our Booth at MongoDB World and attend the session “The Top 5 Things to Know About Integrating MongoDB into Your Data Warehouse” Riverside Suite, 3:10 p.m., June 24. You can read more about the partnership between Teradata and MongoDB in this news release issued earlier today. Also, check out the MongoDB blog.

PS: The MongoDB people have been outstanding to work with on all levels. Kudos to Edouard, Max, Sandeep, Rebecca, and others. Great people!

 

It happens every few years and it’s happening again. A new technology comes along and a significant segment of the IT and business community wants to toss out everything we’ve learned over the past 60 years and start fresh. We “discover” that we’ve been wasting time applying unnecessary rigor and bureaucracy to our projects. No longer should we have to wait three to six months or longer to deliver technical solutions to the business. We can turn these things around in three to six days or even less.

In the mid-1990s, I was part of a team that developed a “pilot” object-oriented, client-server (remember when these were the hot buzzwords?) application to replenish raw materials for a manufacturing function. We were upending the traditional mainframe world by delivering a solution quickly and iteratively with a small team. When the end users started using the application in real life, it was clear they were going to rely on it to do their jobs every day. Wait, was this a pilot or…? I would come into work in the morning, walk into a special room that housed the application and database servers, check the logs, note any errors, make whatever fixes needed to be made, re-run jobs, and so on.

It wasn’t long before this work began to interfere with my next project, and the end users became frustrated when I wasn’t available to fix problems quickly. It took us a while and several conversations with operations to determine that “production” didn’t just mean “the mainframe”. “Production” meant that people were relying on the solution on a regular basis to do their jobs. So we backtracked and started talking about what kind of availability guarantees we could make, how backup and recovery should work, how we could transition monitoring and maintenance to operations, and so on. In other words, we realized what we needed was a traditional IT project that just happened to leverage newer technologies.

This same scenario is happening today with Hadoop and related tools. When I visit client organizations, a frightening number will have at least one serious person saying something like, “I really don’t think ‘data warehousing’ makes sense any more. It takes too long. We should put all our data in Hadoop and let our end users access whatever they want.” It is indeed a great idea to establish an environment that enables exploration and quick-turnaround analysis against raw data and production data. But to position this approach as a core data and analytics strategy is nothing short of professional malpractice.

The problem is that people are confusing experimentation with IT projects. There is a place for both, and there always has been. Experimentation (or discovery, research, ad-hoc analysis, or whatever term you wish to use) should have lightweight processes and data management practices – it requires prioritization of analysis activity, security and privacy policies and implementation, some understanding of available data, and so on, but it should not be overburdened with the typical rigor required of projects that are building solutions destined for production. Once a prototype is ready to be used on a regular basis for important business functions, that solution should be built through a rigorous IT project leveraging an appropriate – dare I say it – solution development life cycle (SDLC), along with a comprehensive enterprise architecture plan including, yes, a data warehouse that provides integrated, shared, and trusted production data.

An experimental prototype should never be “promoted” to a production environment. That’s what a project is for. Experimentation can be accomplished with Hadoop, relational technology, Microsoft Office, and many other technologies. These same technologies can also be used for production solutions. So, it’s not that “things are done differently and more quickly in Hadoop”. Instead, it’s more appropriate to say that experimentation is different than an IT project, regardless of technology.

Yes, we should do everything we can to reduce unnecessary paperwork and to speed up delivery using proper objective setting, scoping, and agile development techniques. But that is different than abandoning rigor altogether. In fact, using newer technologies in IT projects requires more attention to detail, not less, because we have to take the maturity of the technology into consideration. Can it meet the service level needs of a particular solution? This needs to be asked and examined formally within the project.

Attempting to build production solutions using ad-hoc, experimental data preparation and analysis techniques is like building a modern skyscraper with a grass hut mentality. It just doesn’t make any sense.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries. 

How $illy is Cost per Terabyte?

Posted on: May 16th, 2014 by Dan Graham No Comments

 

Without question, the best price per terabyte anywhere in the technology industry is the home PC. You can get a Dell® PC at about $400 and it comes with a terabyte disk drive. WOW! I found one PC for $319 per TB! Teradata, Oracle, IBM, and all the other vendors are headed for the scrap heap of history with those kinds of prices. I’m sending out my resume in the morning. . . How silly is that? Yet when comparing massively parallel database computers – the culmination of 50 years of data processing innovation – many organizations overemphasize $/TB and disregard total value. They hammer the vendors to lower the price, lower the price, until – you guessed it – the vendors hit the right price by also lowering the value. This reached a crescendo over the last few years following the worldwide recession. Saving money became much more important than producing business value. I get it – a corporation runs on cost containment and revenue generation. As it turns out, a data warehouse is a vital tool enabling both business objectives – especially in hard economic times.

I understand why CFOs and procurement people obsess on dollars per terabyte. They can’t understand all the technical geek-speak but they do know that hollering about cost per terabyte makes vendors and CIOs scramble. OK, that seems worthwhile, but there is a flaw in this thinking when $/TB is the first and foremost buying criterion.

By analogy, would you buy a car based on price alone? No. Even if you are strapped for money, you search for features and value among the cars that are affordable. Price is one decision point, not THE decision maker. I always buy a little beyond my means to get the highest quality. Purchase price is a point-in-time angst, but I have to live with that car for years. It’s never failed me and I am always satisfied years later.

$/TB as Proxy for All the Value
System price is crucial at the beginning of a purchasing process to select candidates, and again at the end when real money is being exchanged. In between, there is often an assumption that candidate systems can all do the same job. Well, no two vendor systems are identical, especially massively parallel data warehouses. Indeed, they vary dramatically. But let’s assume for a moment that two vendor products are 80% equivalent in the workloads they can do and the labor it takes to manage them.

What is always lost in these comparisons is the actual performance of the queries as measured at the business user’s desk. Massively parallel databases are highly differentiated. Some are quite slow when compared to others. Some are lightning fast on table scans but choke when complex joins are needed. Some can only handle a dozen users at a time. Many flounder running mixed workloads. Some are good enough at simple queries on simple database designs, but collapse when complex queries are required. If you are just starting out, simple queries may be OK. But to become an analytic competitor, really complex queries inevitably become de rigueur. Plus, any successful analytic database project will see major expansions of user demands and query complexity over the first 3-5 years, then incremental growth after that. Or is it the other way around -- top quality analytic databases encourage users to ask more complex questions? Hmmm.

Performance Performance Performance
The primary purpose of databases has always been performance, performance, performance. Number two is high availability since performance is uninteresting when the system is offline. Over-emphasizing cost per terabyte drives out the value of performance. But if the buyer wants vendors to optimize for cost per terabyte, query performance and software features will be reduced to meet that goal.

This means having employees do extra work since the system is no longer doing it. It means user productivity and accuracy are reduced as dozens of data warehouse users take extra minutes to do what could have been done in seconds. It means not rerunning an analysis four times to improve accuracy because it takes too long. It means users interact less with the data and get fewer brilliant insights because it takes too long to try out ideas. And it means not getting that rush report to the executives by 11AM when they ask for it at 10:40. All of this angst is hard to measure, but the business user surely feels it.

The better metric has always been price/performance. Let me suggest an even more rounded (wink) view of buying criteria and priority:

[Chart: suggested buying criteria and their priority weightings]

No, today is not the day to delve deeply into the percentages on this chart. But suffice it to say they are derived from analyst house research and other sources I’ve witnessed over the years. And yes, they vary a few percentage points for every organization. Instead of price, TCO is dramatically more important to the CIO and CFO ‘who has to live with this car for years.’ Performance is vital to the business user – cut this back and you might ask “why pretend to have an analytic database since users will avoid running queries?” Features and functions are something the programmers and DBAs love and should not be overlooked.

Teradata – the Low Price Leader?
Changes in supplier costs and price pressures from the recent recession are producing bargains for data warehouse buyers. Take a look at Teradata list prices from 2Q2014.

[Table: Teradata platform list prices, 2Q2014]

Each Teradata platform described above includes Teradata quality hardware, the Teradata Database, utilities, and storage using uncompressed data. These are list prices, so let the negotiations begin! At $3.8K per terabyte, anyone can afford Teradata quality now.

Obviously you noticed the $34K/terabyte systems. Need I say that these are the most robust, highest performing systems in the data warehouse market? Both Gartner’s Magic Quadrant and Forrester’s Data Warehouse Wave assessments rate Teradata the top data warehouse vendor as of 1Q14. These systems support large user populations, millions of queries per day, integrated data, sub-second response time on many queries, row-level security, and dozens of applications per system. The Active Enterprise Data Warehouse is the top of the line with solid state disks, the fastest configuration, capacity on demand, and many other upscale capabilities. The Integrated Big Data Platform is plenty fast but not in the same class as the Active Enterprise Data Warehouse; there are a dozen great use cases for this cost-conscious machine, but 500 users with enormously complex queries won’t work on smaller configurations. Still, it quickly pays for itself.

Chant: Dollars per Terabyte, Dollars per Terabyte ...
The primary value proposition on the lips of the NoSQL and Hadoop vendors is always “cost per terabyte.” This is common with new products in new markets – we’ve heard it before from multiple startup MPP vendors. It’s impossible to charge top dollar for release 1.0 or 2.0 since they are still fairly incomplete. So when you have little in the way of differentiated value, dollars per terabyte is the chant. But is five-year-old open source software really equivalent to 30 years of R&D investment in relational database performance? Not.

I looked at InformationWeek’s article on “10 Hadoop Hardware Leaders” (4/24/2014), which includes the Dell R720XD servers as a leader in Hadoop hardware. Pricing out an R720XD on the Dell website, I found that a server with 128GB of memory and twelve 1.2TB disks comes in at $15,276. That’s $1,060 per terabyte. Cool. However, Hadoop needs two replicas of all data to provide basic high availability, which means you need to buy three nodes. That makes the cost per terabyte $3,182. Then you add some free software and lots of do-it-yourself labor. Seems to me that puts it in the same price band as the Integrated Big Data Platform. But the software on that machine is the same Teradata Database that runs on the Active Enterprise Data Warehouse. Sounds like a bargain to me!
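
For anyone who wants to check the math, the arithmetic folds into a few lines; the only inputs are the figures quoted above.

```python
# Reproducing the back-of-the-envelope arithmetic above, using only the quoted figures.
node_price = 15_276          # Dell R720XD as configured: 128GB memory, 12 x 1.2TB disks
raw_tb = 12 * 1.2            # 14.4 TB of raw disk per node

cost_per_raw_tb = node_price / raw_tb
print(f"${cost_per_raw_tb:,.0f} per raw TB")        # roughly the $1,060/TB quoted above

# Hadoop keeps two extra replicas of every block, so three nodes are needed
# to hold one node's worth of unique data; cost per usable TB triples.
nodes = 3
cost_per_usable_tb = nodes * node_price / raw_tb
print(f"${cost_per_usable_tb:,.0f} per usable TB")  # roughly the $3,182/TB quoted above
```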

Conclusion
Over-reliance on $/TB does bad things to your business users’ productivity. Startups always make this a gut-wrenching issue for customers to solve, but as their products mature, that noise fades into the background. I recommend a well-rounded assessment of any vendor product that serves many business users and needs.

OK, so now I’m hooking up 50 terabytes of storage to my whiz-bang 3.6GHz Intel® home office Dell PC. I’m anxious to know how long it will take to scan and sort 20 terabytes. I’ll let you know tomorrow, or the next day, or whenever it finishes.
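
As a hint at the punchline, here is a back-of-the-envelope estimate of just the scan, assuming a single consumer hard drive sustains about 150 MB/s of sequential throughput -- an assumption, and a generous one for one spindle doing nothing else. The sort would take considerably longer still.

```python
# A rough sanity check on the closing joke: how long does a single home PC take
# just to read 20 TB once? Assumes ~150 MB/s sustained sequential throughput for
# a consumer hard drive (an assumption).
terabytes = 20
throughput_mb_per_s = 150

seconds = terabytes * 1_000_000 / throughput_mb_per_s   # 1 TB = 1,000,000 MB
print(f"{seconds / 3600:.0f} hours just to scan the data once")  # about 37 hours
```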

Dan Graham is responsible for strategy, go-to-market success, and competitive differentiation for the Active Data Warehouse platform and Extreme Performance Appliances at Teradata.

 

Every self-respecting data management professional knows that “business alignment” is critical to the success of a data and analytics program. But what does business alignment really mean? How do you know if your program is aligned to the business?

Before describing what business alignment is, let me first list what it is not:
• Interviewing end users to understand their needs for data and analytics
• Recruiting a highly placed and influential executive sponsor
• Documenting a high return on investment
• Gaining agreement on the data strategy from multiple business areas
• Establishing a business-led data governance program
• Establishing a process to prioritize data requests and issues

It’s not that the items on this list are bad ideas. It’s just that they are missing a key ingredient that, in my experience with dozens of clients, makes all the difference. None of these items are even the best first step in developing a data strategy.

So what’s wrong with the list? Let me illustrate with an example. I was working with a team developing a data strategy for a large manufacturing company. We were interviewing a couple of high level managers in marketing, and it went something like this:

Me: What are some of the major business initiatives that you’re expected to deliver this year and next year? Do you have some thoughts on the data and analytics that will be needed within those initiatives?

Marketing manager: Sure, well, we have this targeted marketing initiative that we think will be a big winner. When a customer contacts us for warranty information, we think we can cross-sell products from another business unit… here’s a spreadsheet… we’ve calculated that this will bring back $14 million in additional revenue every year. We’re so excited that you’re doing the data warehouse initiative… We’ve been proposing this marketing idea for the last four years and haven’t been able to get it approved, and now we can finally get it done!

Me: I didn’t ask what you think the business initiatives should be; I asked you what they already are! (Ok, I really didn’t say it that way, but I wanted to.)

Why couldn’t they get the project approved? Who knows? Maybe the ROI was questionable. Maybe the idea wasn’t consistent with the company strategy and image. All that matters is that it was not approved, and hence makes for a lousy value proposition for a data and analytics program.

There is nothing wrong with proposing exciting, new “art of the possible” ways that data can bring value to the business. But an interesting proposal and an approved initiative are not the same thing. The difference is crucial, and data management leaders who don’t understand this difference are unlikely to be seen as trusted strategic advisors within their companies.

So what does it mean to be business aligned? Business alignment means being able to clearly state how deployment of data, analytics, and data management capabilities will directly support planned and approved (meaning funded) business initiatives.

So, the first step toward developing a successful data strategy is not to ask the end users what data they want. Instead, the first step is to simply find the top business initiatives. They are usually not hard to find. Very often, there are posters all over the place about these initiatives. There are a number of people in the organization you can check with to find top initiatives - the CIO, PMO leads, IT business liaisons, and contacts in the strategic planning department are examples of good places to start.

Then, you should examine the initiatives and determine the data and analytics that will be needed to make each initiative successful, especially looking for the same data needed by multiple projects across multiple initiatives. Core, enterprise data is usually needed by a diverse set of initiatives in slightly different form. For example, let’s say you work for a retailer and you identify approved projects for pricing optimization, labor planning, and marketing attribution. You can make a case that you will deploy the sales and product data these applications need, in the condition needed, in the time frame needed.

Proceeding further, you can propose and champion a series of projects that deliver the data needed by various initiatives. By doing this, along with establishing architecture and design principles of scalability and extensibility, you harness the energy of high-priority projects (instead of running away from it) to make your business case, add value by supporting pre-vetted initiatives, and build a foundation of integrated and trusted data step by step, project by project. Once this plan is established and in motion, you can accurately state that your program is absolutely needed by the business and that you are deploying data the right way – and you can also say that your program is officially business aligned.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries.

 

In the Star Trek movies, “the Borg” refers to an alien race that conquers all planets, absorbing the people, technology, and resources into the Borg collective. Even Captain Picard becomes a Borg and chants “We are the Borg. You will be assimilated. Resistance is futile.”

It strikes me that the relational database has behaved similarly since its birth. Over the last thirty years, Teradata and other RDBMS vendors have innovated and modernized, constantly revitalizing what it means to be an RDBMS. But some innovations come from start-up companies that are later assimilated into the RDBMS. And some innovations are reactions to competition. Regardless, many innovations eventually end up in the code base of multiple RDBMS vendor products --with proper respect to patents of course. Here are some examples of cool technologies assimilated into Teradata Database:

• MOLAP cubes storm the market in the late 1990s, with Essbase setting the pace and Cognos inventing desktop cubes. MicroStrategy and Teradata team up to push ROLAP SQL down into the database for parallel speed. Hyperion Essbase and Teradata also did Hybrid OLAP integration together. Essbase gets acquired, MOLAP cubes fall out of fashion, and in-database ROLAP goes on to provide the best of both worlds as CPUs get faster.

• Early in the 2000s, a startup called Sunopsis shows a distinct advantage of running ELT transformations in-database to get parallel performance with Teradata. ELT takes off in the industry like a rocket. Teradata Labs also collaborates with Informatica to push-down PowerCenter transformation logic into SQL for amazing extract, load, and transform speed. Sunopsis gets acquired. More ETL vendors adopt ELT techniques. Happy DBAs and operations managers meet their nightly batch performance goals. More startups disappear.

• XML and XQuery become the rage in the press -- until almost every RDBMS adds a data type for XML, plus shred and unshred operators. XML-only database startups are marginalized.

• The uptick of predictive analytics in the market drives collaboration between Teradata and SAS back in 2007. SAS Procs are pushed down into the database to run massively parallel, opening up tremendous performance benefits for SAS users. Many RDBMS vendors copy this technique; SAS is in the limelight, and eventually even Hadoop programmers want to run SAS in parallel. Later we see “R,” Fuzzy Logix, and others run in-database too. Sounds like the proverbial win-win to me.

• In-memory technology from QlikView and TIBCO Spotfire excites the market with order-of-magnitude performance gains. Several RDBMS vendors then adopt in-memory concepts. But in-memory has limitations on memory size and cost vis-à-vis terabytes of data. Consequently, Teradata introduces Teradata Intelligent Memory, which caches hot data automatically in memory while managing many terabytes of hot and cold data on disk. The two to three percent of the data that is hottest (that is, most popular with users) is managed by data temperature, delivering superfast response time. Cool! Or is it hot?

• After reading the Google research paper on MapReduce, a startup called Aster Data invents SQL-MapReduce (SQL-MR) to add flexible processing to a flexible database engine. This cool innovation leads Teradata to acquire Aster Data. Within a year, Aster strikes a nerve across the industry – MapReduce is in-database! This month, Aster earns numerous #1 scores in Ovum’s “Decision Matrix: Selecting an Analytic Database 2013-14” (January 2014). The race is on for MapReduce in-database!

• The NoSQL community grabs headlines with their unique designs and reliance on JSON data and key-value pairs. MongoDB is hot, using JSON data, while Couchbase and Cassandra leverage key-value stores. Teradata promptly decides to add JSON (semi-structured) data to the database and goes the extra mile to put JSONPath syntax into SQL. Teradata also adds the name-value-pair SQL operator (NVP) to extract JSON or key-value store data from weblogs. Schema-on-read technology gets assimilated into the Teradata Database. Java programmers are pleased. Customers make plans. More wins.

--------------------------------------------------------------------------------------------------------

“One trend to watch going forward, in addition to the rise of multi-model NoSQL databases, is the integration of NoSQL concepts into relational databases. One of the methods used in the past by relational database vendors to restrict the adoption of new databases to handle new data formats has been to embrace those formats within the relational database. Two prime examples would be support for XML and object-oriented programming.”
- Matt Aslett, The 451 Group, Next-Generation Operational Databases 2012-2016, Sep 17, 2013

--------------------------------------------------------------------------------------------------------

I’ve had conversations with other industry analysts and they’ve confirmed Matt’s opinion: RDBMS vendors will respond to market trends, innovations, and competitive threats by integrating those technologies into their offering. Unlike the Borg, a lot of these assimilations by RDBMS are friendly collaborations (MicroStrategy, Informatica, SAS, Fuzzy Logix, Revolution R, etc.). Others are just the recognition of new data types that need to be in the database (JSON, XML, BLOBs, geospatial, etc.).

Why is it good to have all these innovations inside the major RDBMSs? Everyone is having fun right now with their science projects because hype is very high for this startup or that startup or this shiny new thing. But when it comes time to deploy production analytic applications to hundreds or thousands of users, all the “ities” become critical all of a sudden – “ities” that the new kids don’t have and the RDBMS does: reliability, recoverability, security, and availability. Companies like Google can bury shiny new 1.oh-my-god quality software in an army of brilliant computer scientists. But Main Street and Wall Street companies cannot.

More important, many people are doing new multi-structured data projects in isolation -- such as weblog analysis, sensor data, graph analysis, or social text analysis. Soon enough they discover the highest value comes from combining that data with all the rest of the data that the organization has collected on customers, inventories, campaigns, financials, etc. Great, I found a new segment of buyer preferences. What does that mean to campaigns, sales, and inventory? Integrating new big data into an RDBMS is a huge win going forward – much better than keeping the different data sets isolated in the basement.

Like this year’s new BMW or Lexus, RDBMSs modernize; they define modern. But relational database systems don’t grow old; they don’t rust or wear out. RDBMSs evolve to stay current and constantly introduce new technology.

We are the RDBMS! Technology will be assimilated. Resistance is futile.

Evaluating and Planning for the Real Costs of Big Data

Posted on: January 16th, 2014 by Dan Graham No Comments

 

In a blog I posted in early December, I talked about the total cost of big data. That post, and today’s follow-up post, stem from a webinar that I moderated between Richard Winter, President of WinterCorp, which specializes in massive databases, and Bob Page, VP of Products at Hortonworks. During the webinar we discussed how to successfully calibrate and calculate the total cost of data and walked through important lessons related to the costs of running workloads on various platforms, including Hadoop. If you haven’t listened to the webinar yet, I recommend you do so.

From the discussion we had during that session and from resulting conversations I have had since, I wanted to address some of the key takeaways we discussed about how to be successful when tackling such a large challenge within your organization. Here are a few key points to consider:

1. Start Small: As Bob Page said, “It’s very easy to dream big and go overboard with these projects, but the key to success is starting small.” Have your first project be a straightforward proof of concept. There are undoubtedly going to be challenges when you are starting your first big data project, but if you can start at a smaller level and build your knowledge and capabilities, your odds of success for the larger projects improve. Don’t make your first venture out of the gate an attempt at a gargantuan project or a huge amount of data. When you have some positive results, you will also have the confidence and sanction to build bigger solutions.

2. Address the Entire Scope of Costs: Rather than focusing only on upfront purchase costs, a total cost of data evaluation must incorporate all possible costs, reflecting an estimate of owning and using data over time for analytic purposes. The framework that Richard developed allows you to do exactly that: it estimates the total cost of a big data initiative. During the webinar, Richard discussed the five components of system costs:

  • the hardware acquisition costs
  • the software acquisition costs
  • what you pay for support
  • what you pay for upgrades
  • and what you pay for environmental/infrastructure costs – power and cooling.

According to Richard, we need to estimate the CAPEX and OPEX over five years. Based on his extensive experience, he also recommends a moderate annual growth assumption of 26 percent in system capacity. In my experience, most data warehouses double in size every 3 years, so Richard is being conservative. The business goal, coupled with the CAPEX and OPEX thresholds year by year, helps keep the team focused. For many technical people, the TCOD planning seems like a burden, but it’s actually a career saver. If you can control the scope at a relatively low level and leverage a tool such as Richard’s framework, you have a higher chance of being successful.
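
To see what that growth assumption implies, here is a quick projection. The 100 TB starting capacity is a made-up number, so substitute your own; the five-year horizon and 26 percent growth rate come straight from the webinar discussion, and 26 percent a year compounds to roughly a doubling every three years (1.26**3 ≈ 2.0).

```python
# Projecting system capacity under the webinar's planning assumptions:
# five-year horizon, 26% annual growth. The 100 TB starting point is
# hypothetical; plug in your own initial capacity and cost model.
start_tb = 100
growth = 0.26
years = 5

capacity = start_tb
for year in range(1, years + 1):
    capacity *= 1 + growth
    print(f"Year {year}: {capacity:,.0f} TB")
# Year 5 ends near 318 TB -- roughly 3.2x the starting capacity, which is why
# CAPEX and OPEX need to be estimated year by year rather than only at purchase time.
```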

3. Comparison Shop: Executives want to know the total cost of carrying out a large project, whether it is on a data warehouse or on Hadoop. Having the ability to compare overall costs between the two systems is important to the internal success of the project and to the success of future projects being evaluated as well. Before you can compare anything, it is important to identify a real workload that your business and the executive team can consider funding. The real workload focuses the comparisons, as opposed to generalizations and guesses. At some point a big data platform selection will generate two analyses you need to work through: 1) what is this workload costing? and 2) which platform can technically accomplish the goals more easily? Lastly, in a perfect world, the business users should also be able to showcase the business value of the workload.

4. Align Your Stakeholders: Many believe that 60 percent of the work in a project should be in the planning and 40 percent in the execution. In order to evaluate your big data project appropriately, you must incorporate as many variables as possible. It’s the surprises, and the stakeholders who weren’t aligned, that cause a lot of the big cost overruns. Knowing your assets and stakeholders is key to succeeding, which is why we recommend using the TCOD framework to get stakeholders to weigh in and achieve alignment on the overall plan. Next, leverage the results as a project plan that you can use toward achieving ROI. By leveraging a framework such as the one Richard discusses during the webinar, with each assumption, each formula and each of the costs exposed (in Richard’s there are 60 different costs outlined!), you can identify much more easily where the costs differ and – more importantly – why. The TCOD framework can bring stakeholders into the decision-making process, forming a committed team instead of bystanders and skeptics.

5. Focus on Data Management: One of the things that both of our esteemed webinar guests pointed out is the importance of the number of people and applications accessing big data simultaneously. Data is typically the lifeblood of the organization. This includes accessing live information about what is happening now, as well as accurate reporting at the end of the day, month, and quarter. There is a wide spectrum of use cases, each spanning a wide variety of data types. If you haven’t actually built a 100-terabyte database or distributed file system before, be ready for some painful “character building” surprises. Be ready again at 500TB, at a petabyte, and at 5 petabytes. Big data volumes are like the difference between a short weekend hike and making it past base camp on Mount Everest. Your data management skills will be tested.

During the webinar, our experts all agreed: there is a peaceful coexistence that can happen between Hadoop and the data warehouse. They should be applied to the right workloads and share data as often as possible. When a workload is defined, it becomes clear that some data belongs in the data warehouse while other types of data may be more appropriate in Hadoop. Once you have put your data into its enterprise residence, each platform will feed its various applications.

In conclusion, being able to leverage a framework, such as the TCOD one that was discussed during the webinar, really lends itself to having a solid plan when approaching your big data challenges and to ultimately solving them.

Here are some additional resources for further information:

Total Cost of Data Webinar

Big Data—What Does It Really Cost? (white paper)

The Real Cost of Big Data (Spreadsheet)

TCOD presentation slides (PDF)

Big Apple Hosts the Final Big Analytics Roadshow of the Year

Posted on: November 26th, 2013 by Teradata Aster No Comments

 

Speaking of ending things on a high note, New York City on December 6th will play host to the final event in the Big Analytics 2013 Roadshow series. Big Analytics 2013 New York is taking place at the Sheraton New York Hotel and Towers in the heart of Midtown on bustling 7th Avenue.

As we reflect on the illustrious journey of the Big Analytics 2013 Roadshow -- kicking off in San Francisco, traveling through major international destinations including Atlanta, Dallas, Beijing, Tokyo and London, and finally culminating in the Big Apple -- it truly encapsulated the appetite today for collecting, processing, understanding and analyzing data.


Big Analytics Roadshow 2013 stops in Atlanta

Drawing business & technical audiences across the globe, the roadshow afforded the attendees an opportunity to learn more about the convergence of technologies and methods like data science, digital marketing, data warehousing, Hadoop, and discovery platforms. Going beyond the “big data” hype, the event offered learning opportunities on how technologies and ideas combine to drive real business innovation. Our unyielding focus on results from data is truly what made the events so successful.

Continuing the rich lineage of delivering quality Big Data information, the New York event promises to pack in a tremendous amount of Big Data learning and education. The keynotes for the event include such industry luminaries as Dan Vesset, Program VP of Business Analytics at IDC; Tasso Argyros, Senior VP of Big Data at Teradata; and Peter Lee, Senior VP at TIBCO Software.


Teradata team at the Dallas Big Analytics Roadshow


The keynotes will be followed by three tracks: Big Data Architecture, Data Science & Discovery, and Data Driven Marketing. Each of these tracks will feature industry luminaries like Richard Winter of WinterCorp, John O’Brien of Radiant Advisors and John Lovett of Web Analytics Demystified. They will be joined by vendor presentations from Shaun Connolly of Hortonworks, Todd Talkington of Tableau and Brian Dirking of Alteryx.

As with every Big Analytics event, it presents an exciting opportunity to hear firsthand from leading organizations like Comcast, Gilt Groupe and Meredith Corporation on how they are using Big Data Analytics and Discovery to deliver tremendous business value.

In summary, the event promises to be nothing less than the Oscars of Big Data and will bring together the who’s who of the Big Data industry. So, mark your calendars, pack your bags and get ready to attend the biggest Big Data event of the year.

Teradata’s UDA is to Data as Prius is to Engines

Posted on: November 12th, 2013 by Teradata Aster No Comments

 

I’ve been working in the analytics and database market for 12 years. One of the most interesting pieces of that journey has been seeing how the market is ever-shifting. Both the technology and business trends during these short 12 years have massively changed not only the tech landscape today, but also the future of evolution of analytic technology. From a “buzz” perspective, I’ve seen “corporate initiatives” and “big ideas” come and go. Everything from “e-business intelligence,” which was a popular term when I first started working at Business Objects in 2001, to corporate performance management (CPM) and “the balanced scorecard.” From business process management (BPM) to “big data”, and now the architectures and tools that everyone is talking about.

The one golden thread that ties each of these terms, ideas and innovations together is that each is aiming to solve the questions related to what we today call “big data.” At the core of it all, we are searching for the right way to harness and understand the explosion of data and analytics that today’s organizations are faced with. People call this the “logical data warehouse”, “big data architecture”, “next-generation data architecture”, “modern data architecture”, “unified data architecture”, or (as I just saw last week) “unified data platform”. What is all the fuss about, and what is really new? My goal in this post and the next few will be to explain how the customers I work with are attacking the “big data” problem. We call it the Teradata Unified Data Architecture, but whatever you call it, the goals and concepts remain the same.

Mark Beyer from Gartner is credited with coining the term “logical data warehouse” and there is an interesting story and explanation. A nice summary of the term is,

The logical data warehouse is the next significant evolution of information integration because it includes ALL of its progenitors and demands that each piece of previously proven engineering in the architecture should be used in its best and most appropriate place.  …

And

… The logical data warehouse will finally provide the information services platform for the applications of the highly competitive companies and organizations in the early 21st Century.”

The idea of this next-generation architecture is simple: When organizations put ALL of their data to work, they can make smarter decisions.

It sounds easy, but as data volumes and data types explode, so does the need for more tools in your toolbox to help make sense of it all. Within your toolbox, data is NOT all nails and you definitely need to be armed with more than a hammer.

In my view, enterprise data architectures are evolving to let organizations capture more data. This data was previously untapped because the hardware costs required to store and process such enormous amounts were simply too high. However, the declining cost of hardware (thanks to Moore’s law) has opened the door for more data (types, volumes, etc.) and for new processing technologies to be successful. But no single technology can be engineered and optimized for every dimension of analytic processing, including scale, performance and concurrent workloads.

Thus, organizations are creating best-of-breed architectures by taking advantage of new technologies and workload-specific platforms such as MapReduce, Hadoop, MPP data warehouses, discovery platforms and event processing, and putting them together into a seamless, transparent and powerful analytic environment. This modern enterprise architecture enables users to get deep business insights and allows ALL data to be available to an organization, creating competitive advantage while lowering the total system cost.

But why not just throw all your data into files and put a search engine like Google on top? Why not just build a data warehouse and extend it with support for “unstructured” data? Because, in the world of big data, the one-size-fits-all approach simply doesn’t work.

Different technologies are more efficient at solving different analytical or processing problems. To steal an analogy from Dave Schrader—a colleague of mine—it’s not unlike a hybrid car. The Toyota Prius can average 47 mpg with hybrid (gas and electric) vs. 24 mpg with a “typical” gas-only car – almost double! But you do not pay twice as much for the car.

How’d they do it? Toyota engineered a system that uses gas when I need to accelerate fast (and also to recharge the battery at the same time), electric mostly when driving around town, and braking to recharge the battery.

Three components integrated seamlessly – the driver doesn’t need to know how it works. It is the same idea with the Teradata UDA, which is a hybrid architecture for extracting the most insights per unit of time – at least doubling your insight capabilities at reasonable cost. And business users don’t need to know all of the gory details. Teradata builds analytic engines—much like the hybrid drive train Toyota builds—that are optimized and used in combination with different ecosystem tools depending on customer preferences and requirements, within their overall data architecture.

In the case of the hybrid car, battery power and regenerative braking are the “new innovations” combined with the gas-powered engine. Similarly, several innovations in data management and analytics are shaping the unified data architecture, such as discovery platforms and Hadoop. Each customer’s architecture is different depending on requirements and preferences, but the Teradata Unified Data Architecture recommends three core components of a comprehensive architecture: a data platform (often called a “Data Lake”), a discovery platform and an integrated data warehouse. There are other components, such as event processing, search and streaming, which can be used in data architectures, but I’ll focus on the three core areas in this blog post.

Data Lakes

In many ways, this is not unlike the operational data store we’ve seen between transactional systems and the data warehouse, but the data lake is bigger and less structured. Any file can be “dumped” in the lake with no attention to data integration or transformation. New technologies like Hadoop provide a file-based approach to capturing large amounts of data without requiring ETL in advance. This enables large-scale processing for refining, structuring and exploring data prior to downstream analysis in workload-specific systems, which are used to discover new insights and then move those insights into business operations for use by hundreds of end users and applications.

Discovery Platforms

Discovery platforms are a new class of workload-specific system, optimized to perform multiple analytic techniques in a single workflow, combining SQL with statistics, MapReduce, graph or text analysis to look at data from multiple perspectives. The goal is ultimately to provide more granular and accurate insights to users about their business. Discovery platforms enable a faster investigative analytical process to find new patterns in data and to identify types of fraud or consumer behavior that traditional data mining approaches may have missed.

Integrated Data Warehouses

With all the excitement about what’s new, companies quickly forget the value of consistent, integrated data for reuse across the enterprise. The integrated data warehouse has become a mission-critical operational system, the point of value realization or “operationalization” for information. The data within a massively parallel data warehouse has been cleansed and provides a consistent source of data for enterprise analytics. By integrating relevant data from across the entire organization, a couple of key goals are achieved. First, organizations can answer the kind of sophisticated, impactful questions that require cross-functional analyses. Second, they can answer questions more completely by making relevant data available across all levels of the organization. Data lakes (Hadoop) and discovery platforms complement the data warehouse by enriching it with new data and new insights that can then be delivered to thousands of users and applications with consistent performance (i.e., they get the information they need quickly).

A critical part of incorporating these novel approaches to data management and analytics is putting new insights and technologies into production in reliable, secure and manageable ways for organizations.  Fundamentals of master data management, metadata, security, data lineage, integrated data and reuse all still apply!

The excitement of experimenting with new technologies is fading. More and more, our customers are asking us about ways to put the power of new systems (and the insights they provide) into large-scale operation and production. This requires unified system management and monitoring, intelligent query routing, metadata about incoming data and the transformations applied throughout the data processing and analytical process, and role-based security that respects and applies data privacy, encryption and other required policies. This is where I will spend a good bit of time in my next blog post.