
Integrate Data, Processes & People: End Data Ownership Turf Wars

Posted on: November 18th, 2014 by Guest Blogger

 

The biggest thing I’ve realized over the past couple of months -- other than that Tony Romo is one of the best NFL players I’ve ever seen -- is that data ownership is still a huge problem for companies of all types. Tony’s numbers keep getting more impressive game by game, and likewise the pace of data streaming into organizational information channels rises by the hour.

Romo aside, the dramatic growth of data volume only seems to rekindle data ownership issues among internal departments that don’t or won’t see the advantages of sharing information. Call it data ‘possessiveness.’ I really thought -- and I said as much in numerous presentations over the summer -- that we’d solved the data ownership problem. After all, the industry seems to have understood the value of data integration and optimization, and a ‘single view of the customer,’ once just a hazy vision in the distance, is now becoming a technological reality. So isn’t ‘data sharing’ a no-brainer?

But after presentations at a few conferences this fall, including Teradata Partners, folks have come to say things like, “Everyone in all of our marketing areas wants data from the Customer Insights group -- but they won’t play nice and share it.” Others have made similar complaints at reception chats or over lunch. These little ‘data dramas’ and ‘turf tussles’ surprise me.

What also continues to surprise me is the lack of marketing participation at many IT gatherings. When I lead a session on how to use big data in marketing, the audience is usually 90% data scientists or IT specialists and 10% marketing. On Sunday, in my well-attended session at the Partners event, there were zero marketers. What’s up with that? Where are those savvy ‘data-driven’ marketers?

On to Monday, where we held a lunch for 30 retailers and CPG firms. Other than the one analyst from IDC, the rest were from companies like Safeway, Target, HEB, Williams-Sonoma, Hallmark, etc. While some work in marketing, none viewed themselves as ‘marketers’ – they were data scientists or IT specialists. Great people, truly interested in Dynamic Customer Strategy, and like Sunday’s session, the lunch went very well. But again, no CMOs, no marketing directors, no merchandisers. Baffling!

Trend-watchers report that marketing is supposed to be the biggest spender on IT by 2017, outpacing the IT department. Somehow, the CMO is supposed to become the most proficient IT buyer on the planet. So when does the foundational due diligence take place?

Reading a few white papers or looking at where a particular solution lands in some magic matrix is not sufficient. Someone thinks, “Oh, we can make money with marketing automation tools here. Let’s get one and get some data and go to work.” And then they demand the data from the data group, or from some other group, so they can get their work done without thinking about the greater good. Sharing data always results in the greater good, right?

Data possessiveness has become the modern tragedy of the commons, a phrase coined to describe the overgrazing that occurs when everyone shares a common pasture (like the Boston Common).

In this modern-day tragedy, there are two outcomes. First comes technology bloat, and with technology bloat come lots of little not-playing-well-with-others data sets and an insufficient data strategy. Maybe these tools can import data -- but not export it.

In case you are unfamiliar with the term, technology bloat was coined by my former student and now consultant Ben Becker (beckerstrategies.com) to describe the common situation of multiple overlapping software solutions. In environments where data silos and turf battles over applications exist, technology bloat is a huge challenge for IT: Multiple systems to support when one would do, budget-crushing agreements when rationalization would be less expensive, and so on and so on.

That’s why I found it interesting that Michael Koehler, Teradata’s CEO, emphasized integration as the key watchword for 2015. Integration clearly is a play that works well for Teradata, especially with its full suite of solutions. But when marketing is spending more than IT on IT and doesn’t know how to buy it well, that’s a tall challenge.

Another causal factor in technology bloat is how marketing budgets and spends its funds for IT. The acquisition budget may not even be an IT budget; it may come out of monies allocated to a particular program or profit center. The campaigns budget, for example, might be used to buy a campaign management tool. As long as revenue targets are hit, all is well from a budgetary perspective, at least as far as marketing management is concerned. Never mind that it’s the third campaign management tool the organization has purchased.

Similarly, there's the revenue ownership problem. In spite of attribution modeling that can weight the effectiveness of each element in the marketing mix and apportion revenue accordingly, each profit center is unwilling to share revenue or customers. The result is customers who delete and ignore every marketing message from their former partners -- partners who now over-market because they won't or can't share data. In my department alone, I know of at least three different CRM systems.
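To make the attribution point concrete, here is a minimal sketch of how weighted attribution can apportion a single order’s revenue across the touchpoints that influenced it. The table and column names are hypothetical, and the weights would come from whatever attribution model the organization trusts.

    -- Each touchpoint receives a share of the order's revenue in proportion
    -- to its modeled effectiveness weight (weights need not sum to 1).
    SELECT o.order_id,
           t.channel,
           o.revenue * t.weight
             / SUM(t.weight) OVER (PARTITION BY o.order_id) AS attributed_revenue
    FROM   orders o
    JOIN   touchpoints t
      ON   t.order_id = o.order_id;

When revenue is split this way, no single profit center has to ‘own’ the whole customer for its contribution to be counted.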

Moreover, marketers just want to do marketing, and especially the cool marketing. I get that. It's fun to see marketing strategy actually lead to revenue, whether you're in B2C, where actual sales are triggered immediately, or B2B, where the work is mostly at the top of the funnel.

I suspect, though, that the problem is greater in B2B. When we did the study in retailing earlier this year, we were far less likely to identify data ownership as a bottleneck. Retailers are more mature than B2B in the whole data thing anyway, but there are also fewer marketing areas. B2B companies tend to be organized by product or vertical market and each operates as a separate business unit. Marketing departments or teams proliferate, and that leads to technology bloat etc.

While the simple answer might be that IT should be making more decisions, I don’t see that as realistic. And if Koehler is correct that we’re in for a period of integration, then I suspect that will mean consolidation of tools and applications into suites. What’s interesting to me is that Teradata seems to be the lone voice, even among full-suite providers, crying out to end technology bloat.

However, I agree that a period of integration is coming. When the CIO can demonstrate to the CMO how integration can improve revenue through better data and marketing strategies while reducing costs (and in that order), most CMOs will make that move.

Data possessiveness will fade as the benefits of sharing integrated data become ubiquitous and irresistible. Data dramas will cease, and marketers will more enthusiastically participate in IT conferences.

It’s called teamwork – something Tony Romo totally understands.


Dr. Jeff R. Tanner is Professor of Marketing and the Executive Director of Baylor University’s Innovative Business Collaboratory. He regularly speaks at conferences such as CRM Evolution, Teradata Partners, Retail Technology, INFORMS, and others. Author or co-author of 15 books, including his newest, Analytics and Dynamic Customer Strategy, he is an active consultant to organizations such as Lawrence Livermore National Laboratory, Pearson-Prentice Hall, and Cabela’s.

 

It happens every few years and it’s happening again. A new technology comes along and a significant segment of the IT and business community wants to toss out everything we’ve learned over the past 60 years and start fresh. We “discover” that we’ve been wasting time applying unnecessary rigor and bureaucracy to our projects. No longer should we have to wait three to six months or longer to deliver technical solutions to the business. We can turn these things around in three to six days or even less.

In the mid-1990s, I was part of a team that developed a “pilot” object-oriented, client-server (remember when these were the hot buzzwords?) application to replenish raw materials for a manufacturing function. We were upending the traditional mainframe world by delivering a solution quickly and iteratively with a small team. When the end users started using the application in real life, it was clear they were going to rely on it to do their jobs every day. Wait, was this a pilot or…? I would come into work in the morning, walk into a special room that housed the application and database servers, check the logs, note any errors, make whatever fixes needed to be made, re-run jobs, and so on.

It wasn’t long before this work began to interfere with my next project, and the end users became frustrated when I wasn’t available to fix problems quickly. It took us a while and several conversations with operations to determine that “production” didn’t just mean “the mainframe”. “Production” meant that people were relying on the solution on a regular basis to do their jobs. So we backtracked and started talking about what kind of availability guarantees we could make, how backup and recovery should work, how we could transition monitoring and maintenance to operations, and so on. In other words, we realized what we needed was a traditional IT project that just happened to leverage newer technologies.

This same scenario is happening today with Hadoop and related tools. When I visit client organizations, a frightening number will have at least one serious person saying something like, “I really don’t think ‘data warehousing’ makes sense any more. It takes too long. We should put all our data in Hadoop and let our end users access whatever they want.” It is indeed a great idea to establish an environment that enables exploration and quick-turnaround analysis against raw data and production data. But to position this approach as a core data and analytics strategy is nothing short of professional malpractice.

The problem is that people are confusing experimentation with IT projects. There is a place for both, and there always has been. Experimentation (or discovery, research, ad-hoc analysis, or whatever term you wish to use) should have lightweight processes and data management practices – it requires prioritization of analysis activity, security and privacy policies and implementation, some understanding of available data, and so on, but it should not be overburdened with the typical rigor required of projects that are building solutions destined for production. Once a prototype is ready to be used on a regular basis for important business functions, that solution should be built through a rigorous IT project leveraging an appropriate – dare I say it – solution development life cycle (SDLC), along with a comprehensive enterprise architecture plan including, yes, a data warehouse that provides integrated, shared, and trusted production data.

An experimental prototype should never be “promoted” to a production environment. That’s what a project is for. Experimentation can be accomplished with Hadoop, relational technology, Microsoft Office, and many other technologies. These same technologies can also be used for production solutions. So, it’s not that “things are done differently and more quickly in Hadoop”. Instead, it’s more appropriate to say that experimentation is different than an IT project, regardless of technology.

Yes, we should do everything we can to reduce unnecessary paperwork and to speed up delivery using proper objective setting, scoping, and agile development techniques. But that is different than abandoning rigor altogether. In fact, using newer technologies in IT projects requires more attention to detail, not less, because we have to take the maturity of the technology into consideration. Can it meet the service level needs of a particular solution? This needs to be asked and examined formally within the project.

Attempting to build production solutions using ad-hoc, experimental data preparation and analysis techniques is like building a modern skyscraper with a grass hut mentality. It just doesn’t make any sense.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries. 

Take a Giant Step with Teradata QueryGrid

Posted on: April 29th, 2014 by Dan Graham

 

Teradata 15.0 has gotten tremendous interest from customers and the press because it enables SQL access to native JSON data. This heralds the end of the belief that data warehouses can’t handle unstructured data. But there’s an equally momentous innovation in this release called Teradata QueryGrid.

What is Teradata QueryGrid?
In Teradata’s Unified Data Architecture (UDA), there are three primary platforms: the data warehouse, the discovery platform, and the data platform. In the UDA diagram, huge gray arrows represent data flowing between these systems. A year or two ago, these arrows were extract files moved in batch mode.

Teradata QueryGrid is both a vision and a technology. The vision --simply said-- is that a business person connected to the Teradata Database or Aster Database can submit a single SQL query that joins data together from a second system for analysis. There’s no need to plead with the programmers to extract data and load it into another machine. The business person doesn’t have to care where the data is – they can simply combine relational tables in Teradata with tables or flat files found in Hadoop on demand. Imagine a data scientist working on an Aster discovery problem and needing data from Hadoop. By simply adjusting the queries she is already using, Hadoop data is fetched and combined with tables in the Aster Database. That should be a huge “WOW” all by itself but let’s look further.

You might be saying “That’s not new. We’ve had data virtualization queries for many years.” Teradata QueryGrid is indeed a form of data virtualization. But Teradata QueryGrid doesn’t suffer from the normal limitations of data virtualization such as slow performance, clogged networks, and security concerns.

Today, the vision is translated into reality as connections between Teradata Database and Hadoop as well as Aster Databases and Hadoop. Teradata QueryGrid also connects the Teradata Data Warehouse to Oracle databases. In the near future, it will extend to all combinations of UDA servers such as Teradata to Aster, Aster to Aster, Teradata to Teradata, and so on.

Seven League Boots for SQL
With QueryGrid, you can add a clause in a SQL statement that says “Call up Hadoop, pass Hive a SQL request, receive the Hive results, and join it to the data warehouse tables.” Running a single SQL statement spanning Hadoop and Teradata is amazing in itself – a giant step forward. Notice too that all the database security, advanced SQL functions, and system management in the Teradata or Aster system is supporting these queries. The only effort required is for the database administrator to set up a “view” that connects the systems. It’s self-service for the business user after that. Score: complexity zero, business users one.
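Here is a rough sketch of what that DBA-defined view and the resulting self-service query might look like. The foreign-server name (hdp_server) and all table and column names are hypothetical, and exact QueryGrid syntax varies by Teradata release; treat this as an illustration rather than reference syntax.

    -- DBA sets up a view that joins a warehouse table to a Hive table
    -- reachable through a QueryGrid foreign server (here called hdp_server).
    CREATE VIEW sales_with_weblogs AS
    SELECT s.customer_id,
           s.sales_amount,
           w.page_viewed,
           w.view_time
    FROM   sales_fact s
    JOIN   weblogs@hdp_server w      -- remote Hadoop/Hive table
      ON   s.customer_id = w.customer_id;

    -- Business users then query the view like any other table.
    SELECT customer_id, SUM(sales_amount) AS total_sales
    FROM   sales_with_weblogs
    WHERE  page_viewed = 'checkout'
    GROUP BY customer_id;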

Parallel Performance, Performance, Performance
Historically, data virtualization tools lack the ability to move data between systems in parallel. Such tools send a request to a remote database and the data comes back serially through an Ethernet wire. Teradata QueryGrid is built to connect to remote systems in parallel and exchange data through many network connections simultaneously. Wanna move a terabyte per minute? With the right configurations it can be done. Parallel processing by both systems makes this incredibly fast. I know of no data virtualization system that does this today.

Inevitably, the Hadoop cluster will have a different number of servers compared to the Teradata or Aster MPP systems. The Teradata and Aster systems start the parallel data exchange by matching up units of parallelism between the two systems. That is, all the Teradata parallel workers (called AMPs) connect to a buddy Hadoop worker node for maximum throughput. Anytime the configuration changes, the worker match-up changes. This is non-trivial rocket-science class technology. Trust me – you don’t want to do this yourself, and the worst situation would be to expose this to the business users. But Teradata QueryGrid does it all for you, completely invisible to the user.

Put Data in the Data Lake FAST
Imagine complex predictive analytics using R® or SAS® are run inside the Teradata data warehouse as part of a merger and acquisition project. In this case, we want to pass this data to the Hadoop Data Lake where it is combined with temporary data from the company being acquired. With moderately simple SQL stuffed in a database view, the answers calculated by the Teradata Database can be sent to Hadoop to help finish up some reports. Bi-directional data exchange is another breakthrough in Teradata QueryGrid, new in release 15.0. The common thread in all these innovations is that the data moves from the memory of one system to the memory of the other. No extracts, no landing the data on disk until the final processing step – and sometimes not even then.
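Assuming the same hypothetical foreign server as in the earlier sketch, the outbound half of that exchange can be pictured as an INSERT…SELECT that targets the remote table; exact syntax and capabilities depend on the QueryGrid connector and release.

    -- Push warehouse-computed scores out to a Hive table for the
    -- merger-and-acquisition reporting work happening in Hadoop.
    INSERT INTO acquisition_scores@hdp_server
    SELECT customer_id,
           churn_score,
           score_date
    FROM   warehouse_model_scores;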

Push-down Processing
What we don’t want to do is transfer terabytes of data from Hadoop and throw away 90% of it since it’s not relevant. To minimize data movement, Teradata QueryGrid sends the remote system SQL filters that eliminate records and columns that aren’t needed. An example constraint could be “We only want records for single women age 30-40 with an average account balance over $5000. Oh, and only send us the account number, account type, and address.” This way, the Hadoop system discards unnecessary data so it doesn’t flood the network with data that will be thrown away. After all the processing is done in Hadoop, data is joined in the data warehouse, summarized, and delivered to the user’s favorite business intelligence tool.
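In SQL terms, that constraint is just a predicate and a short column list on the remote table (hypothetical names again); QueryGrid pushes both down so only qualifying rows and columns cross the network.

    SELECT acct_nbr, acct_type, address        -- only three columns come back
    FROM   accounts@hdp_server                 -- remote Hadoop/Hive table
    WHERE  marital_status = 'Single'
      AND  gender = 'F'
      AND  age BETWEEN 30 AND 40
      AND  avg_balance > 5000;                 -- filters evaluated on the Hadoop side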

Teradata QueryGrid delivers some important benefits:
• It’s easy to use: any user with any BI tool can do it
• Low DBA labor: it’s mostly setting up views and testing them once
• High performance: reducing hours to minutes means more accuracy and faster turnaround for demanding users
• Cross-system data on demand: don’t get stuck in a programmer’s work queue
• Teradata/Aster strengths: security, workload management, system management
• Minimum data movement improves performance and reduces network use
• Move the processing to the data

Big data is now taking giant steps through your analytic architecture --frictionless, invisible, and in parallel. Nice boots!

 

In the Star Trek universe, “the Borg” is an alien race that conquers planet after planet, absorbing the people, technology, and resources into the Borg collective. Even Captain Picard becomes a Borg and chants “We are the Borg. You will be assimilated. Resistance is futile.”

It strikes me that the relational database has behaved similarly since its birth. Over the last thirty years, Teradata and other RDBMS vendors have innovated and modernized, constantly revitalizing what it means to be an RDBMS. But some innovations come from start-up companies that are later assimilated into the RDBMS. And some innovations are reactions to competition. Regardless, many innovations eventually end up in the code base of multiple RDBMS vendor products --with proper respect to patents of course. Here are some examples of cool technologies assimilated into Teradata Database:

• MOLAP cubes storm the market in the late 1990s, with Essbase setting the pace and Cognos inventing desktop cubes. MicroStrategy and Teradata team up to push ROLAP SQL down into the database for parallel speed. Hyperion Essbase and Teradata also did Hybrid OLAP integration together. Essbase gets acquired, MOLAP cubes fall out of fashion, and in-database ROLAP goes on to provide the best of both worlds as CPUs get faster.

• Early in the 2000s, a startup called Sunopsis shows a distinct advantage of running ELT transformations in-database to get parallel performance with Teradata. ELT takes off in the industry like a rocket. Teradata Labs also collaborates with Informatica to push-down PowerCenter transformation logic into SQL for amazing extract, load, and transform speed. Sunopsis gets acquired. More ETL vendors adopt ELT techniques. Happy DBAs and operations managers meet their nightly batch performance goals. More startups disappear.

• XML and XQuery become the rage in the press -- until most every RDBMS adds a data type for XML, plus shred and unshred operators. XML-only database startups are marginalized.

• The uptick of predictive analytics in the market drives collaboration between Teradata and SAS back in 2007. SAS Procs are pushed down into the database to run massively parallel, opening up tremendous performance benefits for SAS users. Many RDBMS vendors go on to copy this technique; SAS is in the limelight, and eventually even Hadoop programmers want to run SAS in parallel. Later we see “R,” Fuzzy Logix, and others run in-database too. Sounds like the proverbial win-win to me.

• In-memory technology from QlikView and TIBCO Spotfire excites the market with order-of-magnitude performance gains. Several RDBMS vendors then adopt in-memory concepts. But in-memory has limitations on memory size and cost vis-à-vis terabytes of data. Consequently, Teradata introduces Teradata Intelligent Memory, which caches hot data automatically in memory while managing many terabytes of hot and cold data on disk. Two to three percent of the hottest data is managed by data temperature (that is, how popular it is with users), delivering superfast response time. Cool! Or is it hot?

• After reading the Google research paper on MapReduce, a startup called Aster Data invents SQL-MapReduce (SQL-MR) to add flexible processing to a flexible database engine. This cool innovation leads Teradata to acquire Aster Data. Within a year, Aster strikes a nerve across the industry – MapReduce is in-database! This month, Aster earns numerous #1 scores in Ovum’s “Decision Matrix: Selecting an Analytic Database 2013-14” (January 2014). The race is on for MapReduce in-database!

• The NoSQL community grabs headlines with their unique designs and reliance on JSON data and key-value pairs. MongoDB is hot, using JSON data while CouchBase and Cassandra leverage key-value stores. Teradata promptly decides to add JSON data (unstructured data) to the database and goes the extra mile to put JSONPath syntax into SQL. Teradata also adds the name-value-pair SQL operator (NVP) to extract JSON or key-value store data from weblogs. Schema-on-read technology gets assimilated into the Teradata Database. Java programmers are pleased. Customers make plans. More wins.
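For a feel of what that looks like in a query, here is a hedged sketch of JSON dot notation, the JSONExtractValue method, and the NVP operator. Table and column names are invented, and exact notation and method names can differ by Teradata release.

    -- JSONPath-style access into a JSON column: dot notation and a method call.
    SELECT o.order_json.customer.name                       AS customer_name,
           o.order_json.JSONExtractValue('$.items[0].sku')  AS first_sku
    FROM   orders_json o;

    -- NVP: pull the value of 'userid' out of a weblog query string
    -- such as 'userid=1234&page=home&ref=email'.
    SELECT NVP(query_string, 'userid', '&', '=', 1) AS userid
    FROM   weblog_hits;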

--------------------------------------------------------------------------------------------------------

“One trend to watch going forward, in addition to the rise of multi-model NoSQL databases, is the integration of NoSQL concepts into relational databases. One of the methods used in the past by relational database vendors to restrict the adoption of new databases to handle new data formats has been to embrace those formats within the relational database. Two prime examples would be support for XML and object-oriented programming.”
- Matt Aslett, The 451 Group, Next-Generation Operational Databases 2012-2016, Sep 17, 2013

--------------------------------------------------------------------------------------------------------

I’ve had conversations with other industry analysts and they’ve confirmed Matt’s opinion: RDBMS vendors will respond to market trends, innovations, and competitive threats by integrating those technologies into their offering. Unlike the Borg, a lot of these assimilations by RDBMS are friendly collaborations (MicroStrategy, Informatica, SAS, Fuzzy Logix, Revolution R, etc.). Others are just the recognition of new data types that need to be in the database (JSON, XML, BLOBs, geospatial, etc.).

Why is it good to have all these innovations inside the major RDBMS’s? Everyone is having fun right now with their science projects because hype is very high for this startup or that startup or this shiny new thing. But when it comes time to deploy production analytic applications to hundreds or thousands of users, all the “ities” become critical all of a sudden – “ities” that the new kids don’t have and the RDBMS does. “ities” like reliability, recoverability, security, and availability. Companies like Google can bury shiny new 1.oh-my-god quality software in an army of brilliant computer scientists. But Main Street and Wall Street companies cannot.

More important, many people are doing new multi-structured data projects in isolation -- such as weblog analysis, sensor data, graph analysis, or social text analysis. Soon enough they discover the highest value comes from combining that data with all the rest of the data that the organization has collected on customers, inventories, campaigns, financials, etc. Great, I found a new segment of buyer preferences. What does that mean to campaigns, sales, and inventory? Integrating new big data into an RDBMS is a huge win going forward – much better than keeping the different data sets isolated in the basement.

Like this year’s new BMW or Lexus, RDBMS’s modernize -- they define modern. But unlike cars, relational database systems don’t grow old; they don’t rust or wear out. RDBMS’s evolve to stay current and constantly introduce new technology.

We are the RDBMS! Technology will be assimilated. Resistance is futile.

Anna Littick and the Unified Data Architecture — Part 2

Posted on: October 16th, 2013 by Dan Graham

 

Ring ring ringtone.
Dan: “Hello. This is Dan at Teradata. How can I help you today?”

Anna: “Hi Dan. It’s Anna Littick from Sunshine-Stores calling again. Can we finish our conversation?”

Dan: “Oh yeah, hi Anna. Sure. Where did we leave off?”

Anna: “Well, you remember our new CFO – Xavier Money -- wants us to move everything to Hadoop because he thinks it’s all free. You and I were ticking through his perceptions.”

Dan: “Yes. I think we got through the first two but not numbers 3 and 4. Here’s what I remember:
1. Hadoop replaces the data warehouse
2. Hadoop is a landing zone and archive
3. Hadoop is a database
4. Hadoop does deep analytics.”

Anna: “Yep. So how do I respond to Xavier about those two?”

Dan: “Well, I guess we should start with ‘what is a database?’ I’ll try to keep this simple. A database has these characteristics:
• High performance data access
• Robust high availability
• A data model that isolates the schema from the application
• ACID properties

There’s a lot more to a database, but these are the minimums. High speed is the name of the game for databases. Data has to be restructured and indexed, with a cost-based optimizer, to be fast. Hive and Impala do a little restructuring of data but are a long way off from sophisticated indexes, partitioning, and optimizers. Those things take many years – each. For example, Teradata Database has multiple kinds of indexes like join indexes, aggregate indexes, hash indexes, and sparse indexes.”

Anna: “Ouch. What about the other stuff? Does Hive or Impala have that?”

Dan: “Well, high performance isn’t interesting if the data is not available. Between planned and unplanned downtime, a database has to hit 99.99% uptime or better to be mission critical. That’s roughly 53 minutes of downtime a year (525,600 minutes in a year times 0.0001). Hundreds of hardware, software, and installation features have to mature to get there. I’m guessing a well-built Hadoop cluster is around 99% uptime. But just running out of memory in an application can cause the cluster to crash. There’s a lot of work to be done in Hadoop.”

“Second, isolating the application programs from the schema is the opposite of Hadoop’s strategic direction of schema-on-read. They don’t want fixed data types and data rules enforcement. On the upside this means Hadoop has a lot of flexibility – especially with complex data that changes a lot. On the downside, we have to trust every programmer to validate and transform every data field correctly at runtime. It’s dangerous and exciting at the same time. Schema-on-read works great with some kinds of data, but the majority of data works better with a fixed schema.”

Anna: “I’ll have to think about that one. I like the ‘no rules’ flexibility but I don’t like having to scrub the incoming data every time. I already spend too much time preparing data for predictive analytics.”

Dan: “Last is the ACID properties. It’s a complex topic you should look at on Wikipedia. It boils down to trusting the data as it’s updated. If a change is made to an account balance, ACID ensures all the changes are applied or none, that no one else can change it at the same time you do, and that the changes are 100% recoverable across any kind of failure. Imagine you and your spouse at an ATM withdrawing $500 when there’s only $600 in the account. The database can’t give both of you $500 – that’s ACID at work. Neither Hadoop, Hive, Impala, nor any other project has plans to build the huge ACID infrastructure and become a true database. The Hadoop system isn’t so good at updating data in place.”
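A minimal sketch of that ATM scenario in generic SQL (hypothetical table; transaction syntax varies by database and session mode) shows what ACID buys: the conditional update and the commit succeed or fail as a unit, and isolation keeps two concurrent withdrawals from both seeing the same $600 balance.

    BEGIN TRANSACTION;

    -- Withdraw $500 only if the funds are actually there; row locking
    -- prevents a concurrent session from spending the same balance.
    UPDATE accounts
    SET    balance = balance - 500
    WHERE  account_id = 12345
      AND  balance >= 500;

    -- If the UPDATE touched no rows, the application rolls back instead.
    COMMIT;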

“According to Curt Monash, ‘Developing a good DBMS requires 5-7 years and tens of millions of dollars. That’s if things go extremely well.’1”

Anna: “OK, Hadoop and Hive and Impala aren’t a database. So what? Who cares what you call it?”

Dan: “Well, a lot of end users, BI tools, ETL tools, and skills are expecting Hadoop to behave like a database. That’s not fair. It was not built for that purpose. Not being a database means Hadoop lacks a lot of functionality, but it also forces Hadoop to innovate and differentiate its strengths. Let’s not forget Hadoop’s progress in basic search indexing, archival of cold data, simple reporting at scale, and image processing. We’re at the beginning of a lot of innovation and it’s exciting.”

Anna: “OK. I’ll trust you on that. What about deep analytics? That’s what I care about most.”

Dan: “So Anna, off the record, you being a data scientist and all that. Do people tease you about your name? I mean Anna Littick the data scientist? I Googled you and you’re not the only one. ”

Anna: “Yes. Some guys around here think it’s funny. Apparently childishness isn’t limited to children. So during meetings I throw words at them like Markov Chains, Neural Networks, and edges in graph partitions. They pretend to understand --they nod a lot. Those guys never tease me again. [laugh]”

Dan: “Hey, those advanced analytics you mentioned are powerful stuff. You should hear David Simmen talk at our PARTNERS conference on Sunday. He’s teaching about our new graph engine that handles millions of vertices and billions of edges. It sounds like you would enjoy it.”

Anna: “Well, it looks like I have approval to go, especially since PARTNERS is here in Dallas. Enough about me. What about deep analytics in Hadoop?”

Dan: “Right. OK, well first I have to tell you we do a lot of predictive and prescriptive analytics in-database with Teradata. I suspect you’ve been using SAS algorithms in-database already. The parallelism makes a huge difference in accuracy. What you probably haven’t seen is our Aster Database, where you can run map-reduce algorithms under the control of SQL for fast, iterative discovery. It can run dozens of complex analytic algorithms, including map-reduce algorithms, in parallel. And we just added the graph engine I mentioned in our 6.0 release. One thing it does that Hadoop doesn’t: you can use your BI tools, SAS procs, and map-reduce all in one SQL statement. It’s ultra cool.”

Anna: “OK. I think I’ll go to David’s session. But what about Hadoop? Can it do deep analytics?”

Dan: “Yes. Both Aster and Hadoop can run complex predictive and prescriptive analytics in parallel. They can both do statistics, random forests, Markov Chains, and all the basics like naïve Bayes and regressions. If an algorithm is hard to do in SQL, these platforms can handle it.”

Anna [impatient]: “OK. I’ll take the bait. What’s the difference between Aster and Hadoop?”

Dan: “Well, Aster has a database underneath its SQL-MapReduce so you can use the BI tools interactively. There is also a lot of emphasis on behavioral analysis so the product has things like Teradata Aster nPath time-series analysis to visualize patterns of behavior and detect many kinds of consumer churn events or fraud. Aster has more than 80 algorithms packaged with it as well as SAS support. Sorry, I had to slip that Aster commercial in. It’s in my contract --sort of. Maybe. If I had a contract.”
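As a rough illustration of the nPath style of analysis Dan is describing -- the table, columns, and pattern are hypothetical, and clause details vary by Aster release -- a time-series query over a clickstream might look something like this:

    -- Find sessions that start at the home page, browse products,
    -- and then abandon the cart, returning the full click path.
    SELECT user_id, click_path
    FROM nPath(
      ON clickstream
      PARTITION BY user_id
      ORDER BY click_time
      MODE (NONOVERLAPPING)
      PATTERN ('HOME.PRODUCT*.ABANDON')
      SYMBOLS (page = 'home'      AS HOME,
               page = 'product'   AS PRODUCT,
               page = 'cart_exit' AS ABANDON)
      RESULT (FIRST(user_id OF HOME) AS user_id,
              ACCUMULATE(page OF ANY(HOME, PRODUCT, ABANDON)) AS click_path)
    );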

Anna: “And what about Hadoop?”

Dan: “Hadoop is more of a do-it-yourself platform. There are tools like Apache Mahout for data mining. It doesn’t have as many algorithms as Aster, so you often find yourself getting algorithms from university research or GitHub and implementing them yourself. Some Teradata customers have implemented Markov Chains on Hadoop because it’s much easier to work with than SQL for that kind of algorithm. So data scientists have more tools than ever with Teradata in-database algorithms, Aster SQL-MapReduce, SAS, and Hadoop/Mahout and others. That’s what our Unified Data Architecture does for you – it matches workloads to the best platform for that task.”

Anna: “OK. I think I’ve got enough information to help our new CFO. He may not like me bursting his ‘free-free-free’ monastic chant. But just because we can eliminate some initial software costs doesn't mean we will save any money. I’ve got to get him thinking of the big picture for big data. You called it UDA, right?”

Dan: “Right. Anna, I’m glad I could help, if only just a little. And I’ll send you a list of sessions at Teradata PARTNERS where you can hear from experts about their Hadoop implementations – and Aster. See you at PARTNERS.”

Title | Company | Day | Time | Comment
Aster Analytics: Delivering results with R Desktop | Teradata | Sun | 9:30 | RevolutionR
Do’s and Don’ts of using Hadoop in practice | Otto | Sun | 1:00 | Hadoop
Graph Analysis with Teradata Aster Discovery Platform | Teradata | Sun | 2:30 | Graph
Hadoop and the Data Warehouse: When to use Which | Teradata | Sun | 4:00 | Hadoop
The Voices of Experience: A Big Data Panel of Experts | Otto, Wells Fargo | Wed | 9:30 | Hadoop
An Integrated Approach to Big Data Analytics using Teradata and Hadoop | PayPal | Wed | 11:00 | Hadoop
TCOD: A Framework for the Total Cost of Big Data | WinterCorp | Wed | 11:00 | Costs

 1 Curt Monash, DBMS development and other subjects, March 18, 2013

Big Elephant Eats Data Warehouse

Posted on: September 19th, 2013 by Dan Graham

 

-- Teradata PR pit boss: “Dan, have you seen this Big Elephant Eats Data Warehouse article at BigMedia.com? This cub reporter guy’s like Rip Van Winkle waking up and trying to explain how the iPhone works. He’s just making things up. Get this Willy Everlern reporter on the phone.”

Ring ring ringtone. “Hello, Willy? Willy Everlern? This is Dan at Teradata again.”
--Willy: “Oh hi Dan. What’s happening out in Silicon Valley?”

--Dan: “It’s your latest blog Willy. That Big Elephant Eats Data Warehouse is clear, simple, and wrong. Hadoop has not stalled our data warehouse sales at all.”

--Willy: “Hey, I didn’t say that. Read it again. It says ‘Hadoop is eating the data warehouse market. Database heavyweights like Teradata are seeing slow growth because of Hadoop.’ See, I said slow growth --not no growth.”

--Dan: “Iszzat so? Willy, Hadoop is not Godzilla stomping on data warehouses --it’s a cute baby elephant, remember? In a recent Data Warehousing Institute (TDWI) customer survey, 78% of customers said ‘Hadoop complements a DW; it’s rarely a replacement.’ And IDC says Teradata database software grew at 14% last year and 14% the year before. How can you call that slow growth? I wish my retirement funds grew that slow every year. IDC also says analytic databases account for $11 billion in 2012 – that’s just databases, no BI, ETL, hardware, and no services. According to Wikibon, the Hadoop market was around $256 million last year for software AND services. So even if half of that $256M was software revenue, it’s only about 1% of the analytic database software revenue.”

--Willy: “Well, I did do what you told me last time and talked to a Hadoop vendor who told me three of your customers -- A, B, and E-- offloaded data from Teradata to Hadoop. That’s why I said what I said.”

--Dan: “I’m glad you brought that up. All three of those companies offloaded low value data and processing from their Teradata Warehouse to Hadoop back in 2011. They did it so they could put new high value workloads into the data warehouse. Optimizing assets is just common sense for any CIO. But those new applications grew so fast that company A and company E bought huge Teradata system upgrades in 2012 at millions of dollars each. If that’s what Hadoop does to our data warehouses, we need more Hadoop. I encourage you to talk to vendors but when they tell you things like that, check out the other side of the story. Willy, that big white elephant isn’t taking market share or slowing our growth.”

--Willy: “Yellow.”

--Dan: “What?”

--Willy: “You called Hadoop a white elephant. It’s yellow.”

--Dan: “Sorry, Willy. I’m a joker. It’s a congenital disease in my family.”

--Dan: “But on a serious note, my boss was really upset with the statement that ‘Hadoop is a whole new paradigm of analytics.’ Willy, this one hurts. Companies like Teradata and SAS have been in the analytics business for 30 years. The BI/data warehouse community has been doing consumer 360-degree analysis, fraud detection, recommendation engines, risk, and profitability analysis for 20+ years. According to Gartner, ‘There continues to be much hype about the advantages of open source community innovation. In this case, it’s often innovating what has been in SQL since standards like SQL92.’ Copying what databases have done since 1992 is not innovation --it’s 20 years of catching up.”

--Willy: “You don’t like Hadoop do you?”

--Dan: “Actually, I like Hadoop when it’s applied to the right workload --but I’m allergic to hype. You know Willy, Teradata sells Hadoop appliances, so we are committed to its success. At Teradata, we co-invented SQL-H and HCatalog with Hortonworks for high-speed data exchange. We even promote a reference architecture called the Unified Data Architecture with Hadoop smack dab in the middle of it. But back to your point, if you want to see Hadoop innovation, look into YARN and Tez. Those Hortonworks guys are onto something.”

--Willy: “Well, you still have to admit that Hadoop is free where data warehouses cost $20,000 per terabyte. I found that on a dozen blogs and websites.”

--Dan: “Willy, don’t believe everything you hear on the internet. There are websites out there that still think the moon landing was faked and TV wrestling is real. That stuff about Hadoop being free at $1000 a terabyte is self-contradicting. That’s Silly Con Valley hype at its worst. Recently The Data Warehousing Institute said ‘Hadoop is not free, as many people have mistakenly said about it. A number of Hadoop users speaking at recent TDWI conferences have explained that Hadoop incurs substantial payroll costs due to its intensive hand coding (normally done by high-payroll personnel such as data scientists) and its immature, non-productive tools…’ Don’t get me wrong. Some Silicon Valley companies don’t use hype. I’ll also point you to Dr. Shacham -- Chief Data Scientist at PayPal -- who did tests showing that the cost of a query on Hadoop was roughly the same as on Teradata systems. That one’s a stunner!

Plus earlier this summer, Richard Winter, the all-time big data virtuoso, published research showing data warehouses are cheaper than Hadoop for – are you sitting down – queries and analytics. By the way Willy, we just had a ridiculous price reduction on our extreme data appliance that puts us incredibly close to Hadoop’s cost per terabyte.”

--Willy: “OK. OK. I get it. So there is a lot of internet hype about Hadoop. It’s getting so I don’t know who to trust anymore.”

--Dan: “Well, I stick to my suggestion from last month. You should probably talk to vendors first, then talk to Gartner, IDC, The Data Warehousing Institute, Ventana, and then some customers. And don’t forget to give me a call – I can hook you up with our customers who are living with Teradata and Hadoop.”

Later.
--Teradata PR pit boss: “Seems like Willy Everlern is struggling to learn.”
--Dan: “He’s not alone. I’m learning every day – I hope.”
---------
TDWI, Integrating Hadoop Into Business Intelligence and Data Warehousing, March 2013
IDC Worldwide Business Analytics Software 2013–2017 Forecast and 2012 Vendor Shares, June 2013

http://wikibon.org/wiki/v/Hadoop-NoSQL_Software_and_Services_Market_Forecast_2012-2017

Gartner, Merv Adrian, Hadoop Summit Recap Part Two, http://blogs.gartner.com/merv-adrian , July 2013
Dr. Nachum Shacham, Chief Data Scientist, eBay/PayPal, http://www-conf.slac.stanford.edu/xldb2011/talks/xldb2011_tue_1330_Shacham.pdf
Richard Winter, www.wintercorp.com/tcod-report, August 2013

Leading the Pack with Unified Data Architecture

Posted on: January 29th, 2013 by Scott Gnau

 

In the technology game, industry analysts are important players, and some would argue that Gartner is right up there near the top with their Magic Quadrant reports. Those of us who follow Gartner’s Magic Quadrants know the importance of that deceptively simple-looking market research grid. Behind it lies a wealth of knowledge, with uniform criteria that bring useful snapshots of markets and their participants. I am again proud to see that Gartner’s latest MQ covering data warehousing and analytics continues to show Teradata leading the pack for our vision and performance.

Over the years, we’ve shared our vision with Gartner for a future where information is readily collected, processed and integrated in boundless configurations to allow businesses to exploit all of their data to their advantage. With the demonstrated success of data-driven organizations, we are again seeing our vision become a reality for many organizations capturing, analyzing and gaining insights from traditional and new data types in a heterogeneous environment.

This vision aligns perfectly to Gartner’s view of a Logical Data Warehouse. At a high level, Gartner defines the Logical Data Warehouse as an information management architecture where all data, including highly unstructured data, is stored and analyzed. This architecture includes technology approaches like data virtualization, distributed processes and ontological metadata, among other characteristics, as enabling a single version of the truth. In a recent Gartner blog, analyst Mark Beyer says that the “logical data warehouse is the next significant evolution of information integration …this is important. This is big. This is NOT buzz. This is real.”

We at Teradata agree. This is real, and our own vision and R&D investment have closely aligned with this. The fact that our best-in-class systems are available today and have the ability to analyze structured, unstructured and semi-structured data -- or what we call multi-structured data, as an umbrella term -- shows how long indeed we’ve been on this path. The October 2012 release of the Teradata Unified Data Architecture introduced a new framework for business users – very much aligned with Gartner’s vision of the Logical Data Warehouse – to ask any question, against any data, with any analytic, at any time across multiple Teradata systems – analytical platforms and discovery platforms – and open source Hadoop data management platforms. This is the result of years of development and millions of dollars of R&D investment. This investment has enabled us to be the first to deliver a solution like UDA to the market, empowering our customers to change their game by competing on analytics.


We continue to get positive reports from our customers as we allow organizations to deploy, support, manage, and seamlessly access all their data in an integrated and dynamic Teradata Unified Data Architecture. Teradata’s integration of these technologies, which our customers have learned is more than the sum of the individual components, creates real value.

These efforts are all in service to a vision of intelligent systems that leverages the value of data warehouse, data discovery and data staging technology. We believe in the value of open source technology; that’s why the Teradata Unified Data Architecture supports open source Apache Hadoop. The Teradata Unified Data Architecture is further certified with HDP from Hortonworks and enables a host of interoperability features, which allow for the transparent, seamless movement of data in and out of diverse systems for storage, refinement and analysis.

The Teradata Unified Data Architecture indeed represents the new normal in combining systems and approaches. It captures, refines, and stores detailed data in Hadoop. Teradata Aster then performs subsequent analysis for the discovery of new insights. And then, the resulting intelligence is made available by the Teradata Integrated Data Warehouse for use across the enterprise.

The Teradata Unified Data Architecture, with best-in-class technology, provides business users fast and seamless answers to their questions regardless of the type of data analyzed. In the process, we are embodying what Gartner and others value as the leadership in building practical solutions that help businesses derive the best insights possible from all their data, whether big, small, or somewhere in between.

Scott Gnau