Anna Littick and the Unified Data Architecture — Part 2

Posted on October 16th, 2013 by Dan Graham | 1 Comment

Ring ring ringtone.
Dan: “Hello. This is Dan at Teradata. How can I help you today?”

Anna: “Hi Dan. It’s Anna Littick from Sunshine-Stores calling again. Can we finish our conversation?”

Dan: “Oh yeah, hi Anna. Sure. Where did we leave off?”

Anna: “Well, you remember our new CFO – Xavier Money – wants us to move everything to Hadoop because he thinks it’s all free. You and I were ticking through his perceptions.”

Dan: “Yes. I think we got through the first two but not numbers 3 and 4. Here’s what I remember:
1. Hadoop replaces the data warehouse
2. Hadoop is a landing zone and archive
3. Hadoop is a database
4. Hadoop does deep analytics.”

Anna: “Yep. So how do I respond to Xavier about those two?”

Dan: “Well, I guess we should start with ‘what is a database?’ I’ll try to keep this simple. A database has these characteristics:
• High performance data access
• Robust high availability
• A data model that isolates the schema from the application
• ACID properties

There’s a lot more to a database but these are the minimums. High speed is the name of the game for databases. Data has to be restructured and indexed, and queries need a cost-based optimizer, for access to be fast. Hive and Impala do a little restructuring of data but are a long way from sophisticated indexes, partitioning, and optimizers. Those things take many years – each. For example, Teradata Database has multiple kinds of indexes: join indexes, aggregate indexes, hash indexes, and sparse indexes.”
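
The gap Dan is describing is easy to feel even in a toy setting. Here is a minimal Python sketch (nothing Teradata- or Hadoop-specific, just an illustration) contrasting a full scan of unindexed rows with a lookup through a simple in-memory “index”:

```python
import timeit

# A million rows keyed by customer id: compare a full scan to an index lookup.
rows = [{"cust_id": i, "balance": float(i)} for i in range(1_000_000)]
index = {row["cust_id"]: row for row in rows}   # a toy "index" on cust_id

scan = lambda: next(r for r in rows if r["cust_id"] == 999_999)  # read every row
lookup = lambda: index[999_999]                                  # jump straight to it

print("full scan  :", timeit.timeit(scan, number=10))
print("index hit  :", timeit.timeit(lookup, number=10))
```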

Anna: “Ouch. What about the other stuff? Does Hive or Impala have that?”

Dan: “Well, high performance isn’t interesting if the data is not available. Between planned and unplanned downtime, a database has to hit 99.99% uptime or better to be mission critical. That’s roughly 53 minutes of downtime a year. Hundreds of hardware, software, and installation features have to mature to get there. I’m guessing a well-built Hadoop cluster is around 99% uptime. Yet something as simple as an application running out of memory can crash the cluster. There’s a lot of work to be done in Hadoop.”
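
For readers who want to check Dan’s arithmetic, here is a minimal Python sketch of the availability math, using only the uptime percentages quoted in the conversation:

```python
# Allowed downtime per year at a given uptime percentage.
minutes_per_year = 365 * 24 * 60   # 525,600 minutes
for uptime in (0.9999, 0.99):
    downtime = minutes_per_year * (1 - uptime)
    print("%.2f%% uptime allows roughly %.0f minutes of downtime per year"
          % (uptime * 100, downtime))
# 99.99% -> ~53 minutes a year; 99% -> ~5,256 minutes (about 3.7 days)
```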

“Second, isolating the application programs from the schema runs opposite to Hadoop’s strategic direction of schema-on-read. They don’t want fixed data types and data rules enforcement. On the upside this means Hadoop has a lot of flexibility – especially with complex data that changes a lot. On the downside, we have to trust every programmer to validate and transform every data field correctly at runtime. It’s dangerous and exciting at the same time. Schema-on-read works great with some kinds of data, but the majority of data works better with a fixed schema.”
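
To make that schema-on-read burden concrete, here is a minimal Python sketch; the field names and the sample record are invented purely for illustration:

```python
import json
from datetime import datetime

# One raw record as it might land with no schema enforced at load time
# (field names and values are invented for illustration).
raw = '{"account_id": "A-102", "balance": "600.00", "opened": "2013-10-16"}'

def read_account(line):
    rec = json.loads(line)
    # Every program that reads this data repeats this validation and type
    # conversion at runtime; a fixed schema enforces it once, at load time.
    return {
        "account_id": str(rec["account_id"]),
        "balance": float(rec["balance"]),
        "opened": datetime.strptime(rec["opened"], "%Y-%m-%d").date(),
    }

print(read_account(raw))
```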

Anna: “I’ll have to think about that one. I like the ‘no rules’ flexibility but I don’t like having to scrub the incoming data every time. I already spend too much time preparing data for predictive analytics.”

Dan: “Last is the ACID properties. It’s a complex topic you should look at on Wikipedia. It boils down to trusting the data as it’s updated. If a change is made to an account balance, ACID ensures all the changes are applied or none, that no one else can change it at the same time you do, and that the changes are 100% recoverable across any kind of failure. Imagine you and your spouse at two ATMs, each withdrawing $500 when there’s only $600 in the account. The database can’t give both of you $500 – that’s ACID at work. Neither Hadoop, Hive, Impala, nor any other Hadoop project has plans to build the huge ACID infrastructure and become a true database. Hadoop simply isn’t good at updating data in place.”
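
As a toy illustration of the atomicity piece of Dan’s ATM example, here is a minimal Python sketch using SQLite; the account name and amounts are invented, and real isolation between two concurrent ATMs is exactly the extra machinery a full DBMS provides:

```python
import sqlite3

# One shared account with $600, as in the conversation.
conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions by hand
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('joint', 600.0)")

def withdraw(amount):
    """Debit the account only if the balance covers it; all or nothing."""
    try:
        conn.execute("BEGIN IMMEDIATE")  # start a transaction, take the write lock
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = 'joint'").fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = 'joint'", (amount,))
        conn.execute("COMMIT")           # the check and the debit become permanent together
        return True
    except Exception:
        conn.execute("ROLLBACK")         # neither step is applied
        return False

print(withdraw(500))  # True  -- the first $500 comes out, leaving $100
print(withdraw(500))  # False -- the second withdrawal is refused
```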

“According to Curt Monash, ‘Developing a good DBMS requires 5-7 years and tens of millions of dollars. That’s if things go extremely well.’ [1]”

Anna: “OK, Hadoop and Hive and Impala aren’t a database. So what? Who cares what you call it?”

Dan: “Well, a lot of end users, BI tools, ETL tools, and skills expect Hadoop to behave like a database. That’s not fair – it was not built for that purpose. Not being a database means Hadoop lacks a lot of functionality, but it also forces Hadoop to innovate and differentiate on its strengths. Let’s not forget Hadoop’s progress in basic search indexing, archival of cold data, simple reporting at scale, and image processing. We’re at the beginning of a lot of innovation and it’s exciting.”

Anna: “OK. I’ll trust you on that. What about deep analytics? That’s what I care about most.”

Dan: “So Anna, off the record – you being a data scientist and all – do people tease you about your name? I mean, Anna Littick the data scientist? I Googled you and you’re not the only one.”

Anna: “Yes. Some guys around here think it’s funny. Apparently childishness isn’t limited to children. So during meetings I throw words at them like Markov Chains, Neural Networks, and edges in graph partitions. They pretend to understand -- they nod a lot. Those guys never teased me again. [laugh]”

Dan: “Hey, those advanced analytics you mentioned are powerful stuff. You should hear David Simmen talk at our PARTNERS conference on Sunday. He’s teaching about our new graph engine that handles millions of vertices and billions of edges. It sounds like you would enjoy it.”

Anna: “Well, it looks like I have approval to go, especially since PARTNERS is here in Dallas. Enough about me. What about deep analytics in Hadoop?”

Dan: “Right. OK, well first I have to tell you we do a lot of predictive and prescriptive analytics in-database with Teradata. I suspect you’ve been using SAS algorithms in-database already. The parallelism makes a huge difference in accuracy. What you probably haven’t seen is our Aster Database, where you can run map-reduce algorithms under the control of SQL for fast, iterative discovery. It can run dozens of complex analytic algorithms, including map-reduce algorithms, in parallel. And we just added the graph engine I mentioned in our 6.0 release. One thing it does that Hadoop doesn’t: you can use your BI tools, SAS procs, and map-reduce all in one SQL statement. It’s ultra cool.”

Anna: “OK. I think I’ll go to David’s session. But what about Hadoop? Can it do deep analytics?”

Dan: “Yes. Both Aster and Hadoop can run complex predictive and prescriptive analytics in parallel. They can both do statistics, random forests, Markov Chains, and all the basics like naïve Bayes and regressions. If an algorithm is hard to do in SQL, these platforms can handle it.”

Anna [impatient]: “OK. I’ll take the bait. What’s the difference between Aster and Hadoop?”

Dan: “Well, Aster has a database underneath its SQL-MapReduce so you can use the BI tools interactively. There is also a lot of emphasis on behavioral analysis, so the product has things like Teradata Aster nPath time-series analysis to visualize patterns of behavior and detect many kinds of consumer churn events or fraud. Aster has more than 80 algorithms packaged with it as well as SAS support. Sorry, I had to slip that Aster commercial in. It’s in my contract -- sort of. Maybe. If I had a contract.”

Anna: “And what about Hadoop?”

Dan: “Hadoop is more of a do-it-yourself platform. There are tools like Apache Mahout for data mining. It doesn’t have as many algorithms as Aster, so you often find yourself getting algorithms from university research or GitHub and implementing them yourself. Some Teradata customers have implemented Markov Chains on Hadoop because it’s much easier to work with than SQL for that kind of algorithm. So data scientists have more tools than ever with Teradata in-database algorithms, Aster SQL-MapReduce, SAS, Hadoop/Mahout, and others. That’s what our Unified Data Architecture does for you – it matches workloads to the best platform for the task.”
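
To give a taste of that do-it-yourself flavor, here is a minimal Python sketch of a Hadoop Streaming job that counts Markov-chain state transitions. The script name and the input layout (one whitespace-separated event sequence per line) are assumptions for illustration:

```python
#!/usr/bin/env python
"""Count Markov-chain state transitions with Hadoop Streaming.

Assumes each input line is one session: a whitespace-separated sequence of
event names. The same script acts as mapper ("map") or reducer ("reduce").
"""
import sys

def mapper():
    # Emit one "prev,next <tab> 1" record per adjacent pair of events.
    for line in sys.stdin:
        events = line.split()
        for prev, nxt in zip(events, events[1:]):
            print("%s,%s\t1" % (prev, nxt))

def reducer():
    # Streaming delivers mapper output sorted by key, so a running total
    # per transition is enough.
    current_key, total = None, 0
    for line in sys.stdin:
        key, count = line.rstrip("\n").split("\t")
        if key != current_key and current_key is not None:
            print("%s\t%d" % (current_key, total))
            total = 0
        current_key = key
        total += int(count)
    if current_key is not None:
        print("%s\t%d" % (current_key, total))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

You can test it locally with a shell pipeline (cat sessions.txt | ./transitions.py map | sort | ./transitions.py reduce) before submitting it through the Hadoop Streaming jar’s -mapper and -reducer options; a second pass would normalize the counts into transition probabilities.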

Anna: “OK. I think I’ve got enough information to help our new CFO. He may not like me bursting his ‘free-free-free’ monastic chant. But just because we can eliminate some initial software costs doesn't mean we will save any money. I’ve got to get him thinking of the big picture for big data. You called it UDA, right?”

Dan: “Right. Anna, I’m glad I could help, if only just a little. And I’ll send you a list of sessions at Teradata PARTNERS where you can hear from experts about their Hadoop implementations – and Aster. See you at PARTNERS.”

Title | Company | Day | Time | Comment
Aster Analytics: Delivering results with R Desktop | Teradata | Sun | 9:30 | RevolutionR
Do’s and Don’ts of using Hadoop in practice | Otto | Sun | 1:00 | Hadoop
Graph Analysis with Teradata Aster Discovery Platform | Teradata | Sun | 2:30 | Graph
Hadoop and the Data Warehouse: When to use Which | Teradata | Sun | 4:00 | Hadoop
The Voices of Experience: A Big Data Panel of Experts | Otto, Wells Fargo | Wed | 9:30 | Hadoop
An Integrated Approach to Big Data Analytics using Teradata and Hadoop | PayPal | Wed | 11:00 | Hadoop
TCOD: A Framework for the Total Cost of Big Data | WinterCorp | Wed | 11:00 | Costs

[1] Curt Monash, DBMS development and other subjects, March 18, 2013

One Response

  1. Dan Graham

    February 8, 2014

    Your question is not specific, so I will try to select some information sources that cover a wide range of topics:

    Evaluating and Planning for the Real Costs of Big Data
    http://blogs.teradata.com/data-points/evaluating-and-planning-for-the-real-costs-of-big-data/

    Big Data: A Look at the Real Costs
    http://www.asterdata.com/webcasts/big-data-real-costs.html

    Total Cost of Big Data: a CFO’s Lesson from WinterCorp and HortonWorks
    http://blogs.teradata.com/data-points/total-cost-of-big-data/
    White paper: http://www.asterdata.com/big-data-cost/

    Gartner BLOGs
    Hadoop and DI – A Platform Is Not A Solution
    http://blogs.gartner.com/merv-adrian/2013/02/10/hadoop-and-di-a-platform-is-not-a-solution/

    Aspirational Marketing and Enterprise Data Hubs
    http://blogs.gartner.com/merv-adrian/2014/01/17/aspirational-marketing-and-enterprise-data-hubs/

    That Exciting New Stuff? Yeah… Wait Till It Ships.
    http://blogs.gartner.com/merv-adrian/2013/07/13/that-exciting-new-stuff-yeah-wait-till-it-ships/

    Hadoop Summit Recap Part Two – SELECT FROM hdfs WHERE bigdatavendor USING SQL
    http://blogs.gartner.com/merv-adrian/2013/07/15/hadoop-summit-recap-part-two-select-from-hdfs-where-bigdatavendor-using-sql/

    And Teradata
    http://www.teradata.com/

    Hope this helps.

