Hortonworks

Taking Charge of Your Data Lake Destiny

Posted on: April 30th, 2014 by Cesar Rojas

 

One of the most interesting areas of my job is having the opportunity to take an active role in helping shape the future of big data analytics. A significant part of this is the maturation of the open source offerings available to customers and how they can help address today’s analytic conundrums. Customers are constantly looking for new and effective ways to organize their data and they want to build the systems that will empower them to be successful across their organizations. But with the proliferation of data and the rise of data types and analytical models, solving this challenge is becoming increasingly complex.

One of the solutions that has become popular is the concept of a data lake. The idea of the data lake emerged when users were creating new types of data that needed to be captured and exploited across the enterprise. The concept is also tied closely to Apache Hadoop and its ecosystem of open source projects, so, as you can imagine, with two of my main focus areas (big data analytics and Hadoop) being brought together, this is an area to which I pay close attention. Data lakes are designed to tackle some of the emerging big data challenges by offering a new way to organize and build the next generation of systems. They provide a cost-effective and technologically refined way to approach and solve big data challenges. And while data lakes are an important component of the logical data warehouse, because they are designed to give users choices for better managing and utilizing data within their analytical ecosystem, many users are also finding that the data lake is a natural evolution of their current Apache Hadoop ecosystem and their existing data architecture.

Where do we begin? Quite simply, several questions need to be answered before you start down this path. For instance, it’s important to understand how the data lake relates to your existing enterprise data warehouse, how the two work together, and, perhaps most importantly, what best practices should be leveraged to ensure the resulting strategy drives business value.

A recent white paper written by CITO Research and sponsored by Teradata and Hortonworks takes a close look at the data lake and provides answers to all of the above questions, and then some. Without giving away too much of the detail, I thought I would capture a few of the points that impress me most in this paper.

The data lake has come a long way since its initial entry onto the big data scene. Its first iteration included several limitations, making it slightly daunting to general users. The original data lakes were batch-oriented, offering very limited ability for user interaction with the data, and expertise with MapReduce and other scripting and query tools was absolutely necessary for success. Those factors, among others, limited its wider adoption. Today, however, the landscape is changing. With the arrival of Hadoop 2, and more specifically the 2.1 release of the Hortonworks Data Platform, data lakes are evolving. New Hadoop projects brought better resource management and application multi-tenancy, allowing multiple workloads to run on the same cluster and enabling users from different business units within an organization to effectively refine, explore, and enrich data. Today, enterprise Hadoop is a full-fledged data lake, with new capabilities being added all the time.

While the capabilities of the data lake have evolved over the last few years, so has the world of big data. Companies everywhere started creating data lakes to complement the capabilities of their data warehouses, but now they must also tackle creating a logical data warehouse in which the data lake and the enterprise data warehouse can each be maximized individually while still supporting each other in the best way possible.

The enterprise data warehouse plays a critical role in solving big data challenges, and together with the data lake it can deliver real business value. The enterprise data warehouse is a highly engineered, sophisticated system that provides a single version of the truth that can be used over and over again. And, like a data lake, it supports batch workloads. Unlike a data lake, the enterprise data warehouse also supports simultaneous use by thousands of concurrent users performing reporting and analytic tasks.

There are several impressive uses for a data lake, and several beneficial outcomes can result. It is well worth learning more about how data lakes can help you store and process data at low cost, how to create a distributed form of analytics, and how the data lake and the enterprise data warehouse have started to work together as a hybrid, unified system that empowers users to ask questions that can be answered by more data and more analytics with less effort. To start learning about these initiatives, download our whitepaper here.

By Cesar Rojas - bio link 

 

The recent webinar by Richard Winter and Bob Page hammered home key lessons about the cost of workloads running on Hadoop and data warehouses.  Richard runs WinterCorp -- a consulting company that has been implementing huge data warehouses for 20+ years.   Bob Page is Vice President of Products for Hortonworks, and before that he was at Yahoo! and eBay running big data projects.  The webinar explored Richard’s cost model for running various workloads on Hadoop and an enterprise data warehouse (EDW).  Richard built the cost model during a consulting engagement with a marketing executive of a large financial services company who was launching a big data initiative.  She had people coming to her saying “you should do it in Hadoop” and others saying “you should do it in the data warehouse.”  Richard’s cost model helped her settle some debates.

The Total Cost of Data (TCOD) analysis results are the basis for the webinar. What separates Richard’s cost framework from most others is that it includes more than just upfront system costs. The TCOD model also includes five years of programmer labor, data scientist labor, end user labor, maintenance and upgrades, plus power and cooling. Richard said there are 60 cost metrics in the model. He recommends that companies download the TCOD spreadsheet and insert actual local costs, since system and labor costs differ by city and country.
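
To make the shape of such a cost roll-up concrete, here is a minimal Python sketch assuming a simplified model with the cost categories named above. It is not the actual 60-metric TCOD spreadsheet, and every dollar figure is a placeholder to be replaced with your own local costs.

```python
# Minimal sketch of a TCOD-style five-year cost roll-up.
# The categories mirror the ones mentioned above; the real TCOD
# spreadsheet tracks ~60 metrics. All dollar figures are placeholders.

YEARS = 5

cost_inputs = {
    "system_acquisition": 1_000_000,            # upfront hardware + software (one-time)
    "programmer_labor_per_year": 400_000,
    "data_scientist_labor_per_year": 250_000,
    "end_user_labor_per_year": 150_000,
    "maintenance_and_upgrades_per_year": 120_000,
    "power_and_cooling_per_year": 60_000,
}

def total_cost_of_data(inputs: dict, years: int = YEARS) -> float:
    """Sum one-time costs plus recurring costs over the planning horizon."""
    one_time = inputs["system_acquisition"]
    recurring = sum(v for k, v in inputs.items() if k.endswith("_per_year"))
    return one_time + recurring * years

if __name__ == "__main__":
    print(f"Five-year cost estimate: ${total_cost_of_data(cost_inputs):,.0f}")
```

Plugging in local labor rates and real system quotes, as Richard recommends, is what makes any comparison between platforms meaningful.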

For the Hadoop data lake workload (a.k.a. the data refinery), labor costs were fairly close between Hadoop and the data warehouse, while system costs favored Hadoop. In the case of the data warehouse workload, the data warehouse system cost was high (remember the power and cooling?) while the Hadoop labor cost structure skyrocketed. Long story short: Hadoop as a data lake is lower cost than a data warehouse, and the data warehouse is lower cost for complex queries and analytics.

There was general agreement that Hadoop is a cost-effective platform for ETL work – the staging of raw data and its transformation into refined value. But when asked “Should we offload ELT/ETL to Hadoop?” Bob Page said:

“I think it’s going to be data dependent. It also depends on what the skills are in the organization. I experienced it myself when I was running big data platforms. If there is a successful implementation on the EDW today, there may be a couple of reasons why it makes sense to keep it there. One reason is there may be years and years of business logic encoded, debugged, and vetted. Moving that to another platform with its inherent differences, you might ask ‘what’s the value of doing that?’ It may take a couple of years to get that right, and in the end all you have done is migrate to another platform. I would prefer to invest those resources in adding additional value to the organization rather than moving sideways to another platform.”

 


When the data warehouse workload was costed out, Hadoop’s so-called $1,000 per terabyte turned out to be an insignificant part of the total. However, Hadoop’s cost skyrocketed because thousands of queries would have to be manually coded by high-priced Hadoop and moderately priced Java programmers over five years. The OPEX side of the pie chart was huge when the data warehouse workload was applied to Hadoop.

Richard explained:

“The total cost of queries is much lower on the EDW than on Hadoop. SQL is a declarative language – you only have to tell it what you want. In Hadoop you use a procedural language. In Hadoop you have to tell the system how to find the data, how to bring it together, and what manipulations are needed to deliver the results. With the data warehouse, there is a sophisticated query optimizer that figures all that out automatically for you. The cost of developing the query on the data warehouse is lower because of the automation provided.”
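
To illustrate the difference Richard is describing, here is a minimal sketch (my own, not from the webinar) of one hypothetical question answered both ways: a few lines of declarative SQL that a warehouse optimizer plans automatically, versus a hand-coded procedural equivalent in Python. The table and file names are made up.

```python
# Declarative: the warehouse's query optimizer decides how to execute this.
#   SELECT region, SUM(amount) AS revenue
#   FROM sales
#   GROUP BY region;

# Procedural: the developer spells out how to scan, group, and aggregate.
import csv
from collections import defaultdict

def revenue_by_region(path: str) -> dict:
    """Hand-coded group-by-and-sum over a flat file of sales records."""
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):                      # scan and parse each record
            totals[row["region"]] += float(row["amount"])  # aggregate per key
    return dict(totals)

# Example (hypothetical file): revenue_by_region("sales.csv")
```

Multiply the hand-coded version by thousands of queries over five years and the labor gap in Richard’s model becomes easy to see.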

 

Given the huge costs for Hadoop carrying a data warehouse workload, I asked Bob if he agreed with Richard’s assessment. “Does it pass the sniff test?” I asked. Bob Page replied:

“We don’t see anybody today trying to build an EDW with Hadoop. This is a capability issue not a cost issue. Hadoop is not a data warehouse. Hadoop is not a database. Comparing these two for an EDW workload is comparing apples to oranges. I don’t know anybody who would try to build an EDW in Hadoop. There are many elements of the EDW on the technical side that are well refined and have been for 25 years. Things like workload management, the way concurrency works, and the way security works -- there are many different aspects of a modern EDW that you are not going to see in Hadoop today. I would not see these two as equivalent. So –no– it doesn’t pass the sniff test.”

Bob’s point – in my opinion – is that the Hadoop-as-EDW cost model is invalid since Hadoop is not designed to handle EDW workloads. Richard said he “gave Hadoop the benefit of the doubt,” but I suspect the comparison was baked into his consulting contract with the marketing executive mentioned earlier. Ultimately, Richard and Bob agree, just from different angles.

There are a lot of press articles and zealots on the web who will argue with these results. But Richard and Bob have hands-on credentials far beyond most people’s. They have worked with dozens of big data implementations from 500 TB to tens of petabytes. Please spend the time to listen to their webinar for an unbiased view. The biased view – me – didn’t say all that much during the webinar.

Many CFOs and CMOs are grappling with the question “When do we use Hadoop and when should we use the data warehouse?” Pass them the webinar link, call Richard, or call Bob.

 

Total Cost of Data Webinar

Big Data—What Does It Really Cost? (white paper)

The Real Cost of Big Data (Spreadsheet)

TCOD presentation slides (PDF)

Big Apple Hosts the Final Big Analytics Roadshow of the Year

Posted on: November 26th, 2013 by Teradata Aster

 

Speaking of ending things on a high note, on December 6th New York City will play host to the final event in the Big Analytics 2013 Roadshow series. Big Analytics 2013 New York is taking place at the Sheraton New York Hotel and Towers in the heart of Midtown on bustling 7th Avenue.

As we reflect on the illustrious journey of the Big Analytics 2013 Roadshow, which kicked off in San Francisco and traveled through major international destinations including Atlanta, Dallas, Beijing, Tokyo, and London before culminating in the Big Apple, it truly encapsulated today’s appetite for collecting, processing, understanding, and analyzing data.

Big Analytics Atlanta 2013 photo

Big Analytics Roadshow 2013 stops in Atlanta

Drawing business & technical audiences across the globe, the roadshow afforded the attendees an opportunity to learn more about the convergence of technologies and methods like data science, digital marketing, data warehousing, Hadoop, and discovery platforms. Going beyond the “big data” hype, the event offered learning opportunities on how technologies and ideas combine to drive real business innovation. Our unyielding focus on results from data is truly what made the events so successful.

Continuing the rich lineage of delivering quality Big Data information, the New York event promises to pack a tremendous amount of Big Data learning and education. The keynotes for the event include such industry luminaries as Dan Vesset, Program VP of Business Analytics at IDC; Tasso Argyros, Senior VP of Big Data at Teradata; and Peter Lee, Senior VP of Tibco Software.

Photo of the Teradata Aster team in Dallas

Teradata team at the Dallas Big Analytics Roadshow


The keynotes will be followed by three tracks covering Big Data Architecture, Data Science & Discovery, and Data-Driven Marketing. Each of these tracks will feature industry luminaries like Richard Winter of WinterCorp, John O’Brien of Radiant Advisors, and John Lovett of Web Analytics Demystified. They will be joined by vendor presentations from Shaun Connolly of Hortonworks, Todd Talkington of Tableau, and Brian Dirking of Alteryx.

As with every Big Analytics event, it presents an exciting opportunity to hear firsthand from leading organizations like Comcast, Gilt Groupe, and Meredith Corporation on how they are using Big Data Analytics & Discovery to deliver tremendous business value.

In summary, the event promises to be nothing less than the Oscars of Big Data and will bring together the who’s who of the Big Data industry. So, mark your calendars, pack your bags and get ready to attend the biggest Big Data event of the year.

 

About one year ago, Teradata Aster launched a powerful new way of integrating a database with Hadoop. With Aster SQL-H™, users of the Teradata Aster Discovery Platform got the ability to issue SQL and SQL-MapReduce® queries directly on Hadoop data as if that data had been in Aster all along. This level of simplicity and performance was unprecedented, and it enabled BI & SQL analysts who knew nothing about Hadoop to access Hadoop data and discover new information through Teradata Aster.

This innovation was not a one-off. Teradata has put forward the most complete vision for a data and analytics architecture in the 21st century. We call that the Unified Data Architecture™. The UDA combines Teradata, Teradata Aster & Hadoop into a best-of-breed, tightly integrated ecosystem of workload-specific platforms that provide customers the most powerful and cost-effective environment for their analytical needs. With Aster SQL-H™, Teradata provided a level of software integration between Aster & Hadoop that was, and still is, unchallenged in the industry.

 

Teradata Unified Data Architecture™ image

Teradata Unified Data Architecture™

Today, Teradata makes another leap in making its Unified Data Architecture™ vision a reality. We are announcing SQL-H™ for Teradata, bringing the best SQL engine for data warehousing and analytics to Hadoop. From now on, enterprises that use Hadoop to store large amounts of data will be able to use Teradata's analytics and data warehousing capabilities to query Hadoop data directly and securely through ANSI-standard SQL and BI tools, by leveraging the open source Hortonworks HCatalog project. This is fundamentally the best and tightest integration between a data warehouse engine and Hadoop that exists in the market today. Let me explain why.
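
To give a feel for what this means for an analyst, here is a minimal sketch of the experience, assuming a hypothetical ODBC data source and hypothetical table names. The exact way a Hadoop/HCatalog table is exposed depends on your SQL-H configuration, so treat this as an illustration of plain ANSI SQL access rather than product documentation.

```python
# Sketch of the analyst experience: ordinary ANSI SQL issued from a
# scripting or BI tool, joining warehouse data with Hadoop-resident data.
# The DSN and table names below are hypothetical; how a Hadoop/HCatalog
# table is surfaced depends on the SQL-H setup in your environment.
import pyodbc

conn = pyodbc.connect("DSN=tdwh")   # hypothetical ODBC data source name
sql = """
    SELECT c.region,
           COUNT(*) AS clicks
    FROM   web_clicks_hadoop AS w        -- data living in Hadoop (via HCatalog)
    JOIN   customers         AS c        -- table in the data warehouse
      ON   w.customer_id = c.customer_id
    GROUP  BY c.region
"""
for region, clicks in conn.cursor().execute(sql):
    print(region, clicks)
conn.close()
```

The point is that the analyst writes ordinary SQL and uses ordinary BI tools; where the bytes physically live in HDFS is handled underneath.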

It is interesting to consider Teradata's approach versus the alternatives. If one wants to execute SQL on Hadoop with the intent of building Data Warehouses out of Hadoop data, there are not many realistic options. Most databases have very poor integration with Hadoop and require Hadoop experts to manage the overall system - not a viable option for most Enterprises due to cost. SQL-H™ removes this requirement for Teradata/Hadoop deployments. Another "option" is the crop of SQL-on-Hadoop tools that have started to emerge; but unfortunately, they are about a decade away from becoming sufficiently mature to handle true Data Warehousing workloads. Finally, the approach of taking a database and shoving it inside Hadoop has significant issues, since it suffers from the worst of both worlds: Hadoop activity has to be limited so that it doesn't disrupt the database, data is duplicated between HDFS and the database store, and the performance of the database is lower than that of a stand-alone version.

In contrast, a Teradata/Hadoop deployment with SQL-H™ offers the best of both worlds: unprecedented performance and reliability in the Teradata layer, seamless BI & SQL access to Hadoop data via SQL-H™, and a Hadoop cluster that is free to perform data processing tasks at full efficiency.

Teradata is committed to being the strategic advisor of the Enterprise when it comes to Data Warehousing and Big Data. Through its Unified Data Architecture™ and today's announcement of Teradata SQL-H™, it provides even more performance, flexibility, and cost-effective options to Enterprises eager to use data as a competitive advantage.

Big Insights from Big Analytics Roadshow

Posted on: January 25th, 2013 by Teradata Aster

 

Last month in New York we completed the 4th and final event in the Big Analytics 2012 roadshow. This series of events shared ideas on practical ways to address the big data challenge in organizations and change the conversation from “technology” to “business value”. In New York alone, 500 people attended from across both business and IT, and we closed out the event with two speaker panels. The data science panel was, in my opinion, one of the most engaging and interesting panels I’ve ever seen at an event like this. The topic was whether organizations really need a data scientist (and what’s different about that skill set compared to other analytic professionals). Mike Gualtieri from Forrester Research did a great job leading and prodding the discussion.

Overall, these events were a great way to learn and network. The events had great speakers from cutting-edge companies and universities, as well as industry thought leaders, including LinkedIn, DJ Patil, Barnes & Noble, Razorfish, Gilt Groupe, eBay, Mike Gualtieri from Forrester Research, Wayne Eckerson, and Mohan Sawhney from the Kellogg School of Management.

As an aside, I’ve long observed a historic disconnect between marketing groups and the IT organizations and data warehouses that support them. I noticed this first when I worked at Business Objects, where very few reporting applications ever included Web clickstream data. The marketing department always used a separate tool or application like WebSideStory (now part of Adobe) to handle this. A bridge is now being built to connect these worlds, both through technology that can handle web clickstream and other customer interaction data, and through new analytic techniques that make it easier for marketing and business analysts to understand their customers more intimately and serve them a more relevant experience.

We ran a survey at the events, and I wanted to share some top takeaways. The events were split into business and technical tracks with themes of “data science” and “digital marketing”, so the survey data lets us compare the responses of attendees who were more interested in the technical content with those drawn to the business content. The survey data includes responses from 507 people in San Francisco, 322 in Boston, 441 in Chicago, and 894 in New York City, for a total of 2,164 respondents.

You can get the full set of graphs here, but here are a couple of my own observations / conclusions in looking at the data:

1)      “Who is talking about big data analytics in your organization?” - IT and Marketing were by far the largest responses, with nearly 60% of IT organizations and 43% of marketing departments talking about it. New York had a slightly higher share of CIOs and CEOs talking about big data, at 23% and 21%, respectively.

Survey Data: Figure 1

2)      “Where is big data analytics in your company?” - Across all cities, “customer interactions in Web/social/mobile” was the biggest area of big data analytics at 62%. With all the hype around machine/sensor data, it was surprising that it was being discussed in only 20% of organizations. Since web servers and mobile devices are machines, it would have been interesting to see how the “machine-generated data” responses would have looked if we had taken the more specific example of customer interactions away.

Survey Data: Figure 2

3)      This chart is a more detailed breakdown of the areas where big data analytics is found, broken down by city. NYC has a few more “other” responses. Some of the “other” answers in NYC included:

  1. Claims
  2. Client Data Cloud
  3. Development and Data Center Systems
  4. Customer Solutions
  5. Data Protection
  6. Education
  7. Financial Transactions
  8. Healthcare data
  9. Investment Research
  10. Market Data
  11. Predictive Analytics (sales and servicing)
  12. Research
  13. Risk management / analytics
  14. Security

Survey Data: Figure 3

4)      “What are the greatest Big Analytics application opportunities for businesses today?” - On average, general “data discovery or data science” was highest at 72%, with “digital marketing optimization” second at just under 60% of respondents. In New York, “fraud detection and prevention”, at 39%, was slightly higher than in other cities, perhaps tied to the number of financial institutions in attendance.

Survey Data: Figure 4

In summary, there are lots of applications for big data analytics, but it is important to have a discovery platform that supports iterative exploration of ALL types of data and can serve both business/marketing analysts and savvy data scientists. The divide between business groups like marketing and IT is closing. Marketers are becoming more technically savvy and are among the most demanding users of analytic solutions that can harness the deluge of customer interaction data. They need to partner closely with IT to architect the right solutions that tackle “big analytics” and provide the right toolsets for self-service access to this information without always requiring developer or IT support.

We are planning to sponsor the Big Analytics roadshow again in 2013 and take it international, as well. If you attended the event and have feedback or requests for topics, please let us know. I hear that there will be a “call for papers” going out soon. You can view the speaker bios & presentations from the Big Analytics 2012 events for ideas.

Announcing Teradata Aster Big Analytics Appliance

Posted on: October 17th, 2012 by Teradata Aster

 

“Big data” has always been a favorite subject of discussion among the Aster Data team. We've been talking about big data at least since 2009, long before the term became burning-hot. The big data hype has confused many organizations (and vendors) in the market about the best technology or method to solve their analytical business problems.

However, our vision hasn't changed from the time we founded the company in 2005 to today, when we are part of the Teradata family. Teradata Aster continues to lead the market with technology innovations and reference architectures that provide clear guidance and deliver significant business value to our customers.

Today, we are pushing the limits of analytical technology once more by launching the Teradata Aster Big Analytics Appliance. The Big Analytics Appliance is a unique machine that can help enterprises see their business in high definition. By harnessing all existing and new data types in the enterprise, we enable organizations to leverage our powerful SQL-MapReduce® framework and business-ready analytics and apps that solve specific business problems in marketing attribution, fraud detection, graph analysis, pattern analysis, and much more. It unleashes the creativity of bright analysts to discover new insights that help their organizations grow revenue and create sustainable competitive advantage.
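
As a rough illustration of what pattern analysis over clickstream data involves, here is a conceptual Python sketch (not Aster's SQL-MapReduce syntax) that finds customers whose event stream shows a search followed later by a purchase; the data and function names are hypothetical.

```python
# Conceptual sketch of path/pattern analysis over clickstream events,
# grouped per customer. Aster's packaged SQL-MapReduce apps express this
# kind of logic as SQL functions rather than hand-written code.
from collections import defaultdict

events = [
    # (customer_id, timestamp, action) -- toy data
    ("c1", 1, "search"), ("c1", 2, "view"), ("c1", 3, "purchase"),
    ("c2", 1, "view"),   ("c2", 2, "search"),
]

def customers_matching_pattern(events, first="search", then="purchase"):
    """Return customers with a `first` event followed later by a `then` event."""
    by_customer = defaultdict(list)
    for cust, ts, action in events:
        by_customer[cust].append((ts, action))
    matches = set()
    for cust, seq in by_customer.items():
        seq.sort()                       # order each customer's events by time
        seen_first = False
        for _, action in seq:
            if action == first:
                seen_first = True
            elif action == then and seen_first:
                matches.add(cust)
                break
    return matches

print(customers_matching_pattern(events))   # {'c1'}
```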

So what is the Big Analytics Appliance? It's five things in one box:

  1. Aster + Apache Hadoop (100% open source via the Hortonworks HDP distribution), fully integrated in one box
  2. ANSI-standard SQL and next-generation MapReduce, fully integrated
  3. More than 50 ready-to-use MapReduce apps, to deliver immediate business value
  4. Full ecosystem connectivity for both Aster and Hadoop, with BI, ETL, and other existing IT systems
  5. The latest-generation, most efficient hardware platform, specifically optimized for Aster, Hadoop, and Big Analytics

Loyal to our Stanford roots, the appliance comes in Cardinal-red color!

Teradata Aster Big Analytics Appliance

The Big Analytics Appliance packs a long list of essential and unique technologies, including:

  • SQL-MapReduce®, the industry's only true SQL/MapReduce integration
  • SQL-H™, the industry's only ANSI-standard SQL and Hadoop integration
  • Teradata Viewpoint, the most advanced database monitoring platform, now extended to Aster and Hadoop
  • Teradata TVI, sophisticated hardware support and failure-prevention software, now ported to Hadoop as well as to Aster
  • InfiniBand network interconnect, which makes ultra-high-performance connectivity between Aster and Hadoop, as well as scalability, a non-issue
  • Small-form-factor disk drives and dense enclosures, which make this appliance one of the densest and most space-efficient big data platforms on the market

And, of course, everything in this appliance is packaged, integrated, pre-tested and supported by Teradata - the most trusted brand in data management and analytics.

I also want to take a moment to talk about our Unified Data Architecture vision for the enterprise. While most vendors out there talk about big data at a very high level without explaining where it fits and how it relates to traditional technologies like data warehousing, we decided to do the hard work of figuring out how different technologies complement each other and for what purpose. The result is the diagram below, which showcases how Teradata, Aster & Hadoop can work in tandem to provide a complete data solution for enterprise environments:

 

Teradata Unified Data Architecture

We also went one step further and now have a matrix that explains which technology (or technologies) is most appropriate for a given workload/use case and a specific type of data. The result of that exercise is below:

Processing as a Function of Schema Requirements by Data Type

When To Use Which Technology? The best approach by workload and data type

If you want to know more about our Unified Data Architecture vision, read the whitepaper we co-authored with Hortonworks, or feel free to contact us and we'll be happy to discuss this concept with you and how it would fit into your environment.

Through the tight integration of Aster and Hadoop, the new Big Analytics Appliance addresses a large part of the Unified Data Architecture; and via the Teradata-Aster and Teradata-Hadoop connectors, Teradata now has all the necessary pieces to help enterprises extract the maximum business value from all their data and execute on their Big Data vision. At Aster, just like at Teradata, we are committed to continuously providing the best innovations to give our customers the power to make the best decisions possible.

P.S. If you want to try out Aster without ordering a full Aster box, we now allow you to download an Aster virtual appliance! Go give it a try: http://www.asterdata.com/AsterExpress