Aster on Hadoop is Hadoop for Everyone!

Posted on: October 19th, 2015 by John Thuma


One of the biggest announcements at Teradata Partners 2015 is that Aster will run on Hadoop. Many of our customers have already invested in a Hadoop data lake and want to do more than just store data. Storing data is helpful but not all that interesting. What if you could easily do advanced analytics without having to move data out of the lake? What if you had the power of Aster’s multi-genre analytics running on Hadoop? This is exactly what Aster Analytics on Hadoop is all about.

This announcement is a very exciting prospect for some but may strike fear into others. In this post, I will entertain some of the interesting possibilities of bringing these technologies together, and I hope to allay some fears along the way.

Aster Brings Multi-Genre Analytics to Hadoop

Almost every day I hear about a new Hadoop project or offering. That means a new approach, a new tool to learn, and usually a lot of programming. With Aster, you have a variety of advanced analytics at your fingertips, ready to take advantage of your data lake. With Aster and its plug-and-play SNAP framework, analysts and data scientists can draw on a range of analytics delivered through a common optimizer, executor, and unified interface. Aster offers many different genres of analytics: ANSI SQL, Machine Learning, Text, Graph, Statistics, Time Series, and Path and Pattern Analysis. Aster on Hadoop is a big win for data scientists, as well as for people who already know and love Aster.

Looks and Feels Just Like Aster

For those who know Aster, Aster on Hadoop might sound daunting, but don’t fret: everything works the same. You have the same statement interface (‘SELECT * FROM nPath…’). You still have ACT, ncluster_loader, ncluster_export, and the Aster Management Console. You can still run ANSI SQL queries and connect to disparate data sources through QueryGrid and SQL-H. AppCenter allows anyone to perform advanced analytics using a simple web interface, and the Aster Development Environment enables programmers to build their own custom SQL-MR and SQL-GR functions. The only difference is that it is all running inside Hadoop, enabling a whole new group of people to participate in the Hadoop experience. If you have made a large investment in Hadoop and want to exploit the data located there, then Aster on Hadoop is for you.
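For readers new to SQL-MR, here is a hedged sketch of what an nPath call looks like. The table (`clickstream`) and its columns (`userid`, `ts`, `pagetype`) are hypothetical, but the ON / PARTITION BY / ORDER BY / MODE / PATTERN / SYMBOLS clause structure follows the general shape of Aster's documented nPath interface; the query is held in a Python string purely for illustration.

```python
# A hedged sketch of an Aster SQL-MR nPath query. The table name
# (clickstream) and columns (userid, ts, pagetype) are hypothetical;
# the clause structure mirrors the general nPath form.
npath_query = """
SELECT userid, session_path
FROM nPath(
    ON clickstream
    PARTITION BY userid
    ORDER BY ts
    MODE (NONOVERLAPPING)
    PATTERN ('H.S*.C')
    SYMBOLS (
        pagetype = 'home'     AS H,
        pagetype = 'search'   AS S,
        pagetype = 'checkout' AS C
    )
    RESULT (
        FIRST(userid OF H) AS userid,
        ACCUMULATE(pagetype OF ANY(H, S, C)) AS session_path
    )
)
""".strip()

print(npath_query.splitlines()[0])  # → SELECT userid, session_path
```

A query like this finds, for each user, home-to-search-to-checkout sequences without leaving SQL, which is exactly the kind of path and pattern analysis the genre list above refers to.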

Aster on Hadoop: Adaptable Not Invasive

One of the biggest complaints I hear from clients is, “We built a data lake and we want to do analytics, but it’s too hard.” Aster is adaptable to your Hadoop environment and the data you’ve landed there. Aster on Hadoop also means no new appliance; no need to find room in your data center to park a new rack of Aster. There’s no data movement across platforms or across the network; you process data right where it is. Aster on Hadoop runs natively inside Hadoop so you have access to HDFS file formats and a variety of connectors to other JDBC/ODBC compliant data sources. Staff who know ANSI SQL are perfectly positioned to use Aster on Hadoop, and with a little training, they’ll be performing advanced analytics in no time.


Organizations have made huge strides and investments in their Hadoop ecosystem and many are using it as a repository for big data, but that’s not enough. Organizations rightly want to exploit the data contained in Hadoop to gain new insights. Today Aster is being used to solve real world business problems through its multi-genre analytic capabilities. Aster on Hadoop will lower the barriers to entry. It’s a big step in realizing real business value from Hadoop and finally achieving a positive ROI. If you’re an existing Aster client, there’s no need to worry: it all works the same. Teradata Aster on Hadoop democratizes analytics and brings solution freedom to Hadoop! It’s Hadoop for the rest of us.


Teradata Uses Open Source to Expand Access to Big Data for the Enterprise

Posted on: September 30th, 2015 by Data Analytics Staff


By Mark Shainman, Global Program Director, Competitive Programs

Teradata’s announcement of the accelerated release of enterprise-grade ODBC/JDBC drivers for Presto opens up an ocean of big data on Hadoop to the existing SQL-based infrastructure. For companies seeking to add big data to their analytical mix, easy access through Presto can solve a variety of problems that have slowed big data adoption. It also opens up new ways of querying data that were not possible with some other SQL on Hadoop tools. Here’s why.

One of the big questions facing those who toil to create business value out of data is how the worlds of SQL and big data come together. After the first wave of excitement about the power of Hadoop, the community quickly realized that because of SQL’s deep and wide adoption, Hadoop must speak SQL. And so the race began. Hive was first out of the gate, followed by Impala and many others. The goal of all of these initiatives was to make the repository of big data that was growing inside Hadoop accessible through SQL or SQL-like languages.

In the fall of 2012, Facebook determined that none of these solutions would meet its needs, and created Presto as a high-performance way to run SQL queries against data in Hadoop. By 2013, Presto was in production, and it was released as open source in November of that year.

In 2013, Facebook found that Presto was faster than Hive/MapReduce for certain workloads, although there are many efforts underway in the Hive community to increase its speed. Facebook achieved these gains by bypassing the conventional MapReduce programming paradigm and creating a way to interact with data in HDFS, the Hadoop file system, directly. This and other optimizations at the Java Virtual Machine level allow Presto not only to execute queries faster, but also to use other stores for data. This extensibility allows Presto to query data stored in Cassandra, MySQL, or other repositories. In other words, Presto can become a query aggregation point, that is, a query processor that can bring data from many repositories together in one query.
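To make the "query aggregation point" idea concrete, here is a hedged example of Presto's catalog.schema.table addressing, which lets a single query join data from several connectors. The catalog, schema, and table names below are hypothetical; the query is held in a Python string purely for illustration.

```python
# A hedged sketch of a federated Presto query. The catalog/schema/table
# names (hive.web.clicks, cassandra.prod.users, mysql.crm.accounts) are
# hypothetical; the catalog.schema.table addressing is Presto's.
federated_query = """
SELECT u.user_id,
       a.segment,
       count(*) AS click_count
FROM hive.web.clicks AS c
JOIN cassandra.prod.users AS u ON c.user_id = u.user_id
JOIN mysql.crm.accounts AS a ON u.account_id = a.account_id
GROUP BY u.user_id, a.segment
""".strip()

# Each table lives in a different backing store, yet Presto plans and
# executes the join as one query.
print(federated_query.splitlines()[0])  # → SELECT u.user_id,
```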

In June 2015, Teradata announced a full embrace of Presto. Teradata would add developers to the project, add missing features both as open source and as proprietary extensions, and provide enterprise-grade support. This move was the next step in Teradata’s effort to bring open source into its ecosystem. The Teradata Unified Data Architecture provides a model for how traditional data warehouses and big data repositories can work together. Teradata has supported integration of open source first through partnerships with open source Hadoop vendors such as Hortonworks, Cloudera, and MapR, and now through participation in an ongoing open source project.

Teradata’s embrace of Presto provided its customers with a powerful combination. Through Teradata QueryGrid, analysts can use the Teradata Data Warehouse as a query aggregation point and gather data from Hadoop systems, other SQL systems, and Presto. The queries in Presto can aggregate data from Hadoop, but also from Cassandra and other systems. This powerful capability allows Teradata’s Unified Data Architecture to provide data access across a broad spectrum of big data platforms.

Providing Presto support for mainstream BI tools required two things: ANSI SQL support and ODBC/JDBC drivers. Much of the world of BI access works through BI toolsets that understand ANSI SQL. Tools like QlikView, MicroStrategy, and Tableau allow users to easily query large datasets and visualize the data without having to hand-write SQL statements, opening up data access and data analysis to a much larger number of users. Robust BI tool support is critical for broader adoption of Presto within the enterprise.

For this reason, ANSI SQL support is crucial to making the integration and use of BI tools easy. Many other SQL-on-Hadoop projects offer limited SQL support or rely on proprietary SQL-like languages. Presto is not one of them. To meet Facebook’s needs, its SQL support had to be strong and conform to ANSI standards, and Teradata’s joining the project will make Presto’s SQL scope and support stronger still.

The main way that BI tools connect and interact with databases and query engines is through ODBC/JDBC drivers. For the tools to communicate well and perform well, these drivers have to be solid and enterprise class. That’s what yesterday’s announcement is all about.

Teradata has listened to the needs of the Presto community and accelerated its plans for adding enterprise-grade ODBC/JDBC support to Presto. In December, Teradata will make available a free, enterprise class, fully supported ODBC driver, with a JDBC driver to follow in Q1 2016. Both will be available for download on Teradata.com.

With ODBC/JDBC drivers in place and the ANSI SQL support that Presto offers, anyone using modern BI tools can access data in Hadoop through Presto. Of course, certification of the tools will be necessary for full functionality to be available, but with the drivers in place, access is possible. Existing users of Presto, such as Netflix, are extremely happy with the announcement. As Kurt Brown, Director, Data Platform at Netflix put it, “Presto is a key technology in the Netflix big data platform. One big challenge has been the absence of enterprise-grade ODBC and JDBC drivers. We think it’s great that Teradata has decided to accelerate their plans and deliver this feature this year.”
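As a rough sketch of what driver-based access looks like from a script (the same path a BI tool takes), the snippet below uses pyodbc against a hypothetical DSN named "Presto". The DSN name, host, port, and connection-string keywords are assumptions (they depend entirely on how the driver is installed and configured on a given machine), so treat this as an illustration, not a reference.

```python
# Hedged sketch: reaching Presto over ODBC from Python. The DSN name
# ("Presto"), host, port, and connection-string keywords are hypothetical
# and depend on the driver's actual configuration.

def odbc_connection_string(dsn="Presto", host="presto-coord.example.com", port=8080):
    """Assemble a key=value ODBC connection string (format is driver-specific)."""
    return f"DSN={dsn};Host={host};Port={port}"

def query_presto(sql, **conn_kwargs):
    """Run a query through the ODBC driver and return all rows."""
    import pyodbc  # real ODBC binding; imported lazily so the sketch loads anywhere
    conn = pyodbc.connect(odbc_connection_string(**conn_kwargs), autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()

print(odbc_connection_string())  # → DSN=Presto;Host=presto-coord.example.com;Port=8080
```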

Enterprise-ready Hadoop, Now Available as an Appliance

Posted on: September 28th, 2015 by Guest Blogger


By: Clarke Patterson, senior director of product marketing, Cloudera

Early this summer, Teradata and Cloudera jointly announced the Teradata Appliance for Hadoop with Cloudera, an engineered, ready-to-run appliance that comes with enterprise-ready Cloudera Enterprise, in addition to our existing software integrations.

Today, at Strata + Hadoop World in New York, we are excited to announce that customers can now order the Teradata Appliance for Hadoop with Cloudera.

Over the last couple of years, we have certainly seen the maturation of Hadoop and the shift from using Hadoop as a proof-of-concept technology to an enterprise-ready platform. However, the time, skill sets, and resources needed are hard to come by, and not every organization is able to hire the best talent in the market to plan, deploy, and manage Hadoop clusters, let alone support and maintain the platform post-production.

The Teradata Appliance for Hadoop with Cloudera is built to satisfy the need to stand up a Hadoop cluster quickly and cost-effectively. Having an appliance allows organizations to simplify and accelerate the cluster deployment, enabling customers to focus their IT resources on fine-tuning the infrastructure to deliver business value, rather than investing valuable resources in the details of deployment, management, and support of the platform.

In addition to the benefits of an appliance form-factor, the Teradata Appliance for Hadoop with Cloudera also delivers all the benefits of enterprise-ready Hadoop with Cloudera Enterprise:

  • Enterprise security and governance for all mission-critical workloads – With Apache Sentry and Cloudera Navigator, Cloudera Enterprise provides multiple layers of security and governance, built to maintain the business agility and flexibility that Hadoop provides while delivering the controls necessary to meet stringent security regulations and requirements. Compliance-ready at the core, Cloudera Enterprise is the only distribution that is fully PCI-certified.
  • Industry-leading management and support – Cloudera Manager features a best-in-class holistic interface that provides end-to-end system management and zero-downtime rolling upgrades. Combining the power of Cloudera Manager with Teradata Viewpoint and Teradata Vital Infrastructure, the Teradata Appliance for Hadoop with Cloudera provides intuitive tools for centralized management with powerful capabilities, even as the system scales.
  • Built on open standards – Cloudera is the leading open source Hadoop contributor, having added more major, enterprise-ready features to the Hadoop ecosystem, not just to the core. Over the years, Cloudera has worked with a large ecosystem of partners and development community members to promote open standards for data access and governance through the Cloudera Accelerator Program and the One Platform Initiative. With its Apache-licensed open source model, Cloudera ensures that data and applications remain the customer’s own, and that the platform stays open to connect with all of their existing investments in technology and skills.

With all the hustle and bustle of Strata + Hadoop World this week, don’t forget to stop by the Cloudera booth and the Teradata booth to talk to us about the Teradata Appliance for Hadoop with Cloudera!

Clarke Patterson is the senior director of product marketing at Cloudera, responsible for Cloudera’s platform for big data. Clarke joined Cloudera after spending almost three years in a similar role at Informatica. Prior to Informatica, he held product management positions at IBM, Informix, and Red Brick Systems. Clarke brings over 17 years of leadership experience to Cloudera, having led teams in product marketing, product management, and engineering. He holds a Bachelor of Science degree from the University of Calgary and an MBA from Duke University’s Fuqua School of Business.

The Benefits and Evolution of the Hadoop Appliance

Posted on: July 9th, 2015 by Chris Twogood


Running Hadoop on an appliance offers significant benefits, but as Hadoop workloads become more sophisticated, so too must the appliance. That’s exactly why we’re releasing the ‘new’ Teradata Appliance for Hadoop 5. Our new appliance has evolved alongside Hadoop usage scenarios while giving IT organizations more freedom of choice to run diverse workloads. Running Hadoop on an appliance makes more sense than ever before.

If you’re running – or thinking about running – Hadoop on an appliance, you’re not alone. According to an ESG survey reported on by SearchDataCenter.com, 21% of IT organizations are considering dedicated analytics appliances. That’s the same percentage of organizations that are considering public cloud solutions and double those considering a public/private hybrid deployment. What is driving the adoption of Hadoop appliances?

5 Key Benefits of Running Hadoop on an Appliance

Organizations that choose to deploy Hadoop on an appliance versus rolling out their own solution realize five important benefits.

  1. Hadoop is delivered ready to run.

We’ve heard industry experts say that it can take IT organizations six to eight months to roll out a Hadoop implementation on their own. With a Teradata appliance, we’ve done all the hard work in terms of installing and configuring multiple types of software as well as installing and configuring the operating system, networking and the like. You simply plug it in, and within days you are up and running.

  2. We’ve built high availability into our Hadoop appliances.

The Teradata Vital Infrastructure (TVI) proactively detects and resolves incidents. In fact, up to 72% of all hardware- and software-related incidents are detected and resolved by TVI before the customer even knows about them. We also run BYNET over InfiniBand, which delivers automated network load balancing, automated network failover, redundancy across two active fabrics, and multiple levels of network isolation. These features in Teradata Appliance for Hadoop 5 deliver the high availability IT organizations need in an enterprise-grade solution.

  3. It is Unified Data Architecture ready.

It’s not enough to just efficiently deploy Hadoop. IT organizations must be able to efficiently deploy Hadoop as a seamless part of an interconnected analytics ecosystem. The UDA-ready Hadoop appliance becomes an integral part of the organization’s larger data fabric, with BYNET over InfiniBand interconnect between Hadoop, the Integrated Data Warehouse and Aster big data analytics, and software integration such as QueryGrid, Viewpoint, TDCH, and Smart Loader.

  4. Single vendor support.

An appliance replaces the multiple support contracts IT organizations have with their hardware provider, Hadoop vendor, OS vendor, and various utilities, with a single “hand to shake.” If there’s any problem, one phone call puts you in touch with Teradata’s world-class, 24/7, multi-language support for the entire solution stack. IT organizations are seeing increasing value in this benefit as the Hadoop ecosystem has many moving parts associated with it, and single vendor support provides peace of mind.

  5. Running Hadoop on an appliance lowers your total cost of ownership (TCO).

The cost of a do-it-yourself Hadoop deployment includes much more than the hardware the software runs on. There are also costs associated with configuring the network, installing the OS, configuring the disks, installing the Hadoop environment, tuning the Hadoop environment, and testing. The costs of doing all this work internally add up, making the TCO of an appliance even more attractive.

What’s New with Teradata Appliance for Hadoop 5?

In addition to these five benefits, Teradata Appliance for Hadoop 5 delivers freedom of choice to run a variety of workloads. IT organizations now have more options when they run Hadoop on Teradata Appliance 5.

Recognizing that Hadoop workloads are diverse and evolving, Teradata Appliance for Hadoop 5 is available in three flexible configurations, enabling customers to select the configuration that best fits their workloads.

  • Performance configuration. For real-time processing and other workloads that require significant CPU, IO, and memory, we offer the performance configuration. This computationally intensive configuration enables organizations to run emerging Hadoop workloads such as streaming, Spark, and SQL on Hadoop. With 24 cores, this configuration has more cores per node, along with 512GB of RAM and 24 1.2TB storage drives.
  • Capacity configuration. The capacity configuration allows IT organizations to drive down the cost per terabyte. It is designed for heavy-duty, long-running batch jobs as well as long-term archival and storage. It comes with 128GB to 256GB of RAM and 4TB disk drives.
  • Balance configuration. The balance configuration sits between the performance and capacity configurations, allowing IT organizations to strike the right balance for ETL and analytics jobs. The balance configuration features 24 cores and 4TB capacity drives.

Learn more about Teradata’s Portfolio for Hadoop.

Hadoop Summit June 2015: 4 Takeaways

Posted on: June 18th, 2015 by Data Analytics Staff


For those in data—the developers, architects, administrators and analysts who capture, distill and integrate complex information for their organizations—the Hadoop Summit is one of the most important events of the year. We get to talk, share and learn from each other about how we can make Hadoop key to the enterprise data architecture.

The 2015 conference, held this month in San Jose, Calif., lived up to its billing. As a sponsor, Teradata had a big presence, including a booth that provided real-time demonstrations of our data solutions, as well as a contribution to the dialogue, with experts leading informative talks.

  • Peyman Mohajerian and Bill Kornfeld from Think Big spoke on the new business value of a data lake strategy.
  • Teradata’s Justin Borgman and Chris Rocca explored the future of Hadoop and SQL.

Over the course of the conference some big themes emerged. Here’s our insider look at the top takeaways from the 2015 Hadoop Summit:

1. Have no fear.

Yes, big data is here to stay.  And the opportunities to be gained are too great to let fear of failure guide your organization’s actions. David T. Lin, leader and evangelist of cloud platform engineering for Symantec, summed it up well: “Kill the fear. Haters to the left. Get it started and go.”

2. Take it step by step.

There’s an abundance of paths you can take to use and derive insights from your data.  Start small and scale. Hemal Gandhi, director of data engineering at One Kings Lane, said a good way to do that is to think like a startup, which often runs on innovation and agility. “There are lots of challenges in building highly scalable big data platforms … we took an approach that allows us to build a scalable data platform rapidly.”


3. Use predictive analytics.

Predictive analytics are worth taking the risk because they help uncover an organization’s next-best action to progress toward a goal. Alexander Gray, CTO of Skytree, discussed the benefits of “bigger” data and how those benefits can be quantified—in dollar terms. Because data size is a basic lever for predictive power, Gray said, “increasing business value is achieved by increasing predictive power.”

4. Personalize customer experiences.

Siloed applications combined in the Lambda architecture allow you to give your customers an experience that is tailored to their needs. Russell Foltz-Smith, vice president of data platform at TrueCar, said his system allows his company to accurately identify, assess value, predict and prescribe “who, what and where,” giving customers the transparency they’re increasingly demanding. “We need to make everything easily accessible,” he said. “We are moving to a contextually aware, intelligent search engine. You have to open it up and let people forage through your data to find what they need.”

Were you able to attend the Hadoop Summit or follow it online? What lessons did you take away from the event? Share your top Hadoop Summit insights in the comments below.

Regulating Data Lake Temperature

Posted on: June 15th, 2015 by Mark Cusack


By Mark Cusack, Chief Architect, Teradata RainStor

One of the entertaining aspects of applying physical analogies to data technology is seeing how far you can push the analogy before it falls over or people get annoyed.  In terms of analogical liberties, I’d suggest that the data lake occupies the number one spot right now.  It’s almost mandatory to talk of raw data being pumped into a data lake, of datamarts drawing on filtered data from a lakeside location, and of data scientists plumbing the data depths for statistical insight.

This got me thinking about what other physical processes affecting real lakes I could misappropriate.  I am a physicist, so I’ll readily misuse physical phenomena and processes to help illustrate logical processes if I think I can get away with it.  There are two important processes in real lakes that are worth bending out of shape to fit our illustrative needs. These are stratification and turnover.

Data Stratification

Let’s look at stratification first.  During the summer months, the water at the surface of a proper lake heats up, providing a layer of insulation to the colder waters below, which results in layers of water with quite distinct densities and temperatures.  Right away we can adopt the notion of hot and cold data as stratified layers within our data lake.  This isn’t a completely terrible analogy, as the idea of data temperature based on access frequency is well established, and Teradata has been incorporating hot and cold running data storage into its Integrated Data Warehouse for a while now.

Storing colder data is something we’re focused on at Teradata RainStor too.  One of RainStor’s use cases involves offloading older, colder data from a variety of RDBMS in order to buy back capacity from those source systems.  RainStor archives the low temperature data in a highly compressed – dense – form in a data lake, while still providing full interactive query access to the offloaded data. In this use case, RainStor is deployed in a secondary role behind one or more primary RDBMS.  Users can query this cold layer of data in RainStor directly via RainStor’s own parallel SQL query engine.  In addition, Teradata Integrated Data Warehouse users are able to efficiently query data stored in RainStor running on Hadoop via the Teradata® QueryGrid™.

Increasingly, however, RainStor is being deployed on a data lake as more than just an archive for cold data.  It’s being deployed as the system-of-record for structured data – as the primary repository for a mix of data of different temperatures and from different sources, all stored with original fidelity.   The common feature of this mixed data is that it doesn’t change, and so it fits in well with RainStor’s immutable data model, which can store and manage data on Hadoop and also on compliance-oriented WORM devices.

Data Turnover

The mixing of the data layers in the system-of-record use case is analogous to the turnover process in real lakes.  In winter months the upper layers of water cool and descend, displacing deeper waters to cause a mixing or turnover of the lake.  The turnover process is important in a watery lake as it mixes oxygen-poor water lower down with oxygen-rich surface water, supporting the ecosystem at all lake depths.

The lack of data stratification in a data lake is also important since one data scientist’s cold data is another one’s hot data.  By providing the same compression, SQL query, security and data life-cycle management capabilities to all data stored in RainStor, a data scientist pays no penalty for accessing the raw data in whatever way they choose to, be it through RainStor’s own SQL engine, Hive, Pig, MapReduce, HCatalog, or via the QueryGrid.

I’ve stretched the data lake metaphor to its limits in this post. The serious point is that data lakes are no longer seen as being supplied from a single operational source, as per the original definition.  They may be fed from a range of sources, with the data itself varying in structure.  Not only is schema flexibility a requirement for many data scientists, so too is the need for equally fast access to all data in the lake, free from the data temperature prejudices that might exist in upstream systems.


Mark Cusack joined Teradata in 2014 as part of its RainStor acquisition. As a founding developer and Chief Architect at RainStor, he has worked on many different aspects of the product since 2004. Most recently, Mark led the efforts to integrate RainStor with Hadoop and with Teradata. He holds a master’s degree in computing and a PhD in physics.


It happens every few years and it’s happening again. A new technology comes along and a significant segment of the IT and business community want to toss out everything we’ve learned over the past 60 years and start fresh. We “discover” that we’ve been wasting time applying unnecessary rigor and bureaucracy to our projects. No longer should we have to wait three to six months or longer to deliver technical solutions to the business. We can turn these things around in three to six days or even less.

In the mid-1990s, I was part of a team that developed a “pilot” object-oriented, client-server (remember when these were the hot buzzwords?) application to replenish raw materials for a manufacturing function. We were upending the traditional mainframe world by delivering a solution quickly and iteratively with a small team. When the end users started using the application in real life, it was clear they were going to rely on it to do their jobs every day. Wait, was this a pilot or…? I would come into work in the morning, walk into a special room that housed the application and database servers, check the logs, note any errors, make whatever fixes needed to be made, re-run jobs, and so on.

It wasn’t long before this work began to interfere with my next project, and the end users became frustrated when I wasn’t available to fix problems quickly. It took us a while and several conversations with operations to determine that “production” didn’t just mean “the mainframe”. “Production” meant that people were relying on the solution on a regular basis to do their jobs. So we backtracked and started talking about what kind of availability guarantees we could make, how backup and recovery should work, how we could transition monitoring and maintenance to operations, and so on. In other words, we realized what we needed was a traditional IT project that just happened to leverage newer technologies.

This same scenario is happening today with Hadoop and related tools. When I visit client organizations, a frightening number will have at least one serious person saying something like, “I really don’t think ‘data warehousing’ makes sense any more. It takes too long. We should put all our data in Hadoop and let our end users access whatever they want.” It is indeed a great idea to establish an environment that enables exploration and quick-turnaround analysis against raw data and production data. But to position this approach as a core data and analytics strategy is nothing short of professional malpractice.

The problem is that people are confusing experimentation with IT projects. There is a place for both, and there always has been. Experimentation (or discovery, research, ad-hoc analysis, or whatever term you wish to use) should have lightweight processes and data management practices – it requires prioritization of analysis activity, security and privacy policies and implementation, some understanding of available data, and so on, but it should not be overburdened with the typical rigor required of projects that are building solutions destined for production. Once a prototype is ready to be used on a regular basis for important business functions, that solution should be built through a rigorous IT project leveraging an appropriate – dare I say it – solution development life cycle (SDLC), along with a comprehensive enterprise architecture plan including, yes, a data warehouse that provides integrated, shared, and trusted production data.

An experimental prototype should never be “promoted” to a production environment. That’s what a project is for. Experimentation can be accomplished with Hadoop, relational technology, Microsoft Office, and many other technologies. These same technologies can also be used for production solutions. So, it’s not that “things are done differently and more quickly in Hadoop”. Instead, it’s more appropriate to say that experimentation is different than an IT project, regardless of technology.

Yes, we should do everything we can to reduce unnecessary paperwork and to speed up delivery using proper objective setting, scoping, and agile development techniques. But that is different than abandoning rigor altogether. In fact, using newer technologies in IT projects requires more attention to detail, not less, because we have to take the maturity of the technology into consideration. Can it meet the service level needs of a particular solution? This needs to be asked and examined formally within the project.

Attempting to build production solutions using ad-hoc, experimental data preparation and analysis techniques is like building a modern skyscraper with a grass hut mentality. It just doesn’t make any sense.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries. 

Taking Charge of Your Data Lake Destiny

Posted on: April 30th, 2014 by Data Analytics Staff


One of the most interesting areas of my job is having the opportunity to take an active role in helping shape the future of big data analytics. A significant part of this is the maturation of the open source offerings available to customers and how they can help address today’s analytic conundrums. Customers are constantly looking for new and effective ways to organize their data and they want to build the systems that will empower them to be successful across their organizations. But with the proliferation of data and the rise of data types and analytical models, solving this challenge is becoming increasingly complex.

One of the solutions that has become popular is the concept of a data lake. The idea of a data lake emerged when users were creating new types of data that needed to be captured and exploited across the enterprise. The concept is also tied quite closely to Apache Hadoop and its ecosystem of open source projects, so, as you can imagine, since two of my main focus areas (big data analytics and Hadoop) are being brought together, this is an area to which I pay close attention. Data lakes are designed to tackle some of the emerging big data challenges by offering a new way to organize and build the next generation of systems, and they provide a cost-effective and technologically refined way to approach and solve big data challenges. While data lakes are an important component of the logical data warehouse – because they are designed to give users choices for better managing and utilizing data within their analytical ecosystem – many users are also finding that the data lake is an obvious evolution of their existing Apache Hadoop ecosystem and data architecture.

Where do we begin? Quite simply, several questions need to be answered before you start down this path: How is the data lake related to your existing enterprise data warehouse? How do the two work together? And, quite possibly the most important question of all, what best practices should be leveraged to ensure the resulting strategy drives business value?

A recent white paper written by CITO Research and sponsored by Teradata and Hortonworks takes a close look at the data lake and provides answers to all of the above questions, and then some. Without giving away too much of the detail, I thought I would capture a few of the points that impress me most in this paper.

The data lake has come a long way since its initial entry onto the big data scene. Its first iteration had several limitations that made it daunting to general users. The original data lakes were batch-oriented, offered very limited ability for users to interact with the data, and required expertise with MapReduce and other scripting and query tools. Those factors, among others, limited its adoption. Today, however, the landscape is changing. With the arrival of Hadoop 2, and more specifically release 2.1 of the Hortonworks Data Platform, data lakes are evolving. New Hadoop projects brought better resource management and application multi-tenancy, allowing multiple workloads to run on the same cluster and enabling users from different business units to effectively refine, explore, and enrich data. Today, enterprise Hadoop is a full-fledged data lake, with new capabilities being added all the time.

As the capabilities of the data lake have evolved over the last few years, so has the world of big data. Companies everywhere started creating data lakes to complement their data warehouses, but now they must also build a logical data warehouse in which the data lake and the enterprise data warehouse are each used to full advantage -- and support each other in the best way possible as well.

The enterprise data warehouse plays a critical role in solving big data challenges, and together with the data lake, the combination can deliver real business value. The enterprise data warehouse is a carefully designed, sophisticated system that provides a single version of the truth that can be used over and over again. And, like a data lake, it supports batch workloads. Unlike a data lake, however, it also supports simultaneous use by thousands of concurrent users performing reporting and analytic tasks.

A data lake can serve several impressive uses, and several beneficial outcomes can result. It is well worth learning more about how data lakes can help you store and process data at low cost, how they support distributed analytics, and how the data lake and the enterprise data warehouse have started to work together as a hybrid, unified system that empowers users to ask questions that can be answered by more data and more analytics with less effort. To start learning about these initiatives, download our whitepaper here.

By Cesar Rojas - bio link 

Take a Giant Step with Teradata QueryGrid

Posted on: April 29th, 2014 by Dan Graham No Comments


Teradata 15.0 has drawn tremendous interest from customers and the press because it enables SQL access to native JSON data. This heralds the end of the belief that data warehouses can’t handle unstructured data. But there’s an equally momentous innovation in this release: Teradata QueryGrid.

What is Teradata QueryGrid?
In Teradata’s Unified Data Architecture (UDA), there are three primary platforms: the data warehouse, the discovery platform, and the data platform. In the UDA diagram, huge gray arrows represent data flowing between these systems. A year or two ago, those arrows were extract files moved in batch mode.

Teradata QueryGrid is both a vision and a technology. The vision --simply said-- is that a business person connected to the Teradata Database or Aster Database can submit a single SQL query that joins data together from a second system for analysis. There’s no need to plead with the programmers to extract data and load it into another machine. The business person doesn’t have to care where the data is – they can simply combine relational tables in Teradata with tables or flat files found in Hadoop on demand. Imagine a data scientist working on an Aster discovery problem and needing data from Hadoop. By simply adjusting the queries she is already using, she can fetch Hadoop data and combine it with tables in the Aster Database. That should be a huge “WOW” all by itself, but let’s look further.

You might be saying “That’s not new. We’ve had data virtualization queries for many years.” Teradata QueryGrid is indeed a form of data virtualization. But Teradata QueryGrid doesn’t suffer from the normal limitations of data virtualization such as slow performance, clogged networks, and security concerns.

Today, the vision is translated into reality as connections between the Teradata Database and Hadoop, and between the Aster Database and Hadoop. Teradata QueryGrid also connects the Teradata data warehouse to Oracle databases. In the near future, it will extend to all combinations of UDA servers, such as Teradata to Aster, Aster to Aster, Teradata to Teradata, and so on.

Seven League Boots for SQL
With QueryGrid, you can add a clause in a SQL statement that says “Call up Hadoop, pass Hive a SQL request, receive the Hive results, and join them to the data warehouse tables.” Running a single SQL statement spanning Hadoop and Teradata is amazing in itself – a giant step forward. Notice too that all the database security, advanced SQL functions, and system management in the Teradata or Aster system support these queries. The only effort required is for the database administrator to set up a “view” that connects the systems. It’s self-service for the business user after that. Score: complexity zero, business users one.
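As a sketch, that setup might look like the following. The syntax is illustrative only (exact QueryGrid syntax varies by Teradata release), and every name here -- hadoop_server, sales, web_logs, sales_with_weblogs -- is invented for this example:

```sql
-- One-time DBA setup: a view that joins a local warehouse table to a
-- Hive table reached through a QueryGrid foreign-server definition.
-- All names are hypothetical; consult your release's documentation
-- for the exact foreign-server syntax.
CREATE VIEW sales_with_weblogs AS
SELECT s.account_number,
       s.account_type,
       w.page_path
FROM   sales s
JOIN   web_logs@hadoop_server w   -- "@hadoop_server" names the remote Hadoop system
ON     s.account_number = w.account_number;

-- After that it is self-service: any business user with any BI tool
-- can query the view as if all of the data were local.
SELECT account_type, COUNT(*)
FROM   sales_with_weblogs
GROUP  BY account_type;
```

The business user never sees the foreign-server plumbing; to them it is just another view.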

Parallel Performance, Performance, Performance
Historically, data virtualization tools have lacked the ability to move data between systems in parallel. Such tools send a request to a remote database, and the data comes back serially through an Ethernet wire. Teradata QueryGrid is built to connect to remote systems in parallel and exchange data through many network connections simultaneously. Wanna move a terabyte per minute? With the right configurations it can be done. Parallel processing by both systems makes this incredibly fast. I know of no data virtualization system that does this today.

Inevitably, the Hadoop cluster will have a different number of servers than the Teradata or Aster MPP systems. The Teradata and Aster systems start the parallel data exchange by matching up units of parallelism between the two systems. That is, all the Teradata parallel workers (called AMPs) connect to a buddy Hadoop worker node for maximum throughput. Anytime the configuration changes, the worker match-up changes. This is non-trivial, rocket-science-class technology. Trust me – you don’t want to build this yourself, and the worst situation would be to expose it to the business users. But Teradata QueryGrid does it all for you, completely invisible to the user.

Put Data in the Data Lake FAST
Imagine that complex predictive analytics using R® or SAS® are run inside the Teradata data warehouse as part of a merger and acquisition project. In this case, we want to pass the results to the Hadoop data lake, where they are combined with temporary data from the company being acquired. With moderately simple SQL stuffed in a database view, the answers calculated by the Teradata Database can be sent to Hadoop to help finish up some reports. Bi-directional data exchange is another breakthrough in Teradata QueryGrid, new in release 15.0. The common thread in all these innovations is that the data moves from the memory of one system to the memory of the other. No extracts, no landing the data on disk until the final processing step – and sometimes not even then.
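A hedged sketch of that outbound flow, again with invented names (td_model_scores, merger_scores, hadoop_server) and illustrative rather than exact syntax:

```sql
-- Push scored results from the warehouse out to a table in the Hadoop
-- data lake, memory to memory -- no extract files. Names and syntax
-- are hypothetical examples, not exact QueryGrid statements.
INSERT INTO merger_scores@hadoop_server
SELECT account_number,
       predicted_value
FROM   td_model_scores;
```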

Push-down Processing
What we don’t want to do is transfer terabytes of data from Hadoop and throw away 90% of it since it’s not relevant. To minimize data movement, Teradata QueryGrid sends the remote system SQL filters that eliminate records and columns that aren’t needed. An example constraint could be “We only want records for single women age 30-40 with an average account balance over $5000. Oh, and only send us the account number, account type, and address.” This way, the Hadoop system discards unnecessary data so it doesn’t flood the network with data that will be thrown away. After all the processing is done in Hadoop, data is joined in the data warehouse, summarized, and delivered to the user’s favorite business intelligence tool.
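Expressed as a query -- once more with hypothetical table and column names and illustrative syntax -- the example constraint might look like this; QueryGrid would push the filters and the column list down to Hive so that only qualifying rows cross the network:

```sql
SELECT account_number, account_type, address    -- only these three columns travel
FROM   customer_profiles@hadoop_server          -- remote Hive table, hypothetical name
WHERE  marital_status = 'Single'
  AND  gender = 'F'
  AND  age BETWEEN 30 AND 40
  AND  avg_balance > 5000;                      -- filters evaluated on the Hadoop side
```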

Teradata QueryGrid delivers some important benefits:
• It’s easy to use: any user with any BI tool can do it
• Low DBA labor: it’s mostly setting up views and testing them once
• High performance: reducing hours to minutes means more accuracy and faster turnaround for demanding users
• Cross-system data on demand: don’t get stuck in the programmers’ work queue
• Teradata/Aster strengths: security, workload management, system management
• Minimum data movement improves performance and reduces network use
• Move the processing to the data

Big data is now taking giant steps through your analytic architecture --frictionless, invisible, and in parallel. Nice boots!