By Cesar Rojas
It’s no secret that today’s increasingly savvy organizations are looking to Apache™ Hadoop®-based data platforms to solve some of their big data challenges. The promise of new levels of scalability, flexibility in data and processing, and the ability to handle a variety of data sources and formats sets Hadoop apart from other platforms struggling to succeed in those same areas. In turn, Hadoop-based data lakes have gone through a transformation of their own, emerging as a popular use case for many early adopters looking for a single, centralized place that can be easily built, that stores data from disparate sources and that makes the data available to multiple analytical processing capabilities.
Still, for all of the new technologies, their promises and their seemingly impressive capabilities, the hype outweighs the reality in many cases. Data professionals are overwhelmed with data governance, metadata management and data quality. While these issues continue to grow in importance to most organizations, so does their complexity. This growth in complexity is holding back companies’ ability to further adopt emerging data platforms, such as Hadoop-based data lakes, into big data architectures.
But there is an option. An integrated data management solution such as Teradata Loom® can ensure rapid access to high-quality, high-integrity data. Loom enables users, such as data analysts and data scientists, to easily find, access and understand data in Hadoop. They can sharply reduce the time spent on data preparation and metadata tagging and quickly start the exploratory analysis that leads to business insights.
Preparing Your Data in the Data Lake
Teradata Loom’s architecture model is flexible. Imagine no longer having to start data preparation from scratch. Loom makes this possible by tracking all of the data assets across Hadoop and maintaining the relationships between the original data sets and the data sets derived from them. This trackable data lineage is key to increasing productivity and overall success in data preparation. With Loom, users can capture metadata on partitioned tables and containers in Apache Hive™ for efficient processing. This ability to organize and manage diverse data sets simplifies the work and amplifies the productivity of any user who wants to begin analyzing their data.
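To make the idea of trackable lineage concrete, here is a minimal sketch in Python. It is not Loom's implementation or API; the `LineageGraph` class, its methods and the dataset names are all hypothetical, chosen only to illustrate recording which data sets a derived data set came from and walking back to the originals.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage registry (hypothetical, not Loom's API).

    Maps each dataset name to the set of datasets it was derived from.
    """

    def __init__(self):
        self._parents = defaultdict(set)

    def register(self, dataset, derived_from=()):
        """Record a dataset and, optionally, the datasets it was derived from."""
        sources = tuple(derived_from)
        self._parents[dataset].update(sources)
        for source in sources:
            # Make sure every source is known, even if never registered itself.
            self._parents.setdefault(source, set())

    def ancestors(self, dataset):
        """Return every dataset that the given dataset depends on, directly or not."""
        seen = set()
        stack = list(self._parents.get(dataset, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self._parents.get(node, ()))
        return seen


# Hypothetical dataset names, for illustration only.
graph = LineageGraph()
graph.register("raw_clicks")
graph.register("raw_users")
graph.register("sessions", derived_from=["raw_clicks"])
graph.register("user_sessions", derived_from=["sessions", "raw_users"])

print(sorted(graph.ancestors("user_sessions")))
```

With a record like this, anyone who picks up `user_sessions` can see which original data sets it rests on rather than rebuilding the preparation steps from scratch.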
Critical Success Factor: Exponential Growth in Data Variety
In the early days of Hadoop, volume, velocity and variety were touted as differentiators. When these 3 V’s were originally conceptualized, no one expected data variety to be a radically changing element; in fact, by today’s standards, datasets were pretty homogeneous. Fast forward to today and there is a variety of data unlike ever before. Manually profiling these different datasets is a recipe for data lake failure, often turning projects into a “data swamp” where data is unknown and non-actionable.
Teradata Loom offers ActiveScan technology: the ability to quickly and dynamically profile datasets and collect statistics about the data stored within a Hadoop cluster. This information is then rendered to the user, often a data scientist, giving them an accurate sense of what is stored inside the datasets. Being able to see what attributes exist, what types of values are present within the attributes, and what the statistical profile of each attribute looks like is a huge productivity gain. Even better, ActiveScan automatically detects when new data is registered in HDFS or Hive and performs this profiling without user intervention, which means more gains in productivity!
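As a rough illustration of what dataset profiling involves, the Python sketch below computes, for each attribute, the value types present and a few basic statistics. It is an assumption-laden toy, not ActiveScan itself: the `profile` and `infer_type` functions, the statistics chosen and the sample data are all hypothetical, and it reads a small in-memory CSV rather than HDFS or Hive.

```python
import csv
import io
import statistics
from collections import Counter


def infer_type(value):
    """Classify a raw string as 'int', 'float' or 'string'."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "string"


def profile(rows):
    """Profile a list of dict rows (attribute name -> raw string value)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)

    report = {}
    for name, values in columns.items():
        present = [v for v in values if v not in ("", None)]
        types = Counter(infer_type(v) for v in present)
        summary = {
            "count": len(values),
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "types": dict(types),
        }
        # Numeric statistics only make sense if every present value is numeric.
        if present and set(types) <= {"int", "float"}:
            numbers = [float(v) for v in present]
            summary.update(
                min=min(numbers),
                max=max(numbers),
                mean=statistics.fmean(numbers),
            )
        report[name] = summary
    return report


# Hypothetical sample data, for illustration only.
SAMPLE = "user_id,age,country\n1,34,US\n2,,DE\n3,29,US\n"
report = profile(list(csv.DictReader(io.StringIO(SAMPLE))))
for attribute, summary in report.items():
    print(attribute, summary)
```

Even this small report answers the questions raised above: which attributes exist, what kinds of values they hold and how those values are distributed. Running something like it automatically whenever new data lands is the part that removes the manual effort.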
Workbench Success & Data Wrangling
The Workbench might be the area that sets Loom apart the most; it’s a key differentiator for us. The Teradata Loom Workbench presents a graphical representation of your data and is built using HTML5 and CSS. It communicates with the Loom server through a RESTful API, which data scientists can also use to access the same information programmatically. Data accessed via this API is encoded in JSON.
If professionals were able to spend more time analyzing data rather than preparing it to be analyzed, productivity and business impact could increase significantly. Teradata Loom’s built-in “data wrangling” capabilities make that speed and accuracy possible and significantly simplify data preparation. Data wrangling, simply put, lets you transform the data and see what’s happening to it as you go. It enables highly exploratory, iterative interactions with the datasets, allowing quick and meaningful data preparation for statistical analysis.
Time, as they say, is money. The goal in each of these areas is to decrease the amount of time it takes data scientists, data analysts and users in general to prepare their data for meaningful analysis. Instead of taking a top-down approach to your data, Teradata Loom enables a bottom-up approach, which plays directly to the strengths of the Hadoop platform.
Why wait one more day to increase your productivity and your ability to get meaningful insight from your data?
Teradata Loom Community Edition is a free version, downloadable as a virtual machine, that can run on your laptop. It’s simple, it’s fast and it’s powerful. Experience a new world of metadata management, data lineage and data preparation. Get started today!
Download the free Teradata Loom Community Edition here: Teradata.com/tryloom