Regulating Data Lake Temperature

By Mark Cusack, Chief Architect, Teradata RainStor

One of the entertaining aspects of applying physical analogies to data technology is seeing how far you can push the analogy before it falls over or people get annoyed.  In terms of analogical liberties, I’d suggest that the data lake occupies the number one spot right now.  It’s almost mandatory to talk of raw data being pumped into a data lake, of datamarts drawing on filtered data from a lakeside location, and of data scientists plumbing the data depths for statistical insight.

This got me thinking about what other physical processes affecting real lakes I could misappropriate.  I am a physicist, so I’ll readily misuse physical phenomena and processes to illustrate logical ones if I think I can get away with it.  Two processes in real lakes are worth bending out of shape to fit our illustrative needs: stratification and turnover.

Data Stratification

Let’s look at stratification first.  During the summer months, the water at the surface of a proper lake heats up, providing a layer of insulation to the colder waters below, which results in layers of water with quite distinct densities and temperatures.  Right away we can adopt the notion of hot and cold data as stratified layers within our data lake.  This isn’t a completely terrible analogy, as the idea of data temperature based on access frequency is well established, and Teradata has been incorporating hot and cold running data storage into its Integrated Data Warehouse for a while now.

Storing colder data is something we’re focused on at Teradata RainStor too.  One of RainStor’s use cases involves offloading older, colder data from a variety of RDBMSs in order to buy back capacity on those source systems.  RainStor archives the low-temperature data in a highly compressed (dense) form in a data lake, while still providing full interactive query access to the offloaded data.  In this use case, RainStor is deployed in a secondary role behind one or more primary RDBMSs.  Users can query this cold layer of data directly via RainStor’s own parallel SQL query engine.  In addition, Teradata Integrated Data Warehouse users can efficiently query data stored in RainStor running on Hadoop via the Teradata® QueryGrid™.
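
To make that concrete, here is a minimal sketch of the offload pattern from the query side: cold data that has left the primary RDBMS is still reachable with ordinary SQL.  The DSN, table, and column names are hypothetical placeholders, and pyodbc simply stands in for whichever standard ODBC or JDBC client you prefer.

```python
# A minimal sketch, assuming the cold-data archive is reachable through a
# standard ODBC data source.  The DSN, table, and column names below are
# hypothetical, not real RainStor objects.
import pyodbc

ARCHIVE_DSN = "DSN=rainstor_archive"   # hypothetical ODBC data source

def cold_call_records(start_date, end_date):
    """Fetch archived call-detail records for a date range with plain SQL."""
    query = (
        "SELECT call_id, caller, callee, call_start, duration_secs "
        "FROM cdr_archive "
        "WHERE call_start BETWEEN ? AND ?"
    )
    conn = pyodbc.connect(ARCHIVE_DSN)
    try:
        cursor = conn.cursor()
        cursor.execute(query, start_date, end_date)
        return cursor.fetchall()
    finally:
        conn.close()

# Example: pull a month of cold data without touching the primary RDBMS.
rows = cold_call_records("2014-01-01", "2014-01-31")
print(f"{len(rows)} archived records retrieved")
```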

Increasingly, however, RainStor is being deployed on a data lake as more than just an archive for cold data.  It’s being deployed as the system of record for structured data: the primary repository for a mix of data of different temperatures and from different sources, all stored with original fidelity.  The common feature of this mixed data is that it doesn’t change, and so it fits in well with RainStor’s immutable data model, which can store and manage data on Hadoop and also on compliance-oriented WORM devices.
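
As a rough illustration of what “immutable” buys you, here is a toy write-once store: each incoming batch becomes a new, read-only partition, and nothing is ever updated in place.  The paths and CSV layout are invented for the example; this shows the general write-once idea, not how RainStor organises data internally.

```python
# A toy sketch of the write-once idea: each batch of records becomes a new,
# read-only partition under an archive root, and existing partitions are
# never modified.  Paths and the CSV layout are invented for illustration.
import os
import time

ARCHIVE_ROOT = "/data/lake/system_of_record"   # hypothetical location

def append_batch(source_name, records):
    """Land a batch of records as a new immutable partition."""
    partition_dir = os.path.join(ARCHIVE_ROOT, source_name)
    os.makedirs(partition_dir, exist_ok=True)
    partition = os.path.join(partition_dir,
                             f"batch_{int(time.time() * 1000)}.csv")

    # 'x' mode refuses to open a file that already exists, so a partition
    # can be created exactly once and never rewritten.
    with open(partition, "x") as f:
        for record in records:
            f.write(",".join(str(field) for field in record) + "\n")

    os.chmod(partition, 0o444)   # read-only once written, WORM-style
    return partition
```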

Data Turnover

The mixing of the data layers in the system-of-record use case is analogous to the turnover process in real lakes.  In winter months the upper layers of water cool and descend, displacing deeper waters to cause a mixing or turnover of the lake.  The turnover process is important in a watery lake as it mixes oxygen-poor water lower down with oxygen-rich surface water, supporting the ecosystem at all lake depths.

The lack of data stratification in a data lake is also important, since one data scientist’s cold data is another’s hot data.  By providing the same compression, SQL query, security and data life-cycle management capabilities to all data stored in RainStor, a data scientist pays no penalty for accessing the raw data in whatever way they choose, be it through RainStor’s own SQL engine, Hive, Pig, MapReduce, HCatalog, or via the QueryGrid.
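
The sketch below shows that idea from a user’s perspective: the same archived table queried through two different routes, the archive’s own SQL interface over ODBC and Hive on the Hadoop cluster via the PyHive client.  Hostnames, the DSN, and the table name are again hypothetical.

```python
# A sketch of two routes to the same archived table: the archive's own SQL
# engine over ODBC, and Hive running against the same data on the Hadoop
# cluster (via the PyHive client).  Hostnames, DSN, and table name are
# hypothetical.
import pyodbc
from pyhive import hive

SQL = "SELECT caller, COUNT(*) AS calls FROM cdr_archive GROUP BY caller"

# Route 1: the archive's native parallel SQL engine, via an ODBC data source.
odbc_conn = pyodbc.connect("DSN=rainstor_archive")
native_rows = odbc_conn.cursor().execute(SQL).fetchall()
odbc_conn.close()

# Route 2: Hive on the Hadoop cluster, reading the same underlying data.
hive_conn = hive.Connection(host="hadoop-edge.example.com", port=10000)
hive_cursor = hive_conn.cursor()
hive_cursor.execute(SQL)
hive_rows = hive_cursor.fetchall()
hive_conn.close()

# Either route should see the same records; the choice is about tooling and
# workflow, not about which layer of the lake the data happens to sit in.
print(len(native_rows), len(hive_rows))
```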

I’ve stretched the data lake metaphor to its limits in this post.  The serious point is that data lakes are no longer seen as being supplied from a single operational source, as per the original definition.  They may be fed from a range of sources, with the data itself varying in structure.  Not only is schema flexibility a requirement for many data scientists, but so too is equally fast access to all data in the lake, free from the data-temperature prejudices that might exist in upstream systems.


Mark Cusack joined Teradata in 2014 as part of its RainStor acquisition. As a founding developer and Chief Architect at RainStor, he has worked on many different aspects of the product since 2004. Most recently, Mark led the efforts to integrate RainStor with Hadoop and with Teradata. He holds a Masters in computing and a PhD in physics.
