The Data Lake De-Mystified

Posted on: August 11th, 2014 by Ben Davis

“Data lake”: by now most of you will have heard the term, and you will almost certainly have read about it if you follow the big data bloggers and websites. But what image does it conjure up? Some kind of body of water is what comes to my mind. But what does it really mean from an IT perspective? Let’s look at its origins.

Pentaho CTO James Dixon is credited with coining the term “data lake”. As he described it in a blog entry: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

So with that understood, how did the concept of a data lake come about? The data lake arose because new types of data needed to be captured and exploited by the enterprise. Today many different and varied data types are being created, ranging from video, audio and sensor data to documents and traditional structured records. We therefore need a way to bring it all together and generate some meaning out of it.

But because there is so much data, we need to do this at low cost. Imagine what it would have cost, even just ten years ago, to store the volumes of data we hold in today’s environment. It would most probably have been the single most expensive capital expenditure item on an organisation’s balance sheet.

Use a data lake for low-cost storage and ETL functions whilst executing analytical operations on the EDW for superior performance.

So isn’t a data lake just Hadoop?

The answer is yes and no. Hadoop is the most obvious choice because it provides a cost-effective and technologically feasible way to meet big data challenges, and it offers a range of features designed to meet the requirements of a data lake, which I will discuss later. However, it is not the only solution. A well-defined and well-architected traditional data warehouse can also serve the purpose of a data lake; in fact, if performance were critical, a data warehouse might be the better choice.

I’m a bit confused… So why would I need both a data lake and a data warehouse?

Organisations that have spent many years investing in a data warehouse are asking the same question. The answer is that both the data lake and the data warehouse have their strengths and weaknesses, and they work best when used together. Only by drawing on the strengths of each solution does an organisation derive real insights and manage its data effectively.

So if we compare an enterprise data warehouse and a data lake along the important dimensions, we can start to identify the sweet spot of each solution and understand how the two can interact. The following table neatly summarises the strengths and capabilities of each solution.

What about other benefits of a data lake?

Apart from cheaper storage of large volumes of data, there are benefits to users. The data lake gives business users immediate access to all data: they don’t have to wait for the data warehousing (DW) team to model the data or grant them access. Rather, they shape the data however they want to meet local requirements. The data lake speeds delivery and offers unparalleled flexibility, since nothing stands between business users and the data. For this to work, however, you need fairly experienced users, something to keep in mind if you plan to give users direct access to raw data.

Secondly, the data lake can contain any type of data: clickstream, machine-generated, social media and external data, and even audio, video and text. Traditional data warehouses, by contrast, are limited to structured data.

Thirdly, with Teradata’s QueryGrid technology, you can use the data lake as the permanent storage facility and avoid moving data into different silos of the enterprise to perform the more strategic analytical work that the data lake cannot do itself. You simply execute the query from the Teradata EDW or Teradata Aster platforms and ask for it to be executed at the source where the data resides. If all of the data resides in your data lake, you get the best of both worlds: the low-cost storage of the data lake and the high-performance analytics of the data warehouse, while still controlling access and presentation.
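To make that concrete, here is a minimal sketch of what issuing such a query from a client could look like. The ODBC DSN, credentials, the foreign server name (hdp_lake), the weblogs table and the table@server reference are all illustrative assumptions on my part, not verified QueryGrid syntax; check the QueryGrid documentation for the exact foreign-table DDL in your release.

    # Illustrative sketch only: querying lake-resident data from the EDW via ODBC.
    # The DSN, credentials, foreign server name ("hdp_lake") and the table@server
    # reference are assumptions for illustration, not verified QueryGrid syntax.
    import pyodbc

    conn = pyodbc.connect("DSN=tdprod;UID=analyst;PWD=secret")
    cursor = conn.cursor()

    # The query runs on the EDW, but the weblog rows are read in place from the
    # Hadoop data lake through the foreign-server reference.
    cursor.execute("""
        SELECT c.customer_id, COUNT(*) AS page_views
        FROM   weblogs@hdp_lake w
        JOIN   customer c
          ON   c.customer_id = w.customer_id
        GROUP  BY c.customer_id
    """)

    for customer_id, page_views in cursor.fetchall():
        print(customer_id, page_views)

    conn.close()

The point of the pattern is that the analyst writes one query against the EDW and lets the platform decide what to fetch from the lake, rather than hand-building an extract.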

The following diagram represents the groundbreaking QueryGrid technology introduced in Teradata 15.

The data lake as an ETL environment

We are finding that the data lake has great value as an ETL environment, primarily because of its ability to store and process data at low cost. There are also many different methods for transforming data, which makes the data lake an ideal place to massage data before it heads off to your organisation’s data warehouse. This sort of “scale-out ETL” allows big data to be distilled into a form that can then be utilised by users who don’t have the skills to work with the raw data.
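As a sketch of that scale-out ETL pattern, here is a tiny Hadoop Streaming mapper in Python. The input layout (tab-separated clickstream lines of timestamp, user id, URL and HTTP status) is an assumption made up for illustration; the point is simply that raw lake data can be filtered and narrowed cheaply before it is loaded into the warehouse.

    #!/usr/bin/env python
    # Minimal Hadoop Streaming mapper: cleanse raw clickstream lines before the
    # result is loaded into the warehouse. The four-field tab-separated layout
    # (timestamp, user_id, url, status) is an illustrative assumption.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # drop malformed rows rather than poison downstream loads
        timestamp, user_id, url, status = fields
        # Keep only successful page views; emit a narrower record for the EDW.
        if status == "200":
            print("\t".join([timestamp, user_id, url]))

You would run something like this with the hadoop-streaming jar over the raw files in the lake and point the warehouse loader at the cleansed output directory.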

The risks of the data lake?

Whilst the data lake may sound like the best thing since sliced bread, you have to remind yourself that the concept is still in its infancy. Apache Hadoop was conceived as a tool for developers, who were more concerned with the power to transform and analyse data and create applications at low cost and high scale than with anything else. There are therefore areas for improvement, especially around security and auditing, which raise questions such as:

  • How secure is the data?
  • How is access controlled?
  • What auditing trails exist?

So in summary, data lakes are only beginning to capture the attention of the data architects in your organisation. They are complementary to any overall data management initiative and should be used in conjunction with your traditional data warehouse. The benefits are there, but you need to plan a data lake carefully as part of an overall strategy. Just because you can store any type of data at low cost in a single solution doesn’t mean there are no pitfalls. I haven’t yet seen any large organisation take on the full concept of a data lake, so at this stage explore at your own risk.

Probably the best approach is to start small: use a small Hadoop cluster to store a particular set of data types and grow it over time. Explore the use of QueryGrid with your enterprise data warehouse and look to use each platform in your environment for its strengths.


Ben Davis is a Senior Architect for Teradata Australia, based in Canberra. With 18 years of experience in consulting, sales and technical data management roles, he has worked with some of the largest Australian organisations in developing comprehensive data management strategies. He holds a degree in Law and a postgraduate Masters in Business and Technology, and is currently finishing his PhD in Information Technology with a thesis on executing large-scale algorithms within cloud environments.

2 Responses

  1. Daniel Mannino

    August 12, 2014

    Ben,
    good article. I have one technical question and one question related to data management.
    On the technical side, if the data lake is just a place where I can store raw files coming from the transactional system, what is the difference between a data lake and a cluster file system? GlusterFS, for example, is cheaper than Hadoop (less data redundancy, no NameNode bottleneck, no 64MB block size, …)

    From the data management point of view, we need a strong metadata layer to make the data lake effective. The data lake should help the end user find and understand the content of the lake. Which metadata solution would you use?
    Thanks,
    Daniel

  2. Ben Davis

    August 14, 2014

    Hi Daniel. The differences between GlusterFS and Hadoop have really blurred over the past two years, so much so that there is no technical limitation to using Gluster as your data lake. But regardless of whether you have a Teradata EDW or another EDW, interoperability must still be considered, as you will need to extract data out of the lake into an analytics platform cleanly, and that movement must be seamless and fast. We talk about Hadoop a lot because of the SQL-H support that’s built into Teradata, and because our big data platforms come with Hadoop nodes built in, allowing transfer of data over InfiniBand. So really, any low-cost storage platform with the connectors to a data analytics environment is sufficient to act as a data lake.

    For metadata within the data lake I can’t go past HCatalog. HCatalog allows you to create, edit and expose (via a REST API) metadata and table definitions. Keep in mind that the users who act directly against Hadoop should be highly skilled: in effect you are allowing them open access to the lake with toolsets, effectively giving them open slather across the data.
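    As a rough illustration of that REST API (the hostname, database, table name and user.name values here are just placeholders, not anything from a real cluster), WebHCat exposes HCatalog metadata over HTTP, by default on port 50111:

        # Sketch: browsing HCatalog metadata through WebHCat's REST API.
        # Host, database, table and user.name values are illustrative assumptions.
        import requests

        BASE = "http://hadoop-master:50111/templeton/v1"
        USER = {"user.name": "analyst"}

        # List the tables HCatalog knows about in the 'default' database.
        tables = requests.get(BASE + "/ddl/database/default/table", params=USER).json()
        print(tables.get("tables", []))

        # Describe one (assumed) table so end users can see what the lake holds.
        desc = requests.get(BASE + "/ddl/database/default/table/weblogs", params=USER).json()
        for column in desc.get("columns", []):
            print(column["name"], column["type"])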

