IT people always try to use soothing names for complex propositions (don’t we all love the fluffy Cloud, being Service-Oriented [I sometimes wish that the restaurant sector would adopt this] or promising our customers that we are, above all, Agile?).
The new buzzword is the data lake, which immediately brings to mind visions of calm waters and natural beauty (like the picture above, Lake Marian in Fiordland, NZ, taken the last time I hiked there).
So, what is a Data Lake?
Simply put, it is about never having to dispose of any data, mainly because it may be useful one day. With Hadoop, you can afford to keep all your data, so that at some point in the future, when you really need it, it is all there.
Is this new?
The sceptic may ask: If HDFS is just a File System, surely we could have kept all this data on some other File System before Hadoop?!
Well, yes. But could you easily retrieve it? The big difference between storing your data in, say, a Linux file directory and storing it on Hadoop is that Hadoop offers several access methods to the data (MapReduce and its SQL derivatives, such as Hive), whereas to query your Linux directory you would have to write, and maintain, your own programs.
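To make that difference concrete, here is a minimal sketch of the kind of hand-rolled program you end up writing against a plain directory of files. The directory layout, file format and column names (`region`, `amount`) are hypothetical, purely for illustration:

```python
import csv
import os

def total_sales_by_region(data_dir):
    """Hand-rolled aggregation over raw CSV files in a plain
    directory -- the custom code you must write (and maintain)
    when the data just sits on an ordinary file system.
    Column names here are illustrative assumptions."""
    totals = {}
    for name in os.listdir(data_dir):
        if not name.endswith(".csv"):
            continue  # skip anything that is not one of our data files
        with open(os.path.join(data_dir, name), newline="") as f:
            for row in csv.DictReader(f):
                region = row["region"]
                totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return totals

# On Hadoop, Hive lets you ask the same question declaratively:
#   SELECT region, SUM(amount) FROM sales GROUP BY region;
```

Every new question means another program like this; with Hive, it means another line of SQL.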
Can you really keep everything and retrieve anything?
You got me there…
I have a client that is struggling with this right now. The company uses Hadoop to store several TB of data that have no natural home anywhere else. A group of users would like to query that data many times a day. The problem is that Hadoop does not include an advanced query optimiser, and it does not support indexes, so every query ends up scanning the data in full. Queries that would take seconds on a decent RDBMS take up to 30 minutes on the Hadoop cluster.
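A quick back-of-the-envelope calculation shows why. Without indexes, query latency is bounded below by the time it takes to read all the data. The cluster size and scan rate below are illustrative assumptions, not measurements from the client's system:

```python
def full_scan_minutes(data_tb, nodes, mb_per_sec_per_node):
    """Rough lower bound on query latency when every query is a
    full scan (no indexes, no optimiser to prune the data).
    All input figures are illustrative assumptions."""
    total_mb = data_tb * 1024 * 1024          # TB -> MB
    seconds = total_mb / (nodes * mb_per_sec_per_node)
    return seconds / 60

# e.g. 2 TB spread across 10 nodes, each scanning ~100 MB/s:
# full_scan_minutes(2, 10, 100) -> roughly 35 minutes
```

An RDBMS with an index touches only the rows it needs; a full-scan engine pays for the whole data set on every query, no matter how selective the question is.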
So are you saying that the Data Lake is not a good idea?
Not at all! Your Data Lake must be part of your Information Architecture. You have to think about what information you need to store, how you plan to retrieve it and, therefore, where the best place to store it is.
So, before diving into your Data Lake adventure:
- Ensure that your Data Lake is part of a robust Enterprise Information Strategy.
- Seek out best-practice advice on how you will ingest, catalogue and retrieve the data, so the lake stays usable as it grows.
Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialising in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years' experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years, and for international banks before that. Connect with Ben Bor via LinkedIn.