An esteemed member of our Canberra office recently used the term “data spaghetti”, and I thought to myself what a clear and simple way to describe the problem in only two words. It got me thinking: can we relate data warehousing and analytics to that Italian cuisine exported around the world?
Delving a bit deeper into this concept, I found I could make a very strong connection between the world of data warehousing and analytics and cooking in the kitchen. Here are a few similarities:
Variety– Just like there are many shapes and forms of pasta to confront and confuse you at the supermarket, so too is there data variety. Remember the days when there were only a few flavours of ice cream? Data was similar, because most of it was managed in rows and columns. But things have changed dramatically, and we’re now faced with varied data types, from images, sound and video through to social media feeds and the Internet of Things.
Volume– My theory is to cook as much pasta as possible, because I know it will always get devoured in no time. The same concept may apply to your data analytics approach: boil as much data as possible and produce the outputs, because you know those outputs will be consumed by the end user or customer. But are you overdoing it? Could you be more efficient with how you treat the volumes of data? Check out Reference Information Architecture for a few ideas.
Raw, al dente or well done– I was nearly stumped on a comparison between raw pasta and data until I made the obvious connection. Most (or nearly all) people don’t like to eat raw pasta. So this is just like data, right? Most users don’t like raw data because it’s just too hard to take in and synthesise. However, some people, such as data scientists, prefer raw data because it’s in its natural state, ready for exploring. Most users will like their data somewhat filtered, cleansed, matched and organised, hence the relation to data being al dente (still has a bit of bite to it) or well cooked.
Is this your data issue?
The data spaghetti– This is the obvious analogy. Data spaghetti arises when you have so many data stores in your environment that trying to find the start and end of a single strand becomes near impossible. If you’re doing data analytics today and making operational decisions based on the outputs, ask yourself, “How do I trust the data that went into this decision? If I were to trace the data in this report back, could I get to the raw original data?”
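One way to untangle a strand is to record lineage as datasets are derived. The sketch below is purely illustrative (the `LineageLog` class and the dataset names are made up, not any real product API), but it shows the idea of walking a report back to its raw sources:

```python
# Hypothetical sketch: recording simple lineage so a report can be
# traced back to its raw sources. All names here are illustrative.

class LineageLog:
    """Records which upstream datasets each derived dataset came from."""

    def __init__(self):
        self.parents = {}  # derived dataset name -> list of source names

    def record(self, derived, sources):
        self.parents[derived] = list(sources)

    def trace(self, dataset):
        """Walk back through the spaghetti to the raw original datasets."""
        sources = self.parents.get(dataset)
        if not sources:
            return [dataset]  # nothing recorded upstream: treat as raw
        raw = []
        for src in sources:
            raw.extend(self.trace(src))
        return raw

log = LineageLog()
log.record("monthly_report", ["cleansed_sales"])
log.record("cleansed_sales", ["raw_pos_feed", "raw_crm_extract"])

print(log.trace("monthly_report"))  # -> ['raw_pos_feed', 'raw_crm_extract']
```

If you can’t produce something like that trace for the numbers in your reports, you may already be eating spaghetti.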
The secret sauce– What makes good pasta? I would say the sauce. It’s the binding agent that brings together different types of pasta into a single consumable meal. So what’s the secret sauce in data? Not hardware, not software, not even the data itself, but rather the developers and data scientists who build the insights and bring together disparate datasets.
Of course, their workbench in the data kitchen could simply be a small deployment of one of Teradata’s newly announced Cloudera or Hortonworks Hadoop appliances. Or they might be like me in the kitchen, where to get the job done properly I need a dizzying array of tools at my disposal. This equates to Teradata’s Unified Data Architecture (UDA), where each platform performs a specific role in the data kitchen, but when used together in an integrated fashion they produce something truly magical.
But of course we’re all budding MasterChef wannabes, so we now get creative with pasta and maybe try new approaches such as foams or gels (yes, I’m still talking food here, not hair products!). Take this concept into the data analytics world and the concept of “fail fast” applies. Tools such as Teradata Data Labs give us that sandpit-style environment where we can try something new, such as joining two datasets together, and see how the data reacts and what new analytics can be gained from such a combo. All the while we ensure we don’t ruin it for everyone, and if we do fail, we make sure we do it early, before we’ve wasted too much time and effort.
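A fail-fast experiment can be as small as this sketch: join two tiny samples in the sandpit and see what falls out. The datasets and field names below are invented for illustration, and the unmatched rows are exactly the kind of early, cheap failure you want to surface before scaling up:

```python
# Illustrative only: a quick sandpit experiment joining two small,
# made-up datasets to see whether the combination yields anything useful.

customers = [
    {"id": 1, "name": "Alice", "region": "ACT"},
    {"id": 2, "name": "Bob", "region": "NSW"},
]
web_visits = [
    {"customer_id": 1, "pages": 12},
    {"customer_id": 3, "pages": 4},  # no matching customer: an early find
]

by_id = {c["id"]: c for c in customers}
joined, orphans = [], []
for visit in web_visits:
    cust = by_id.get(visit["customer_id"])
    if cust:
        joined.append({**cust, **visit})
    else:
        orphans.append(visit)  # fail fast: flag the mismatch now, cheaply

print(len(joined), len(orphans))  # -> 1 1
```

Ten minutes in the sandpit tells you whether the combo is worth cooking at scale.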
But hold on, what about storing all the raw data? Well, that would be Hadoop (the equivalent of the kitchen fridge). It holds the raw data and acts just like the fridge in your kitchen at home, providing a landing zone where it’s easy to land data in its raw form, while I control how, and by whom, the data fridge is accessed. I might even want to keep some data longer term once it has aged; that’s where Hadoop comes in handy for long-term “cold” storage of archived data. Essentially, a freezer!
Finally, if you’re like me and throw everything into the fridge with abandon, only to later try to work out what is (or was) actually in that plastic bag, then you’ll know where I’m coming from when you wish you had taken the time to write on the bag what it contained. Take this into the data world and that’s the risk of Hadoop. Because it’s so easy to dump everything in without a structure in place for metadata, it becomes really hard when you open the door and can’t find what you’re looking for. What’s needed is a strong metadata governance approach around your data fridge, one that labels the data correctly (automatically, if desired) so that when it comes time to identify what’s in the fridge, you’ll know what’s what.
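“Writing on the bag” can be automated at landing time. The sketch below is a hedged illustration, with the label fields, file path and helper name all assumptions for the example, of attaching a small metadata record to every file as it lands in the raw zone:

```python
# Hypothetical sketch: build a metadata "label on the bag" for a file
# as it lands. Field names and the helper are illustrative assumptions.

import hashlib
import json
from datetime import datetime, timezone

def label_landing_file(path, source_system, owner, raw_bytes):
    """Return a metadata record describing one landed file."""
    return {
        "path": path,
        "source_system": source_system,
        "owner": owner,
        "landed_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),  # content fingerprint
        "size_bytes": len(raw_bytes),
    }

record = label_landing_file(
    "/landing/crm/extract_2016_08.csv", "CRM", "ben", b"id,name\n1,Alice\n"
)
print(json.dumps(record, indent=2))
```

Store these records in a searchable catalogue and opening the fridge door stops being a guessing game.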
So become a budding data analytics MasterChef if you wish. Try new data and new techniques, use the tools available to you, and hopefully at the end of it all you’ll have something to savour.
Ben Davis is a Senior Architect for Teradata Australia, based in Canberra. With 18 years of experience in consulting, sales and technical data management roles, he has worked with some of the largest Australian organisations in developing comprehensive data management strategies. He holds a degree in Law and a postgraduate Masters in Business and Technology, and is currently finishing his PhD in Information Technology with a thesis on executing large-scale algorithms within cloud environments.