Recently I claimed, "Big Data… is rather a nebulous and ambiguous concept". Why do I feel that way? And what then is the significance of the "big data" movement?
If we can only define it to within three orders of magnitude, then Houston, we have a problem
Unfortunately, "big data" is fast becoming one of those over-hyped buzzwords that our industry is so fond of, so that increasingly it means whatever the person speaking to you wants it to mean. If you look at the Wikipedia definition, for example, "big data" is described as: "a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time". I don't think that this is a very useful definition at all. For example, what is a "tolerable elapsed time"? If we're talking about a security, emergency services or currency trading application, then it might reasonably be measured in seconds or even milliseconds; but if we're talking about a complex behavioural segmentation analysis to support a presentation for next week's Board meeting, then so long as the analysis is waiting for me when I come back from Starbucks after lunch, that's probably "tolerable". The same definition then goes on to say that: "big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set." But since something like three orders of magnitude separate "a few dozen terabytes" from "many petabytes", I'm not sure that this tells us anything very useful, either.
In the same Wikipedia article there is a list of examples of "big data". I think that this list is far more interesting and far more revealing, because, with one exception, all of the types of data listed have one thing in common: they are non-relational and can't easily be squeezed into the rows and columns of a "traditional" SQL DBMS. I think that this tells us something really profound, which is that digitization, the Internet and sensor technology are generating new types of data – images, voice recordings, text, social media posts – that are potentially incredibly valuable sources of insight for business, but which are mostly left on the data center floor right now, because we have lacked the tools, technology and expertise to deal with them appropriately and to exploit them for analysis. Sometimes these data sets are "vast" and justify the "big" label – and sometimes they're not. But in all cases, what is most interesting about them is actually not their size, but their nature – and what that implies for how we manage and exploit them.
More generally, I think that the idea that "big data" has come to represent is that major organizations simply cannot continue to compete in global markets without mature information management practices. High school kids in California are now being taught advanced analytics and data visualisation techniques. There isn't a senior Executive in business today worth his or her salt who can't read a P&L statement – and it's quite conceivable that, ten years from now, a sophisticated grasp of statistics and analytics will be equally important to a successful career in business. At that point, to a greater or lesser extent, we're all IT professionals.
Is this really news at all – after all, data is just data, right?
There is a school of thought that claims the whole "big data" circus is just that – an invention of technology vendors with products to ship. "Data is just data", goes this line of reasoning – and we have had tools and technologies that enable us to manage large data sets for a long time now.
Well, yes and no.
Our industry has traditionally differentiated between "structured" and "unstructured" data. Actually, I also have a problem with the expression "unstructured data", because "unstructured" data aren't "unstructured" at all – they just aren't structured relationally. So let's call them "relational" and "non-relational" data.
It is absolutely correct to say that we know how to manage and exploit relational data. Plenty of organizations still struggle with this challenge, for a variety of reasons – some organizational, some related to poor technology choices. But in principle, we know how to manage these data as a Corporate asset, and we have tools and technologies that we can use to exploit them – for example, EPoS sales data in Retail, Call Detail Records in Telecommunications, transaction data in Retail Finance – to drive incredible value. Teradata has 30 years' experience in building integrated Data Warehouses across many different industries; the largest database built on our technology is over 40 PB in size and growing; and we have customers that support near real-time analytics – with response times measured in milliseconds – alongside traditional "Business Intelligence" applications, all sharing the same data, stored on the same platform. So whilst the field continues to advance rapidly, these are problems for which there are proven solutions and approaches.
The industry is only just beginning to get to grips with the challenge of the new, non-relational data sources, however. Think about how much valuable information there is buried inside the recording of a call between a customer and a call centre agent. As a trivial example, if I process that voice recording with appropriate digital signal processing technology, then I can establish whether different types of customer are more likely to call from quiet or noisy locations – which might tell me something very useful about how I should design my IVR systems.
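To make that concrete, here is a minimal sketch – in Python, using only the standard library – of how one might estimate the background-noise level of a mono, 16-bit PCM call recording. The file name, the frame length and the -45 dBFS "noisy" threshold are illustrative assumptions for this post, not a description of any particular speech-analytics product:

```python
import math
import struct
import wave

def noise_level_dbfs(path, frame_ms=50):
    """Estimate the background-noise level of a mono, 16-bit PCM WAV file.

    The recording is split into short frames; the quietest tenth of the
    frames (the moments when nobody is speaking) is taken as the ambient
    noise floor, and that level is returned in dBFS (0 dBFS = full scale,
    more negative = quieter).
    """
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1 and wav.getsampwidth() == 2
        rate = wav.getframerate()
        samples = struct.unpack("<%dh" % wav.getnframes(),
                                wav.readframes(wav.getnframes()))

    frame_len = max(1, int(rate * frame_ms / 1000))
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(max(math.sqrt(sum(s * s for s in frame) / len(frame)), 1e-9))

    rms.sort()
    floor = rms[:max(1, len(rms) // 10)]   # quietest 10% of frames = ambient noise
    return 20 * math.log10((sum(floor) / len(floor)) / 32768.0)

if __name__ == "__main__":
    level = noise_level_dbfs("call_recording.wav")   # hypothetical recording
    print("noisy environment" if level > -45.0 else "quiet environment", round(level, 1))
```

Nothing about this is sophisticated – a production system would use proper voice-activity detection, for a start – but it shows the basic move: a non-relational artefact (a sound file) is reduced to a simple, structured attribute of the call.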
But things get really interesting when I start to integrate these data with the relational data that I already have in the Data Warehouse. If you call me to make an insurance claim and I can establish, again using appropriate digital signal processing technology, that you are stressed – and if the claims history data in the Data Warehouse tells me that this is the third questionable claim that you have made in the last twelve months – then probably I should send a Loss Assessor to validate your claim. If, on the other hand, those same algorithms score your voice as calm and measured, and this is your first claim, then probably I can save myself the time and expense. There are, of course, important security and privacy considerations, but there is a huge range of potential applications – from sentiment analysis of social media to fraud detection and prevention in Retail Finance – where the ability to combine and analyse relational and new, non-relational data shows incredible promise.
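Again purely as an illustration – the thresholds and field names here are invented, and a real insurer would use a properly calibrated fraud model rather than two hard-coded rules – this is the kind of triage logic that becomes possible once the voice-derived stress score sits alongside the relational claims history:

```python
from dataclasses import dataclass

@dataclass
class ClaimContext:
    customer_id: str
    voice_stress_score: float     # 0.0 (calm) .. 1.0 (highly stressed), derived from the call recording
    questionable_claims_12m: int  # from the claims history held in the Data Warehouse

def triage(claim: ClaimContext) -> str:
    """Toy triage rule combining a non-relational signal with relational history."""
    if claim.voice_stress_score > 0.7 and claim.questionable_claims_12m >= 2:
        return "refer to Loss Assessor"
    if claim.voice_stress_score < 0.3 and claim.questionable_claims_12m == 0:
        return "fast-track settlement"
    return "standard handling"

print(triage(ClaimContext("C-1042", voice_stress_score=0.82, questionable_claims_12m=2)))
# -> refer to Loss Assessor
```

Neither input is much use on its own; it is the combination of the two that drives the decision – which is exactly the point.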
What are the consequences for IT architecture and infrastructure?
From an IT architecture / infrastructure perspective, I think that the key thing to understand about all of this is that, at least for the foreseeable future, we'll need at least two different types of "database" technology to manage and exploit the relational and non-relational data efficiently: an integrated Data Warehouse, built on a Massively Parallel Processing (MPP) DBMS platform, for the relational data and for the relational metadata that we generate by processing the non-relational data (for example, that a call was made at this date and time, by this customer, and that they were assessed as being stressed and agitated); and another platform for processing the non-relational data, one that enables us to parallelise complex algorithms – and so bring them to bear on large data sets – using the MapReduce programming model. Since the value of these data is much greater in combination than in isolation – and because we may be shipping very large volumes of data between the different platforms – considerations of how best to connect and integrate these two repositories become very important.
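For readers unfamiliar with the programming model itself (as opposed to any particular product), here is a deliberately tiny, single-machine sketch of MapReduce in Python: map() emits key/value pairs from raw, processed call records, a "shuffle" groups them by customer, and reduce() collapses each group into one relational row of metadata – exactly the kind of row that would flow back into the integrated Data Warehouse. The record layout and the 0.7 stress threshold are assumptions made up for the example:

```python
from collections import defaultdict

def map_call(record):
    """Emit a (key, value) pair for each processed call recording."""
    yield record["customer_id"], {
        "calls": 1,
        "stressed": 1 if record["stress_score"] > 0.7 else 0,  # assumed threshold
    }

def reduce_customer(customer_id, values):
    """Collapse all of one customer's calls into a single relational row."""
    calls = stressed = 0
    for v in values:
        calls += v["calls"]
        stressed += v["stressed"]
    return {"customer_id": customer_id, "calls": calls, "stressed_calls": stressed}

def run(records):
    groups = defaultdict(list)                   # the "shuffle" phase
    for record in records:
        for key, value in map_call(record):
            groups[key].append(value)
    return [reduce_customer(k, vs) for k, vs in sorted(groups.items())]

calls = [
    {"customer_id": "C-1042", "stress_score": 0.82},
    {"customer_id": "C-1042", "stress_score": 0.10},
    {"customer_id": "C-7731", "stress_score": 0.25},
]
print(run(calls))   # rows ready to land alongside the relational data in the warehouse
```

The point of the real platforms – Hadoop and its alternatives – is that functions written in this map/reduce style can be distributed across many nodes and very large data sets without the analyst having to manage the parallelism by hand.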
The hype about Hadoop
As Mark Beyer of Gartner pointed out at the Gartner BI Summit in London earlier this year, "MapReduce" has become synonymous with "Hadoop", when in fact Hadoop is just one implementation of the MapReduce programming model. These different implementations – and even the different Hadoop distributions – all have different strengths and weaknesses. Our own Teradata-Aster product, for example, enables organizations to exploit the power of the MapReduce programming model using a much simpler interface – good, old-fashioned SQL – than many competing products, and in many cases with much greater parallel efficiency, too. So the first challenge that organizations face is to understand how they are likely to want to use these new data, so that they can establish which of the different technologies is best placed to address their particular challenges and opportunities, and articulate a clear strategy for the management and exploitation of "big data" that the company can mobilize around. As no less a publication than The Economist has pointed out several times in recent years, the "big data" phenomenon – however ambiguously it is defined – is real, and organizations ignore it at their peril.
If you haven't yet heard enough from me on this topic, click below. Just be warned that I have a great face – for radio!
Director of Platform & Solutions Marketing