Big Data: not unprecedented; but not bunk, either – Part II

Wednesday July 30th, 2014

In the first post of this series, I tried to describe the Big Data phenomenon and to explain how effectively exploiting the three “new waves” of Big Data has enabled organisations like Amazon, eBay, LinkedIn and Netflix to prosper.  I also made the case that whilst a lot of what we have learned over the course of the last 35 years about Information Management and Analytics is as relevant as ever, Big Data are also associated with some new challenges.  In this second instalment, I will try to define them.

The organisations that we at Teradata have worked with that have succeeded in moving beyond the analysis of transactions-and-events to interactions-and-observations have all mastered five key challenges, which I discuss in detail below.

(1)     The multi-structured data challenge. The transaction and event data that we have captured, integrated and analysed in traditional Data Warehouses and Business Intelligence applications for the last three decades is largely well-formed, record-oriented – and defined in terms of an explicit schema.  The same is not always true of the new sources of Big Data.  Clickstream, social and machine log data are often characterised by their volatility: the schema or information model that we use to understand them may be implicit rather than explicit; it may be “document-oriented”, meaning that it may (or may not) include some level of hierarchical organisation; it may change continuously; or we may want to apply multiple different interpretations to the data at run-time (“schema on read”), depending on the use-case and application.  Generations of budding young Business Systems Analysts – I was once one of them, although it seems a long time ago now! – were taught that business processes change continuously, but that data and their relationships do not, so model the data.  Many of the new Big Data sources test this maxim to destruction and make traditional approaches to data integration (which require that we apply a relatively rigid and inflexible “schema on load” to data as it enters the Analytic environment) unproductive.
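By way of a sketch, “schema on read” might look like the following in Python. The clickstream records and field names here are hypothetical; the point is that no schema is imposed when the data land, and two use-cases project two different interpretations over the same raw records at read time:

```python
import json

# Hypothetical clickstream records: no fixed schema, fields vary per event.
raw_events = [
    '{"user": "u1", "action": "click", "page": "/home", "ts": 1}',
    '{"user": "u2", "action": "search", "query": "router", "ts": 2}',
    '{"user": "u1", "action": "click", "page": "/cart", "ts": 3}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: project only the fields this
    use-case cares about, tolerating records that lack them."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two different "schemas on read" over the same raw data.
clicks = [r for r in read_with_schema(raw_events, ["user", "page"]) if r["page"]]
searches = [r for r in read_with_schema(raw_events, ["user", "query"]) if r["query"]]
```

Contrast this with “schema on load”, where the second record would either be forced into a click-shaped target model or rejected on the way in.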

(2)     The iterative Analytics challenge. Interactions – whether between people and things, people and people, or things and things – describe networks or graphs.  Many – arguably even most – useful analyses of interaction data are characterised by operations in which record order is important.  Time-series, path and graph analytics are all problematical, in varying degrees, for ANSI-standard SQL technologies, as they are based on the relational model and set theory, in which the order of records has no meaning.  The various extensions to ANSI-standard SQL that have been proposed over the years to address these limitations – among them User Defined Functions (UDFs) and Ordered Analytical OLAP functions – are only a partial solution, especially because the requirement to process multi-structured data means that we will not always know when a function is written the precise schema of the data that it will need to process.  The net-net is that these queries are often difficult to express in ANSI standard SQL – and may be computationally expensive to run on platforms optimised for set-based processing, even if we are successful in doing so.
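To make the ordering problem concrete, here is a minimal path-analysis sketch in Python (the events and page names are invented for illustration). The question – “which users viewed a product and then went on to the cart?” – depends entirely on record order, which is exactly what set-based SQL discards:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical page-view events; the timestamp ordering is essential.
events = [
    {"user": "u1", "ts": 1, "page": "/home"},
    {"user": "u2", "ts": 1, "page": "/search"},
    {"user": "u1", "ts": 2, "page": "/product"},
    {"user": "u1", "ts": 3, "page": "/cart"},
    {"user": "u2", "ts": 2, "page": "/product"},
]

def user_paths(events):
    """Group events by user and order each group by timestamp,
    yielding (user, ordered list of pages) pairs."""
    ordered = sorted(events, key=itemgetter("user", "ts"))
    for user, group in groupby(ordered, key=itemgetter("user")):
        yield user, [e["page"] for e in group]

def followed_by(path, first, then):
    """True if 'then' occurs somewhere after 'first' in the path."""
    return first in path and then in path[path.index(first) + 1:]

# Which users viewed a product and subsequently reached the cart?
converted = [u for u, path in user_paths(events)
             if followed_by(path, "/product", "/cart")]
```

Expressing the same question in ANSI-standard SQL typically means self-joins or nested window functions; a path-oriented engine lets the analyst state the sequence directly.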

(3)     The noisy data challenge. Some of the new Big Data sets are large and noisy; getting larger quickly; infrequently accessed to support processing which is – at least today – associated with relatively relaxed Service Level Goals; and of unproven value. Organisations faced with capturing large and growing volumes of data in which the useful signal is accompanied by an even larger volume of data that represent extraneous noise to most of the organisation – but that may represent gold dust to a small and select group of Data Scientists – are naturally highly incentivised to look for cost-effective models for storing and processing these data.

(4)     The “there might be a needle in this haystack – but if it takes 12 months and €500,000 to find out, I don’t have the time and money to even go look” challenge.  Many organisations instinctively understand there is value in the new Big Data sets, but aren’t yet sure where to look for it.  The same traditional approaches to Data Integration (model the source systems; develop a new, integrated target data model; map the source models to the target model; develop ETL processes that reliably and accurately capture and transform source system data to the target model) that are often already problematical where the capture of multi-structured data are concerned are doubly problematical in these scenarios, because of the time and cost that they place between Data Scientists and access to the new data.  The costs of acquiring, cleansing, normalising and integrating data have been estimated to represent up to 70% of the total cost of deploying an Analytical database – and Extract-Transform-Load (ETL) project timescales are often measured in months and even years.  Where the data concerned will be re-used widely throughout the organisation (and possibly even beyond it) and will support mission critical business processes, or regulatory reporting, or both, this is a cost that is worth paying – and that is anyway cheaper than the alternatives.  When we want, however, not to reliably ask and answer questions but to explore new data sets to understand if they will enable us to pose new questions that are worth answering, we may need a different approach to acquiring data that provides “good enough” data quality – and that values speed and flexibility over ceremony and repeatability.  In these “exploration and discovery” scenarios, we experiment continuously on data to identify hypotheses worth testing and to identify new sources of data that drive insight.
Since many – even most – of these experiments will fail, productivity and cycle time are critical considerations; if a particular analysis on a new data set is to prove unproductive, we need to establish this early in the process so that we can “fail fast” and move on.  Only for those experiments that succeed – and that will form the basis of Analytics that need to be repeated and shared – will we consider bringing the data concerned through the traditional data integration process.
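A “good enough”, fail-fast loading step might be sketched like this in Python (the machine-log sample is hypothetical). Rather than halting the load on the first malformed record – as a production ETL process rightly would – we tolerate failures, track the failure rate, and use it as an early signal of whether the data set merits further investment:

```python
import json

# Hypothetical raw machine-log sample for exploratory loading.
raw_lines = [
    '{"sensor": "s1", "temp": 21.5}',
    'not-json-at-all',
    '{"sensor": "s2", "temp": 22.1}',
    '{"sensor": "s1"}',               # missing field - kept, handled downstream
]

def explore_load(lines):
    """Load with 'good enough' quality: keep what parses, count what
    doesn't, and report the failure rate so we can fail fast."""
    good, bad = [], 0
    for line in lines:
        try:
            good.append(json.loads(line))
        except ValueError:
            bad += 1                  # note the failure and move on
    return good, bad / len(lines)

records, failure_rate = explore_load(raw_lines)
# A high failure rate is an early "fail fast" signal to abandon the
# experiment; a low one suggests the data set is worth exploring further.
```

Only once an experiment over data loaded this way succeeds would the source graduate to the rigorous, repeatable integration process described above.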

(5)     The getting past “so what?” and delivering value challenge.  I attend a lot of “Big Data” conferences and events and am constantly amazed at how many vendors – and Analysts who probably should know better – intone from the stage that “the objective of a Big Data project is to gain new insight about the business”.  Life would be simpler and even more beautiful if that were the case – but of course, those vendors and Analysts are only half right, because our objective must be to use that insight to change the business and so drive return on investment (ROI).  As one of my former bosses once memorably put it: “old business process + expensive new technology = expensive, old business process.” Operationalising the insights gleaned from Analytic experiments will often require that we “productionise” the data and the Analytics concerned, so that we can reliably and accurately share new KPIs, measures, events and alerts throughout the business. Increasingly important as they are to any business, Data Scientists don’t run the business – managers, clerks, customer service representatives, logistics supervisors and the like do – and insight that is not made actionable and shared beyond the rarefied atmosphere of a dusty Data Lab somewhere inside the walls of the Corporate Ivory Tower will not enable them to do their jobs any better than they did before.

These five key challenges and their consequences are driving the most far-reaching evolution in Enterprise Analytical Architecture since Devlin, Inmon, Kimball et al. gave the world the Enterprise Data Warehouse.  And that is the subject of next week’s post.
