Gregory Piatetsky-Shapiro knows a thing or two about extracting insight from data. In 1989 he co-founded the first Knowledge Discovery and Data Mining workshop, which we briefly discussed in the second installment of this blog series. And he has been practicing and instructing pretty much continuously ever since.
But what is it, exactly, that he has been practicing? Even Piatetsky-Shapiro might struggle to give you a consistent answer to that question, as this quote of his from 2012 hints:
Although the buzzwords describing the field have changed – from ‘knowledge discovery’ to ‘data mining’ to ‘predictive analytics’, and now to ‘data science’, the essence has remained the same – discovery of what is true and useful in mountains of data.
We like this quote a lot. Firstly, because it speaks to the fact that historically we have used at least four different terms – knowledge discovery, data mining, predictive analytics and data science – to describe substantially the same thing. The tools, techniques and technologies that we use continue to evolve, but our objective is basically the same.
And the second reason that we like this quote so much is that it contains three words that we think are key to understanding the analytic process.
Discovery. True. And Useful.
Let’s take each of these in turn.
Analytics is fundamentally about discovery. It’s about revealing patterns in data that we didn’t know existed – and extrapolating from them to try and know things that we otherwise wouldn’t know.
In fact, the analytic discovery process has more in common with research and development (R&D) than with software engineering. If we are doing it right, we should have a reasonably clear idea about the business challenges or opportunities that we are trying to address – for example, we may want to measure customer sentiment to establish whether it is correlated with store performance and to understand which parts of the shopping experience we should improve to increase customer satisfaction. Or we might want to predict the failure of train-sets based on patterns in sensor data. But often we won’t know which approach is likely to be most successful, whether the data available to us can support the desired outcome – or even whether the project is feasible at all. And that means – first and foremost – that whatever we call it, analytics is about experimentation. Repeated experimentation. As Foster Provost and Tom Fawcett put it in their (excellent) textbook Data Science for Business: “the results of a given step may change the fundamental understanding of the problem.” Traditional notions of scope and requirements are therefore often difficult to apply to analytics projects.
Secondly, many process models have been developed to try to codify the analytic process and so make it more reliable and repeatable. Of these, the Cross-Industry Standard Process for Data Mining (CRISP-DM) shown below is probably the most successful and the most widely known. The reality, however, is that analytics is an iterative process rather than a linear one. We can’t simply execute each step of the process in turn and hope that insight will miraculously “pop” out of the end. An unsuccessful attempt at modelling, say, customer propensity-to-buy may cause us to revisit the data preparation step to create new metrics that we hope will be more predictive. Or it may cause us to realize that we are insufficiently clear in our understanding of the business problem – and require us to start over. One important outcome of all of this is that “failure” rates for analytics initiatives are high. Often, these “failures” really aren’t failures in the traditional sense at all; rather, they represent important learning about which approaches, tools and techniques are relevant to a particular problem. The industry refers to this as “fail fast”, although it might be more appropriate to call it a “learn quick” approach to analytics. But whatever we call it, this high failure rate has important consequences for the way we organize and manage analytic projects, which we will return to later in this series.
There are many ways in which data can mislead, rather than inform us. Sometimes we can find results that appear to be interesting, but that are not statistically significant. We may conflate correlation with causality. Or we may be misled by Simpson’s paradox. Paradoxically, as Kaiser Fung points out in his book Numbersense, big data can get us into big trouble, by multiplying the number of blind alleys and irrelevant correlations that we can chase – and so causing us to waste precious time and organizational resources.
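To make Simpson’s paradox concrete, here is a minimal Python sketch. The counts are the often-quoted kidney-stone treatment figures that appear in many statistics texts; they are not taken from this article, and the names used (outcomes, stratum_rates, pooled_rate) are purely illustrative. The sketch compares success rates within each subgroup and then again after pooling the subgroups.

```python
from fractions import Fraction

# Illustrative (successes, trials) counts per treatment and stone size.
# These are the widely quoted kidney-stone figures, used only as an example.
outcomes = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def stratum_rates(treatment):
    """Success rate of one treatment within each subgroup."""
    return {
        stratum: Fraction(successes, trials)
        for stratum, (successes, trials) in outcomes[treatment].items()
    }


def pooled_rate(treatment):
    """Success rate of one treatment with the subgroups merged."""
    successes = sum(s for s, _ in outcomes[treatment].values())
    trials = sum(t for _, t in outcomes[treatment].values())
    return Fraction(successes, trials)


def better(rate_a, rate_b):
    """Name the treatment with the higher success rate."""
    return "A" if rate_a > rate_b else "B"


for stratum in ("small", "large"):
    a = stratum_rates("A")[stratum]
    b = stratum_rates("B")[stratum]
    print(f"{stratum:>6}: A={float(a):.3f}  B={float(b):.3f}  better: {better(a, b)}")

a_all, b_all = pooled_rate("A"), pooled_rate("B")
print(f"pooled: A={float(a_all):.3f}  B={float(b_all):.3f}  better: {better(a_all, b_all)}")
```

Treatment A has the higher success rate in both subgroups, yet treatment B looks better once the subgroups are pooled, because the two treatments were applied to very different mixes of cases. An analysis that only ever looks at the pooled figures would point in the wrong direction.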
But something even more basic can also trip us up: data quality. The most sophisticated techniques, algorithms and analytic technologies are still hostage to the quality of our data. If we feed them garbage, garbage is what they will give us in return.
We cannot automatically assume that data are “true” – in particular, because the data that we are seeking to re-use and re-purpose for our analytics project are likely to have been collected to serve very different purposes. Analytics of the sort that we are undertaking may never have been intended or foreseen. That is why the CRISP-DM model places so much emphasis on “data understanding”; it is important that we first establish whether the data that are available to us are “fit for purpose” – or whether we need to change our purpose, get better data, or both.
Defining data science
So how, then, should we define data science? Spend 10 minutes with Google and you will find plenty of contradictory definitions. Our personal favorite is –
Data Science = Machine Learning + Data Mining + Experimental Method
It may lack mathematical rigor, but it’s short, sweet – and, if we say so ourselves – spot-on!
Martin Willcox –
Senior Director, Go to Market Organisation (Teradata)
Martin is a Senior Director in Teradata’s Go-To Market organisation, charged with articulating to prospective customers, analysts and media organisations Teradata’s strategy and the nature, value and differentiation of Teradata technology and solution offerings.
Martin has 21 years of experience in the IT industry and is listed in dataIQ’s “Big Data 100” as one of the most influential people in UK data-driven business. He has worked for 5 organisations and was formerly the Data Warehouse Manager at Co-operative Retail in the UK and later the Senior Data Architect at Co‑operative Group.
Since joining Teradata, Martin has worked in Solution Architecture, Enterprise Architecture, Demand Generation, Technology Marketing and Management roles. Prior to taking up his current appointment, Martin led Teradata’s International Big Data CoE – a team of Data Scientists, Technology and Architecture Consultants tasked with assisting Teradata customers throughout Europe, the Middle East, Africa and Asia to realise value from their Big Data assets.
Martin is a former Teradata customer who understands the Analytics landscape and marketplace from the twin perspectives of an end-user organisation and a technology vendor. His Strata (UK) 2016 keynote can be found here and a selection of his Teradata Voice Forbes blogs can be found online, including this piece on the importance – and the limitations – of visualisation.
Martin holds a BSc (Hons) in Physics and Astronomy from the University of Sheffield and a Postgraduate Certificate in Computing for Commerce and Industry from the Open University. He is married with three children and is a lapsed supporter of Sheffield Wednesday Football Club. In his spare time, Martin enjoys playing with technology, flying gliders, photography and listening to guitar music.
Dr. Frank Säuberlich – Director Data Science & Data Innovation, Teradata GmbH
Dr. Frank Säuberlich leads the Data Science & Data Innovation unit of Teradata Germany. His responsibilities include making the latest market and technology developments available to Teradata customers. Currently, his main focus is on topics such as predictive analytics, machine learning and artificial intelligence.
Following his studies of business mathematics, Frank Säuberlich worked as a research assistant at the Institute for Decision Theory and Corporate Research at the University of Karlsruhe (TH), where he was already dealing with data mining questions.
His professional career has included positions as a senior technical consultant at SAS Germany and as a regional manager for customer analytics at Urban Science International.
Frank Säuberlich has been with Teradata since 2012. He began as an expert in advanced analytics and data science in the International Data Science team. Later on, he became Director Data Science (International).