Data Science Versus Data Engineering

June 6, 2017


In the third instalment of this blog, we told you that “the analytic discovery process has more in common with research and development (R&D) than with software engineering.” But – symmetrically, if confusingly – what comes after the discovery process typically has more in common with software engineering than with R&D.

The objective of our analytic project will very often be to produce a predictive model that we can use, for example, to predict the level of demand for different products next week, to understand which customers are most at risk of churning, to forecast whether free cash flow next quarter will be above – or below – plan, etc.

For that model to be useful in a large organization, we may need to apply it to large volumes of data – some of it freshly minted – so that we can, for example, continuously predict product demand at the store / SKU level.  In other cases, we may need to re-run the model in near real-time and on an event-driven basis, for example if we are building a recommendation engine to try and cross-sell and up-sell products to customers on the web, based on both the historical preferences that they have expressed in previous interactions with the website and on what they are browsing right now.  And in almost all cases, the output of the model – a sales forecast, a probability-to-churn score, or a list of recommended products – will need to be fed back into one of the transactional systems that we use to run the business in order that we can take some useful action, based on the insight that the model has provided us.

To deliver any value to the business, then, we may need to take a model built in the lab from brown paper and string and use it to crunch terabytes, or petabytes, of data on a weekly, daily – or even hourly basis. Or we may need to simultaneously perform thousands of complex calculations on smaller data-sets – and to send the results back to an operational system within only a few hundred milliseconds. And achieving those sorts of levels of performance and scalability will require that we build a well-engineered system on top of a set of robust and well-integrated technologies.

Our system may have to ingest data from several sources, integrate them, transform the raw data into “features” that are the input to the model, crunch those features using one-or-more algorithms – and send the resulting output somewhere else.  When you hear data engineers talking about building “machine learning pipelines”, it is this fetch-integrate-transform-crunch-send process that they are referring to.
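To make those five stages concrete, here is a minimal sketch of such a pipeline in Python. Everything in it is an assumption made for illustration: the toy sales and promotion records, the function names, and especially the "model", which is just a mean of weekly units with a hypothetical 20% uplift for promoted lines. A production pipeline would read from real source systems and call a properly trained model.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (store, sku)


def fetch() -> Dict[str, List[dict]]:
    """Fetch: pull raw rows from each source (hard-coded toy data here)."""
    return {
        "sales": [
            {"store": "S1", "sku": "A", "units": 10},
            {"store": "S1", "sku": "A", "units": 14},
            {"store": "S2", "sku": "A", "units": 3},
        ],
        "promotions": [
            {"store": "S1", "sku": "A", "on_promo": 1},
            {"store": "S2", "sku": "A", "on_promo": 0},
        ],
    }


def integrate(sources: Dict[str, List[dict]]) -> List[dict]:
    """Integrate: join each sales row to its promotion flag on (store, sku)."""
    promo = {(p["store"], p["sku"]): p["on_promo"] for p in sources["promotions"]}
    return [
        dict(row, on_promo=promo.get((row["store"], row["sku"]), 0))
        for row in sources["sales"]
    ]


def transform(rows: List[dict]) -> Dict[Key, dict]:
    """Transform: aggregate raw rows into one feature row per (store, sku)."""
    features: Dict[Key, dict] = {}
    for row in rows:
        key = (row["store"], row["sku"])
        f = features.setdefault(
            key, {"total_units": 0, "weeks": 0, "on_promo": row["on_promo"]}
        )
        f["total_units"] += row["units"]
        f["weeks"] += 1
    for f in features.values():
        f["mean_units"] = f["total_units"] / f["weeks"]
    return features


def crunch(features: Dict[Key, dict]) -> Dict[Key, float]:
    """Crunch: placeholder model, mean weekly units with an assumed promo uplift."""
    return {
        key: f["mean_units"] * (1.2 if f["on_promo"] else 1.0)
        for key, f in features.items()
    }


def send(forecasts: Dict[Key, float]) -> List[dict]:
    """Send: shape the output for a downstream system (returned, not posted)."""
    return [
        {"store": store, "sku": sku, "forecast_units": round(value, 1)}
        for (store, sku), value in sorted(forecasts.items())
    ]


def run_pipeline() -> List[dict]:
    return send(crunch(transform(integrate(fetch()))))
```

Calling `run_pipeline()` returns one forecast record per store / SKU combination. Keeping each stage as a separate function is a deliberate choice in this sketch: it is what lets engineers test, tune and scale the stages independently once the prototype leaves the lab.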

Building, tuning and optimizing these pipelines at web and Internet of Things scale is a complex engineering challenge – and one that often requires a different set of skills from those required to design-and-prove the prototype model that demonstrates the feasibility and utility of the original concept. Some organizations put data scientists and data engineers into multi-disciplinary teams to address this challenge; others focus their data scientists on the discovery part of the process – and their data engineers on operationalizing the most promising of the models developed in the laboratory by the data scientists. Both of these approaches can work, but it is important to ensure that you have the right balance of both sets of skills. Over-emphasize creativity and innovation and you risk creating lots of brilliant prototypes that are too complex to implement in production; over-emphasize robust engineering and you risk diminishing marginal returns, as the team focusses on squeezing the last drop from an existing pipeline, rather than considering a completely new approach and process.

Of course, not every analytic discovery project will automatically result in a complex implementation project. As we pointed out in a previous blog, the “failure” rate for analytic projects is relatively high – so we may go several times around the CRISP-DM cycle before we settle on an approach worth implementing. And sometimes our objective may be merely to understand. For example, a bricks-and-mortar retailer might want to identify and to understand different shopping missions – and “implementation” might then be about making changes to ranging and merchandising strategies, rather than about deploying a complex, real-time software solution.

Whilst there are several ways of employing the insight from an analytic discovery project, the one thing that they all have in common is this: change. As a former boss once said to one of us: old business process + expensive new technology = expensive old business process.  And achieving meaningful and significant change in large and complex organizations is never merely about data, analytics and engineering – it’s also about organizational buy-in, culture and good change management.  Whilst data scientists and data engineers often have different backgrounds and different skill sets, one thing that they often have in common – and that they may take for granted in others – is a belief that the data know best. Since plenty of other stakeholders see the business through a variety of entirely different lenses, securing the organizational buy-in required to action the insight derived from an analytic project is often as complex a process as the most sophisticated machine learning pipeline.  Involve those other stakeholders often and early if you want to discover something worth learning in your data – and if you want that learning to change the way that you do business.

Martin Willcox –
Senior Director, Go to Market Organisation (Teradata)

Martin is a Senior Director in Teradata’s Go-To Market organisation, charged with articulating to prospective customers, analysts and media organisations Teradata’s strategy and the nature, value and differentiation of Teradata technology and solution offerings.
Martin has 21 years of experience in the IT industry and is listed in dataIQ’s “Big Data 100” as one of the most influential people in UK data-driven business. He has worked for 5 organisations and was formerly the Data Warehouse Manager at Co-operative Retail in the UK and later the Senior Data Architect at Co‑operative Group.

Since joining Teradata, Martin has worked in Solution Architecture, Enterprise Architecture, Demand Generation, Technology Marketing and Management roles. Prior to taking up his current appointment, Martin led Teradata’s International Big Data CoE – a team of Data Scientists, Technology and Architecture Consultants tasked with assisting Teradata customers throughout Europe, the Middle East, Africa and Asia to realise value from their Big Data assets.

Martin is a former Teradata customer who understands the Analytics landscape and marketplace from the twin perspectives of an end-user organisation and a technology vendor. His Strata (UK) 2016 keynote can be found here and a selection of his Teradata Voice Forbes blogs can be found online, including this piece on the importance – and the limitations – of visualisation.

Martin holds a BSc (Hons) in Physics and Astronomy from the University of Sheffield and a Postgraduate Certificate in Computing for Commerce and Industry from the Open University. He is married with three children and is a lapsed supporter of Sheffield Wednesday Football Club. In his spare time, Martin enjoys playing with technology, flying gliders, photography and listening to guitar music.

Dr. Frank Säuberlich – Director Data Science & Data Innovation, Teradata GmbH

Dr. Frank Säuberlich leads the Data Science & Data Innovation unit of Teradata Germany. Part of his responsibility is to make the latest market and technology developments available to Teradata customers. Currently, his main focus is on topics such as predictive analytics, machine learning and artificial intelligence.

Following his studies of business mathematics, Frank Säuberlich worked as a research assistant at the Institute for Decision Theory and Corporate Research at the University of Karlsruhe (TH), where he was already dealing with data mining questions.

His professional career has included positions as a senior technical consultant at SAS Germany and as regional manager for customer analytics at Urban Science International.

Frank Säuberlich has been with Teradata since 2012. He began as an expert in advanced analytics and data science in the International Data Science team. Later on, he became Director Data Science (International).