During my 30-year analytics career, prospective employers and clients have often asked me: ‘How can you help us with data-driven insights when you have not worked in this industry before?’
Clearly, the description of the data scientist as the mythical unicorn who has computer science skills, statistical knowledge and domain expertise (Figure 1) has had an impact. The proliferation of different analytics disciplines, such as social network analysis, digital analytics, bioinformatics and supply chain analytics, lends weight to the argument that domain expertise definitely matters.
There are also anecdotes on the web of data science projects that went pear-shaped because the analysts were not subject matter experts. A deeper look into these anecdotes reveals that the issues are due not to a lack of domain expertise but to poor data science, such as over-fitting, bad sampling methods and unnecessary data cleansing. Still, the myth that domain expertise trumps all else continues!
Data mining competitions such as those on Kaggle and the KDD Cup have demonstrated the opposite, showing that data science can be successfully outsourced to people without domain expertise. Many companies have run competitions on topics as diverse as optimizing flight routes, predicting ocean health and detecting diabetic retinopathy. Data scientists with little or no expertise in the domain have responded brilliantly with useful solutions. Adam Kowalczyk and I won the KDD Cup on yeast gene regulation prediction with no background in biology. Some data scientists, such as David Vogel and Claudia Perlich, have even won across multiple domains, indicating that data science skills are transferable.
The counterargument to Kaggle’s success is that in these competitions the domain experts have already generated the hypothesis by posing the right business question and preparing the data (Figure 2), so the competitors need only model and test. But in the brave new world of massive data, along with the mathematical tools and computing power to crunch these numbers, the old-world paradigm of hypothesizing before modeling is likely to be challenged. With its approach to language learning, Google has shown a whole new way of understanding the world without any a priori models or theories.
Source: Dr. Bhavani Raskutti, Data Mining Lead, Pacific Brands, “Data Mining in Industry: Putting Theory into Practice”, guest lecture Royal Melbourne Institute of Technology, 2011.
So, if domain expertise is not necessary for the steps of posing the business question and analytical problem definition, what about data acquisition and data preparation?
In my experience, domain knowledge about data capture and transformation processes at the sensors can be acquired through exploration of the raw data. Often, good data scientists become subject experts just by playing with the data and asking domain experts about the anomalies they find. For instance, using just such a process, my analytics team in a manufacturing company identified a long-standing but previously undiscovered anomaly in the summarised sales and inventory feed from a large retailer. This anomaly materially affected the retail inventory reporting and had to be fixed programmatically. Subsequently, my data science team members were the acknowledged retail supply chain experts!
Domain expertise is most relevant, perhaps, in the interpretation of insights, particularly those gained using unsupervised learning about the workings of complex physical processes. An example of just such a situation was the use of the Aster discovery platform to perform root cause analysis of failures in a multi-aircraft fleet from aircraft sensor and maintenance data. While the analysis started with no a priori model, the a posteriori interpretation of the results from the path analysis, and the subsequent follow-up to improve aircraft safety, certainly required domain expertise.
Returning to the original question, ‘How can you help us with data-driven insights when you have not worked in this industry before?’, my response is as follows.
- Machine learning (the intersection of computer science and statistics in Figure 1) brings a fresh perspective that leads to new insights, and a lack of prior domain knowledge can even be an advantage, especially in overcoming long-standing domain bias.
- Provided the machine learners have the curiosity and willingness to learn about the company and domain, along with the humility to ask the domain experts about the subject, they will not only come to understand the domain but, through their questioning, cross-pollinate ideas with the subject matter experts so that the team as a whole is stronger.
So, when hiring a data scientist, focus on the machine learning aspect, particularly the desire to play with the data using a number of different techniques and languages. Consider also the analytical skills needed to question and solve problems iteratively. Partner the data scientists with domain experts so cross-pollination can occur. This, to me, is a better pathway for bringing data science to a business than searching for the elusive unicorn depicted in Figure 1.
Bhavani Raskutti is the Domain Lead for Advanced Analytics at Teradata ANZ. She is responsible for identifying and developing analytics opportunities using Teradata Aster and Teradata’s analytics partner solutions. She is internationally recognised as a data mining thought leader and is regularly invited to present at international conferences on Mining Big Data. She is passionate about transforming businesses to make better decisions using their data capital.