Spotting the pretenders in Data Science

Wednesday February 15th, 2017

The term “Data Scientist” is often overused, or even abused, in our industry. Just the other morning I was watching TV when a news piece came on about the hottest careers in 2017, and data science was top of the list. Of course, this is good news for those who have been dealing with data for many years in one shape or another, because their skills will be in demand. The bad news is that the industry gets flooded with fakes, all looking to get in on the action. It really is the wild west.

The problem is that the industry has no official certification program comparable to, say, the Microsoft or Cisco certification programs. It is therefore often difficult for an employer to tell whether candidates are really as good as they say they are. Some might have a background in data and may be able to punch out some lines of SQL, but that doesn’t make a data scientist.

You can rely on the old method of making contact with references but we all know that can be fraught with danger as you’ll often reach the prospective employee’s best friend or someone who has been coached on what to say when they are called.
And most of the time, the prospective employee will be unable to show you the types of projects they have previously worked on, because the work may be commercially sensitive or just plain difficult to demonstrate in an interview.

What makes the hiring process so much more difficult is that you are often under pressure to hire, because data-based projects are considered a priority within your organisation and are being carefully watched by management. You must therefore hire quickly and hire quality in order to deliver. The pressure is on you to get it right from the start.

So what more can one do to weed out the fake data scientists?

I’ve listed some interview questions below that will help reveal whether candidates are really as good as they say they are:

Q: If you had a choice of a Machine Learning algorithm, which one would you choose and why?
This is a trick question. Everyone should have a “go-to” algorithm; that’s the easy part of the question. The devil lies in the second part: the “why”. A good Data Scientist should be able to explain why they prefer the algorithm they mentioned and give an explicit answer as to its applicability or flexibility. If they go a step further and compare and contrast their favourite algorithm with an alternative approach, that demonstrates an intricate knowledge of the algorithm.

Q: You’ve just made changes to an algorithm. How can you prove those changes make an improvement?
Once again you’re not seeking the obvious answer, but rather testing the data scientist’s ability to demonstrate reasoning. In a research degree you have to demonstrate certain properties of your research, such as:
• The results are repeatable.
• The before and after tests are performed within a controlled environment, using the same data and the same hardware on both occasions.
• The test data is of sufficient quantity and quality to test your algorithm accurately. For example, don’t test it on a small dataset and then roll it into production against a huge dataset with a lot more variables.

The key with this answer is that you are seeking to see how scientific the applicant is. Such a question may also give you an insight into their background: do they come from an academic background?
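
As a rough illustration of what a strong answer might look like in practice, here is a minimal Python sketch of a controlled before-and-after comparison. Everything in it is an assumption made for illustration: the synthetic dataset, the two hypothetical models (baseline_predict and candidate_predict), and the choice of mean absolute error as the metric. The point is only that both versions are scored on the same seeded data, and that the comparison is repeated across several datasets rather than trusted from a single run.

```python
import random
import statistics


def make_dataset(seed, n=2000):
    """Build a reproducible synthetic dataset: y = 3x + Gaussian noise."""
    rng = random.Random(seed)  # fixed seed, so every run sees the same data
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def baseline_predict(x):
    # Hypothetical "before" model with a slightly wrong slope.
    return 2.8 * x


def candidate_predict(x):
    # Hypothetical "after" model containing the change under test.
    return 3.0 * x


def mean_abs_error(predict, xs, ys):
    """Average absolute prediction error over the dataset."""
    return statistics.fmean(abs(predict(x) - y) for x, y in zip(xs, ys))


def compare(seeds):
    """Per-seed error reduction (baseline minus candidate).

    Both models are scored on exactly the same data for each seed,
    so a positive value means the candidate did better on that dataset.
    """
    diffs = []
    for seed in seeds:
        xs, ys = make_dataset(seed)
        before = mean_abs_error(baseline_predict, xs, ys)
        after = mean_abs_error(candidate_predict, xs, ys)
        diffs.append(before - after)
    return diffs


diffs = compare(range(10))
wins = sum(d > 0 for d in diffs)
print(f"candidate better on {wins} of {len(diffs)} datasets")
print(f"mean error reduction: {statistics.fmean(diffs):.3f}")
```

Because every dataset is generated from a fixed seed, rerunning the script reproduces the same numbers, which covers the repeatability point. A real evaluation would use held-out production-like data and probably a formal significance test on the paired differences.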

Q: Give an example approach for root cause analysis.
Wikipedia states that root cause analysis is “a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence prevents the final undesirable event from recurring; whereas a causal factor is one that affects an event’s outcome, but is not a root cause”
This question seeks to understand whether the applicant has ever performed this type of investigation to troubleshoot an issue in their code. Once again, we’re not looking for an explanation of what root cause analysis is, but rather for how they have used root cause analysis in the past to solve something they were working on.

Q: Give examples of when you would use Spark versus MapReduce.
There are many answers to this question. For example, in-memory processing using RDDs in Spark is faster than MapReduce when the dataset fits in memory, because MapReduce writes intermediate results to disk and so has a higher I/O overhead. You’re also looking for flexibility in a data scientist. There are many approaches a Data Scientist can take that lead to the same outcome. For example, MapReduce may get to the same answer as Spark, albeit a bit more slowly. But knowing when to use which approach, and why, is a valuable skill for a data scientist to have.

Q: Explain the central limit theorem.
Many data scientists come from a background of statistics. This question is testing a basic knowledge of statistics that any statistician should know if they are applying for a role as a Data Scientist. There’s a whole blog that compares and contrasts the role of a Data Scientist and a Statistician, however you may be seeking to build a data science team with a wide range of skills including statistics.

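By the way, the CLT is a fundamental theorem of probability. It says that if you take a large number of independent observations from a distribution with a finite mean and variance, the average of those observations is approximately normally distributed around the true mean, whatever the shape of the original distribution, and the spread of that average shrinks in proportion to the square root of the sample size. There are many other explanations of the CLT available online.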
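
A candidate who really understands the theorem could demonstrate it with a few lines of simulation. The following is a minimal Python sketch, not a proof. The choice of an exponential distribution (which is heavily skewed, with mean 1 and standard deviation 1), the sample size of 50 and the 5,000 trials are all assumptions picked for illustration.

```python
import random
import statistics


def sample_means(n, trials, rng):
    """Means of `trials` independent samples, each of size n, drawn from Exp(1)."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]


rng = random.Random(42)  # fixed seed so the experiment is repeatable
n, trials = 50, 5000
means = sample_means(n, trials, rng)

observed_mean = statistics.fmean(means)
observed_sd = statistics.stdev(means)
expected_sd = 1.0 / n ** 0.5  # sigma / sqrt(n), with sigma = 1 for Exp(1)

# For a normal distribution, roughly 68% of values fall within one
# standard deviation of the mean; the sample means should come close.
within_one_sd = sum(abs(m - 1.0) <= expected_sd for m in means) / trials

print(f"mean of sample means:   {observed_mean:.3f} (population mean is 1.0)")
print(f"sd of sample means:     {observed_sd:.3f} (sigma/sqrt(n) is {expected_sd:.3f})")
print(f"fraction within one sd: {within_one_sd:.3f}")
```

Even though the individual observations are strongly skewed, the sample means cluster symmetrically around the population mean, with a spread close to sigma divided by the square root of n.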

Q: What are your favourite Data Science websites?
This question attempts to find out how passionate they are about data science. A good Data Scientist would obviously bring up the usual sites, such as KDnuggets or Data Science Central. You want to hire Data Scientists who not only use these sites, but also keep their finger on the pulse of what’s happening and engage online with other like-minded individuals. You never know, your next hire may come from one of these sites.

At the end of the day, you are not only assessing their knowledge but what skills and knowledge they would bring to your data science team.

In a previous blog on ‘Seven traits on successful Data Science teams’ I discussed forming a team with varied skills. You don’t want a Data Science team of clones. Each member of your team should bring a specific set of skills to the table that complements the team. Get your interview questions formed well before the interview and you’re well on the way to building that special team.

Category: Ben Davis

About Ben Davis

Ben Davis is responsible for pre-sales activities in the Federal Government market in Canberra. Ben consults across a broad range of government departments, developing strategies to better manage the continual flood of data that these organisations are now facing. Ben is a firm believer that management of data is a continual process, not a one-off project, and that if managed correctly it will deliver multiple benefits to organisations for strategic decision making. Previously Ben spent 6 years at IBM in a Senior Data Governance role for Software Group Australia & New Zealand and 10 years at Fujitsu in pre-sales and consulting roles. Ben holds a Degree in Law from Southern Cross University, a Masters in Business & Technology from the University of New South Wales and is currently studying for his PhD in Information Technology at Charles Sturt University. His thesis studies focus on data security, cloud computing and database encryption standards.
