What does philosophy have to do with Big Data, I hear you ask. Bear with me – all will be explained.
Donald Rumsfeld famously said “There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.”
Donald Rumsfeld, February 2002
But the data world is not this clear-cut; not only are there things we know or don’t, there are also whole domains of data where we are just not sure.
So we need to add two new boxes to the diagram:
Domain 1: “I am certain of its uncertainty” – I can quantify the level of unreliability.
Domain 2: “I am uncertain of its uncertainty” – I know that the data is not reliable, but I don’t know how unreliable it is. Simplistic (and arguable) example: TripAdvisor data; even with a large number of reviews, I can’t be certain that they represent reality.
Let’s first place these two new domains on Rumsfeld’s diagram, then look at a real-life example.
So, where would the “certain uncertainties” and “uncertain uncertainties” fit?
I would place them somewhere around the middle, as in the second diagram.
Let’s look at a real-life example
A Telco wants to sell socio-economic information about its customers, for direct-marketing purposes. The problem is that it knows close-to-nothing about its pre-paid customers: they buy a SIM without giving any personal information.
Can the Telco find out any socio-economic parameters about this population?
The only data we have is usage data: we know the location and duration of calls; we know the location and web-address of web-surfing activities.
Using the Teradata Aster solution, we try the following:
- Identify the gender by analysing web activities. Using known subscribers, we identify the top gender-specific web sites (men use more gambling and sport sites; women use more dating, picture-sharing and online clothes-shopping sites). Hey, don’t shoot the messenger :-) We then apply this to a test set and achieve 75% success in ‘guessing’ the gender. Now we can be certain of our uncertainty when applying this to an unqualified data set.
- Identify higher-income customers by locating frequent domestic flyers. We identify subscribers who made a call from the vicinity of a domestic airport and another call from the vicinity of another airport with a time-gap shorter than the possible driving time between them. Once again, trying this on a known data set results in 80% confidence in this approach. Another certain uncertainty.
- Find where people live, then use this to identify their income level. The team does this by assuming that calls made before 7am and after 10pm are made from home. It identifies calls made at these times from the same location on different dates and takes that location as the subscriber’s home. It then uses publicly available socio-economic data about neighbourhoods to assign an income band to each subscriber. This technique achieves only a 42% match (compared with known data) and is thus discarded. This is an uncertain uncertainty, so the risk of using it is too high.
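The frequent-flyer heuristic from the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Aster implementation used on the project: the `DRIVE_TIME` table, the airport codes, the driving times and the function name `count_flights` are all hypothetical, and each call is assumed to be already tagged with the nearest airport (or `None`).

```python
from datetime import datetime, timedelta

# Hypothetical minimum driving times between airport pairs (illustrative values only).
DRIVE_TIME = {
    frozenset({"SYD", "MEL"}): timedelta(hours=9),
    frozenset({"SYD", "BNE"}): timedelta(hours=10),
}

def count_flights(calls):
    """Count likely flights for one subscriber.

    calls: list of (timestamp, airport_code) tuples, where airport_code is
    None when the call was not made in the vicinity of an airport.
    """
    # Keep only calls made near an airport, in time order.
    near_airport = sorted((t, a) for t, a in calls if a is not None)
    flights = 0
    for (t1, a1), (t2, a2) in zip(near_airport, near_airport[1:]):
        if a1 == a2:
            continue  # same airport: no travel implied
        drive = DRIVE_TIME.get(frozenset({a1, a2}))
        # A gap shorter than the driving time suggests the subscriber flew.
        if drive is not None and t2 - t1 < drive:
            flights += 1
    return flights

calls = [
    (datetime(2024, 3, 1, 8, 0), "SYD"),
    (datetime(2024, 3, 1, 10, 30), "MEL"),  # 2.5 hours later: too fast to drive
    (datetime(2024, 3, 1, 12, 0), None),    # not near an airport, ignored
    (datetime(2024, 3, 3, 9, 0), "MEL"),
    (datetime(2024, 3, 4, 20, 0), "SYD"),   # 35 hours later: could have driven
]
print(count_flights(calls))
```

A subscriber would then be flagged as a frequent flyer once the count passes some threshold over a period; that threshold is a tuning choice to be validated against known subscribers.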
To summarise: we start with completely unknown data and explore several avenues. We use known data to estimate our confidence (our level of uncertainty). Some avenues lead to successful and repeatable results; some are a dead-end (which is a very certain uncertainty). We have identified our uncertain uncertainties and converted our certain uncertainties into known-knowns.
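Being “certain of the uncertainty” also depends on how large the known test set is. As a rough sketch, the Wilson score interval below puts a 95% range around a measured success rate. The test-set size of 1,000 is an assumption made purely for illustration (the sizes used on the project are not stated), and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Assumed numbers: 750 correct gender guesses out of 1,000 known subscribers.
low, high = wilson_interval(correct=750, total=1000)
print(f"measured 75.0%, plausible range {low:.1%} to {high:.1%}")
```

The same 75% measured on a much smaller test set would give a far wider range, which is one way to tell a certain uncertainty from an uncertain one.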
Finally, what about the philosophy angle?
Socrates said (and Plato wrote down) “…only these two things, true belief and knowledge, guide correctly, and that if a man possesses these he gives correct guidance.” (Socrates, in Plato’s Meno Dialogue, 99A).
In other words, you need to know your uncertainties to have true knowledge. Otherwise, it’s only a guess.
Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialising in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via LinkedIn.