Dark Data – ahem – Matters

Physicists tell us that the universe has tremendous amounts of dark matter – about 84.5% of the total matter in the universe.  But since it neither emits nor absorbs light, dark matter is hard to measure.  Thus it is largely ignored except by a few scientists.

Down here on planet Earth, our data centers have copious amounts of dark data.  As with dark matter, all we really know is that there is a lot of it.  Dark data is the data that passes through a corporation without being saved – it is unused, invisible, and eventually thrown away.  Most corporations know about their dark data but haven’t found a business use for it.  Or they have not tried to capture and use any of it.  Or they fear the costs of grappling with huge volumes of data.  Or all three.

Many of us are familiar with the two most common forms of dark data: social media data and machine generated data.  Social media data comes from people, so it’s complex and messy.  It’s mostly text, person-to-person relationships, or mouse click-streams.  These complex data structures look somewhat like a bowl of spaghetti that must be unraveled before it is useful.  Within this data, we can detect buyers’ purchasing preferences and the people who influence buying decisions.  In contrast, machine generated data comes from sensors, TV set-top boxes, health monitors, mobile per call measurement data (PCMD), or other sources.  Machine generated data tends to be structured, almost relational, so sophisticated skills are not needed to unravel its mysteries.  Machine data has obvious potential for high payback, but its size can be frightening.

Visionary corporations and fast followers are already mining value from dark data to meet business objectives.  How do they do it?  With common sense, a data scientist, and the following steps:

  • Identify data sources that are not used for analysis today
  • Brainstorm several business problems the data could possibly solve
  • Grab a sample of that data for a test against your hypotheses
  • Explore and discover if the dark data subset has value or not
  • Based on the value found:

Don’t assume that high value equals high costs.  As it turns out, both social media and machine generated data start out exceptionally large – up to terabytes per day.  But after the useful data is distilled from all the dark data, the size shrinks 20-to-1, 50-to-1, sometimes more.  In the end, terabytes distill down to gigabytes, which are easier and cheaper to work with.  Similarly, numerous kinds of analysis can be applied without new staffing, skills, or costs: data mining, quantitative analysis, statistics, and even simple joins of dark data to production data can be illuminating.  If you already have a strong analytic discovery process, it’s probably time to step up to new technologies like SQL-MapReduce to extend your competitive lead.
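
To put rough numbers on that distillation, here is a minimal Python sketch.  The function name, the 1 TB-per-day starting volume, and the use of decimal units (1 TB = 1000 GB) are assumptions for illustration only; the 20-to-1 and 50-to-1 ratios are the ones mentioned above.

```python
def distilled_gb(raw_tb_per_day: float, ratio: float) -> float:
    """Return the GB per day left after distilling raw data at ratio-to-1.

    Hypothetical helper for illustration; assumes decimal units (1 TB = 1000 GB).
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return raw_tb_per_day * 1000 / ratio


# Assumed starting point: 1 TB of raw dark data per day.
for ratio in (20, 50):
    print(f"{ratio}-to-1: {distilled_gb(1.0, ratio):.0f} GB/day")
# prints:
# 20-to-1: 50 GB/day
# 50-to-1: 20 GB/day
```

Under those assumptions, a terabyte a day becomes tens of gigabytes a day, a volume many existing analytic environments can likely absorb.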

There are two keys to success: the discovery environment and operationalizing high value discoveries.  First, visionary companies build teams and an environment for continuously prospecting dark data, searching for nuggets of business value.  You might already have a “sand box” discovery environment using SAS®, Teradata Data Labs, or Teradata Aster.  Notably, discovery involves mistakes and dead ends before that “aha” insight arrives.  As author James Joyce is often quoted as saying, “Mistakes are the portals of discovery.”

Second, operationalize the dark data that has a good ongoing payback.  This converts dark data to production data, i.e., it’s no longer dark.  Load it into the data warehouse and/or connect it to the right business processes.  Since the data has high business value, it shouldn’t be too hard to get business users to follow through on execution at this point.

There’s a lot more to dark data than social media and sensors.  There is the deep web (data not found by search engines), email, mobile phone apps, SCADA industrial control systems, documents, and many more.  Some dark data is hidden because it sits in other government agencies, supplier systems, or peer corporate divisions.  There’s an entire universe of diverse information assets.

Many IT shops fear dark data, a simple fear of the unknown.  But for those companies that exploit analytics for competitive advantage, dark data is merely an untapped universe of opportunities waiting to be discovered.  Pull out your telescope and go look for use cases that combine diverse data sets into analytic solutions.  There’s lots of dark data in your data center.

Dan Graham

Dark matter simulation graphic from San Diego Supercomputer Center.
