One way to look at progress in technology is to recognize that each new generation provides a better version of what we’ve always wanted. If you look back at the claims for Hollerith punch card-based computing or the first generation of IBM mainframes, you find that the language is recognizable and can be found in marketing material for modern technology.
This year’s model of technology (and those from 50 or 100 years ago) will provide more efficiency, transparency, automation, and productivity. Yeehaw! I can’t wait. Oh, by the way, the current generation of big data technology will provide the same thing.
And, in fact, every generation of technology has fulfilled these enduring promises, improving on what was achieved in the past. What is important to understand is how. It is often the case that in emphasizing the “new newness” of what is coming down the pike, we forget about essential elements of value in the generation of technology that is being surpassed.
This pattern is alive and well in the current transformation taking place in the world of IT related to the arrival of big data technology, which is changing so many things for the better. The problem is that exaggeration of one new aspect of big data processing, "schema on read" (the ability to add structure at the last minute), is obscuring the need for a process to design and communicate a standard structure for your data, which is called "schema on write."
Here’s the problem in a nutshell:
• In the past, the entire structure of a database was designed at the beginning of a project. The questions that needed to be answered determined the data that needed to be provided, and well-understood methods were created to model that data, that is, to provide structure so that the questions could be answered. The idea of “schema on write” is that you couldn’t really store the data until you had determined its structure.
• Relational database technology and the SQL language were used to answer the questions, a huge improvement over having to write a custom program to process each query.
• But as time passed, more data arrived and more questions needed to be answered. It became challenging to manage and change the model in an orderly fashion. People wanted to use new data and answer new questions faster than they could by waiting to get the model changed.
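The schema-on-write idea described above can be sketched in a few lines. This is a minimal illustration using Python's built-in SQLite database, with a hypothetical "sales" table; the point is that the structure is declared before any data is stored, and data that violates it is rejected up front.

```python
import sqlite3

# Schema on write: the table's structure is declared first,
# before any row is stored. (The "sales" table is hypothetical.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT NOT NULL, amount REAL NOT NULL)")
conn.execute("INSERT INTO sales VALUES ('EMEA', 1200.50)")

# A row that violates the declared structure is rejected at write time.
try:
    conn.execute("INSERT INTO sales VALUES ('APAC', NULL)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Because every stored row conforms, any user can query with confidence.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 1200.5
```

The trade-off the bullets describe follows directly: the `CREATE TABLE` statement is what has to be negotiated and changed whenever new data or new questions arrive.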
Okay, let’s stop and look at the good and the bad so far. The good is that structure allowed data to be used more efficiently. The more people who used the structure, the more value it created. So, when you have thousands of users asking questions and getting answers from thousands of tables, everything is super great. Taking the time to manage the structure and get it right is worth it. Schema on write is, after all, what drives business fundamentals, such as finance.
But the world is changing fast and new data is arriving all the time, which is not the strength of schema on write. If a department wants to use a new dataset, its staff can't wait through a long process in which the central model is changed and the new data finally arrives. It's not even clear that every new source of data should be added to the central model. Unless a large number of people are going to use it, why bother? For discovery, schema on read makes excellent sense.
Self-service technologies, such as spreadsheets and other data discovery tools, are used to find answers in this new data. What is lost in this process is the fact that almost all of this data has structure that must be described in some way before the data is used. In a spreadsheet, you need to parse most data into columns. The end user or analyst does this sort of modeling, not the central keeper of the database, the database administrator, or some other specialist. One thing to note about this sort of modeling is that it is done to support a particular purpose. It is not done to support thousands of users. In fact, adding this sort of structure to data is not generally thought of as modeling, but it is.
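The kind of last-minute, single-purpose modeling described above can be sketched as follows. This is an illustrative example with made-up sales records: the raw rows are stored with no declared structure, and the analyst imposes column names and types only at read time, for one question.

```python
import csv
import io

# Schema on read: raw rows are stored as-is, with no declared structure.
# (The data and field names here are made up for illustration.)
raw = "2024-01-05,EMEA,1200.50\n2024-01-06,APAC,980.00\n"

# The analyst decides, at read time, which columns exist and what they mean.
fields = ["date", "region", "amount"]
rows = [dict(zip(fields, r)) for r in csv.reader(io.StringIO(raw))]

# Types, too, are applied on read, and only for the question at hand.
total = sum(float(r["amount"]) for r in rows)
print(total)  # 2180.5
```

Note that the `fields` list is a schema in everything but name: it just lives in one analyst's script, serving one purpose, rather than in a shared model serving thousands of users.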
Schema on write drives the business forward. So, for big data, for any data, structure must be captured and managed. The most profound evidence of this is the way that all of the "born digital" companies such as Facebook, Netflix, LinkedIn, and Twitter have added large-scale SQL databases to their data platforms. These companies were forced to implement schema on write by the needs and scale of their businesses.
Schema on read leads to massive discoveries. Schema on write operationalizes them. They are not at odds; both contribute to the process of understanding data and making it useful. To make the most of all their data, businesses need both schema on read and schema on write.
Dan Woods is CTO and founder of CITO Research. He has written more than 20 books about the strategic intersection of business and technology. Dan writes about data science, cloud computing, mobility, and IT management in articles, books, and blogs, as well as in his popular column on Forbes.com.