Are you a business person or executive involved in a data warehouse project where the term “normalization” keeps coming up but you have no idea what they (the technical IT folks) mean? You have heard them talk about “third” normal form and wonder if it is some new health fad or yoga position.
In my prior blog post “Modeling the Data” I talked about how data integration is necessary to address many of your business priorities and that one of the first steps in data integration is to organize your data into tables. A “data model” is a graphical representation of that organization. It serves as a communication tool within the business, within IT, and between the two, as shown below.
So now we get to normalization. Normalization is the process that one goes through to decide in which table a type of data belongs. Let’s take a simple example. I have two tables: one contains loan account information and another contains information about individuals who may be customers (see the figure above). I have a data type called “birth date.” During the normalization process I will ask “What does this data type describe?” Does it describe the account or does it describe an individual? This answer is simple: it describes individuals. You may think that this is a piece of cake. Well, not so fast. The best table for “birth date” may be obvious to us, but many times the “best table fit” for a type of data is not so obvious, and that is why you need definitions for those data types.
One example of an ambiguous data type is “balance.” Does this “balance” describe a point in time for an account? Or does it describe the sum of the balances for a group of accounts at a point in time? Maybe it should be “average balance over a time period.” Maybe it is high balance or low balance or a limit at a point in time. Maybe it is the cleared balance or a ledger balance. Maybe it is a summation of all the deposit balances held by one person at a point in time. A data model is not complete unless all its components (tables and columns) have definitions.
The normalization process can get more involved when we talk about first, second and third normal forms (and sometimes fourth and fifth). Using the birth date example, a table is in third normal form when every type of data in it (e.g. birth date) describes the complete meaning of the table and nothing else. In the above data model example, if I put birth date into the INDIVIDUAL ACCOUNT table, that table would not be in third normal form, because birth date describes only part of the meaning of that table: the individual part. Strictly speaking, assuming the table is identified by the combination of individual and account, a type of data that describes only part of that combination breaks second normal form, so the table would be in only first normal form. By putting birth date into the INDIVIDUAL table it is in third normal form because it describes the complete meaning of the table. In most cases we take a model to third normal form but not fourth or fifth.
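For readers who want to see the birth date example in concrete terms, here is a minimal sketch in Python using the standard-library sqlite3 module. The table and column names (individual, account, individual_account, birth_date, role and so on) and the sample data are assumptions made up for illustration; they are not taken from the data model in the figure. The point is only that birth date is stored once, with the individual, and is reached from an account through the linking table.

```python
import sqlite3


def build_schema(conn):
    """Create three illustrative tables; all names here are hypothetical."""
    conn.executescript("""
        CREATE TABLE individual (
            individual_id INTEGER PRIMARY KEY,
            name          TEXT NOT NULL,
            birth_date    TEXT NOT NULL   -- describes the individual only
        );
        CREATE TABLE account (
            account_id INTEGER PRIMARY KEY,
            open_date  TEXT NOT NULL      -- describes the account only
        );
        -- The linking table is identified by BOTH ids, so every column in it
        -- should describe the individual-and-account pair (e.g. the role).
        CREATE TABLE individual_account (
            individual_id INTEGER NOT NULL REFERENCES individual(individual_id),
            account_id    INTEGER NOT NULL REFERENCES account(account_id),
            role          TEXT NOT NULL,
            PRIMARY KEY (individual_id, account_id)
        );
    """)


def birth_dates_for_account(conn, account_id):
    """Look up birth dates by following the link from account to individual."""
    return conn.execute(
        """
        SELECT i.name, i.birth_date
        FROM individual_account AS ia
        JOIN individual AS i ON i.individual_id = ia.individual_id
        WHERE ia.account_id = ?
        ORDER BY i.name
        """,
        (account_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_schema(conn)
conn.execute("INSERT INTO individual VALUES (1, 'Pat Lee', '1970-05-17')")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [(100, "2015-01-02"), (200, "2018-06-30")],
)
conn.executemany(
    "INSERT INTO individual_account VALUES (?, ?, ?)",
    [(1, 100, "borrower"), (1, 200, "co-signer")],
)

print(birth_dates_for_account(conn, 100))
```

Had birth date been placed in individual_account instead, the same date would be repeated on every account the person holds, and a correction would have to be made in several rows rather than one.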
Why Normalize Your Data?
Why is it important to normalize your data? There are two basic reasons. (1) The first is to eliminate redundancy. When you bring your data together from different sources you will inevitably find duplicated, and sometimes conflicting, values for the same data type across the source systems. For example, the same person may have their name spelled differently on a loan account than on a deposit account. That person does not have two names; the name needs to be represented once, with one value, in the right place in the integrated database. (2) The second reason is to make sure that the data is organized into tables in a way that reflects the business rules, as in our example of birth date describing the individual and not the account. Putting data where it logically belongs makes it easier and more cost effective to maintain over the long term.
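The name example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular integration tool: it assumes both source systems share a customer key (the hypothetical values "C001" and "C002" below), and the function name find_name_conflicts and the sample records are invented for the example. It only flags the keys that carry more than one spelling; deciding which spelling is correct remains a business decision.

```python
from collections import defaultdict


def find_name_conflicts(loan_rows, deposit_rows):
    """Return customer keys that have more than one name spelling.

    Each row is a (customer_key, name) pair from a hypothetical source system.
    """
    names_by_key = defaultdict(set)
    for customer_key, name in list(loan_rows) + list(deposit_rows):
        names_by_key[customer_key].add(name)
    # Keep only the keys where the sources disagree on the name.
    return {
        key: sorted(names)
        for key, names in names_by_key.items()
        if len(names) > 1
    }


# Made-up sample records from two source systems.
loan_rows = [("C001", "Jon Smith"), ("C002", "Ana Ruiz")]
deposit_rows = [("C001", "John Smith"), ("C002", "Ana Ruiz")]

conflicts = find_name_conflicts(loan_rows, deposit_rows)
print(conflicts)
```

Once each conflict is resolved, the integrated database holds a single name per person, and both the loan account and the deposit account point to it.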
So the next time someone brings up the concept of normalization, think about the buckets of data you have in the enterprise and how you need to bring them all together so you can answer those tough business questions. When you bring the data together, you need to eliminate redundancy and organize the data in a logical way that makes sense to the business so that your efforts and design will last over the long term. Normalization is one of the processes to get you there.
Nancy Kalthoff is the product manager and original data architect for the Teradata financial services data model (FSDM) for which she received a patent. She has been in IT for over 30 years and with Teradata for over 20 years, concentrating on data architecture, business analysis, and data modeling.