In this blog, I would like to discuss a different way of modeling data regardless of the method such as Third Normal Form or Dimensional or Analytical datasets. This new way of data modeling will cut down the development cycles by avoiding rework, be agile, and produce higher quality solutions. It’s a discipline that looks at requirements and data as input into the design.
A lot of organizations have struggled getting the data model correct, especially for application, which has a big impact on different phases of the system development lifecycle. Generally, we elicit requirements first where the IT team and business users together create a business requirements document (BRD).
Business users explain business rules and how source data should be transformed into something they can use and understand. We then create a data model using the BRD and produce a technical requirements documentation which is then used to develop the code. Sometimes it takes us over 9 months before we start looking at the source data. This delay in engaging data almost every time causes rework since the design was based only on requirements. The other extreme end of this is when a design is based only on data.
We have always either based the design solely on requirements or data but hardly ever using both methods. We should give the business users what they want and yet be mindful of the realities of data.
It has been almost impossible to employ both methods for different reasons such as traditional waterfall method where BDUF (Big Design Up Front) is introduced without ever looking at the data. Other reasons are we work with data but the data is either created for proof of concept or testing which is farther from the realities of production data. To do this correctly, we need JIT (Just in Time) or good enough requirements and then get into the data quickly and mold our design based on both the requirements and data.
The idea is to get into the data quickly and validate the business rules and assumptions made by business users. Data-driven design is about engaging the data early. It is more than data profiling, as data-driven design inspects and adapts in context of the target design. As we model our design, we immediately begin loading data into it, often by day one or two of the sprint. That is the key.
Early in the sprint, data-driven design marries the perspective of the source data to the perspective of the business requirements to identify gaps, transformation needs, quality issues, and opportunities to expand our design. End users generally know about the day to day business but are not aware of the data.
The data-driven design concept can be used whether an organization is practicing waterfall or agile methodology. It obviously fits very nicely with the agile methodologies and Scrum principles such as inspect and adapt. We inspect the data and adapt the design accordingly. Using DDD we can test the coverage and fit of the target schema, from the analytical user perspective. By encouraging the design and testing of target schema using real data in quick, iterative cycles, the development team can ensure that target schema designed for implementation have been thoroughly reviewed, tested and approved by end-users before project build begins.
Case Study: While working with a mega-retailer, in one of the projects I was decomposing business questions. We were working with promotions and discounts subject area and we had two metrics: Promotion Sales Amount and Commercial Sales Amount. Any item that was sold as part of a promotion is counted towards Promotion Sales and any item that is sold as regular is counted towards Commercial Sales. Please note that Discount Amount and Promotion Sales Amount are two very different metrics. While decomposing, the business user described that each line item within a transaction (header) would have the discount amount evenly proportioned.
For example – Let’s say there is a promotion where if you buy 3 bottles of wine then you get 2 bottles free. In this case, according to the business user, there would be discount amount evenly proportioned across the 5 line items - thus indicating that these 5 line items are on promotion and we can count the sales of these 5 line items toward Promotion Sales Amount.
This wasn’t the case when the team validated this scenario against the data. We discovered that the discount amount was only present for the “get” items and not for the “buy” items. Using our example, discount amount was provided for the 2 free bottles (get) and not for 3 bottles (buy). This makes it hard to calculate Promotion Sales Amount for the 3 “buy” items since it wasn’t known if the customer just bought 3 items or 5 items unless we looked at all the records, which was in millions every day.
What if the customer bought 6 bottles of wine so ideally 5 lines are on promotion and the 6th line (diagram above) is commercial sales or regular sales? Looking at the source data there was no way of knowing which transaction lines are part of promotion and which aren’t.
After this discovery, we had to let the business users know about the inaccuracy for calculating Promotion Sales Amount. Proactively, we designed a new fact to accommodate for the reality of data. There were more complicated scenarios that the team discovered that the business user hadn’t thought of.
In the example above, we had the same item for “buy” and “get” which was wine. We found a scenario, where a customer bought a 6 pack of beer then got a glass free. This further adds to the complexity. After validating the business rules against source data, we had to request additional data for “buy” and “get” list to properly calculate Promotion Sales Amount.
Imagine finding out that you need additional source data to satisfy business requirements nine months into the project. Think about change request for data model, development, testing etc. With DDD, we found this out within days and adapted to the “data realities” within the same week. The team also discovered that the person at the POS system could either pick up a wine bottle and times it by 7 or he could “beep” each bottle one by one. This inconsistency makes a lot of difference such as one record versus 7 records in the source feed.
There were other discoveries we made along the way as we got into the data and designed the target schema while keeping the reality of the data in mind. We were also able to ensure that the source system has the right available grain that the business users required.
Sachin Grover leads the Teradata Agile group within Teradata. He has been with Teradata for 5 years and has worked on development of Solution Modeling Building Blocks and helped define best practices for semantic data models on Teradata. He has over 10 years of experience in IT industry as a BI / DW architect, modeler, designer, analyst, developer and tester.