In a blog post in early December, I talked about the total cost of big data. That post, and today’s follow-up, stem from a webinar I moderated between Richard Winter, President of Wintercorp, a consultancy specializing in massive databases, and Bob Page, VP of Products at Hortonworks. During the webinar we discussed how to calibrate and calculate the total cost of data, and walked through important lessons about the costs of running workloads on various platforms, including Hadoop. If you haven’t listened to the webinar yet, I recommend you do so.
From the discussion during that session, and from the conversations it has prompted since, I want to address some of the key takeaways about how to succeed when tackling such a large challenge within your organization. Here are a few key points to consider:
1. Start Small: As Bob Page said, “It’s very easy to dream big and go overboard with these projects, but the key to success is starting small.” Make your first project a straightforward proof of concept. There will undoubtedly be challenges in your first big data project, but if you start small and build your knowledge and capabilities, your odds of success on larger projects improve. Don’t make your first venture out of the gate a gargantuan project or a huge volume of data. Once you have some positive results, you will also have the confidence and the backing to build bigger solutions.
2. Address the Entire Scope of Costs: Rather than focusing only on upfront purchase costs, a total cost of data evaluation must incorporate all costs of owning and using data for analytic purposes over time. The framework Richard developed lets you do exactly that: it estimates the total cost of a big data initiative. During the webinar, Richard discussed the five components of system costs:
- hardware acquisition costs
- software acquisition costs
- support costs
- upgrade costs
- environmental/infrastructure costs (power and cooling)
According to Richard, we need to estimate CAPEX and OPEX over five years. Based on his extensive experience, he also recommends assuming a moderate 26 percent annual growth in system capacity. In my experience, most data warehouses double in size every three years, which works out to almost exactly 26 percent per year, so Richard’s assumption is realistic. The business goal, coupled with year-by-year CAPEX and OPEX thresholds, helps keep the team focused. To many technical people, TCOD planning seems like a burden, but it’s actually a career saver. If you control the scope at a relatively low level and leverage a tool such as Richard’s framework, you have a much better chance of success.
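The five cost components and the growth assumption above can be sketched as a simple five-year model. A minimal sketch follows: every dollar figure and rate in it is a hypothetical placeholder you would replace with your own estimates; only the 26 percent annual growth rate comes from Richard’s recommendation, and the cost categories mirror the five components listed above rather than the detailed formulas in his framework.

```python
GROWTH = 0.26            # Richard's assumed annual capacity growth
YEARS = 5

# Hypothetical unit costs -- placeholders, not figures from the webinar
hw_cost_per_tb = 2000    # hardware acquisition, $/TB
sw_cost_per_tb = 1500    # software acquisition, $/TB
support_rate = 0.18      # annual support, fraction of acquisition cost
upgrade_rate = 0.10      # annual upgrades, fraction of acquisition cost
env_cost_per_tb = 300    # power and cooling, $/TB per year

capacity_tb = 100.0      # assumed starting capacity
installed_tb = 0.0
capex = opex = 0.0

for year in range(1, YEARS + 1):
    new_tb = capacity_tb - installed_tb            # capacity added this year
    capex += new_tb * (hw_cost_per_tb + sw_cost_per_tb)
    installed_tb = capacity_tb
    acquisition_base = installed_tb * (hw_cost_per_tb + sw_cost_per_tb)
    opex += acquisition_base * (support_rate + upgrade_rate)  # support + upgrades
    opex += installed_tb * env_cost_per_tb                    # power and cooling
    capacity_tb *= 1 + GROWTH                      # grow capacity for next year

print(f"Capacity after {YEARS} years: {installed_tb:.0f} TB")
print(f"Total CAPEX: ${capex:,.0f}")
print(f"Total OPEX:  ${opex:,.0f}")
```

Note that compounding 26 percent for three years yields 1.26³ ≈ 2.0, which is why a 26 percent annual rate and “doubling every three years” describe the same growth curve.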
3. Comparison Shop: Executives want to know the total cost of carrying out a large project, whether it runs on a data warehouse or on Hadoop. Being able to compare overall costs between the two systems is important to the internal success of the project, and to the success of future projects under evaluation. Before you can compare anything, identify a real workload that your business and executive team can consider funding. A real workload focuses the comparison, as opposed to generalizations and guesses. At some point, a big data platform selection generates two analyses you need to work through: 1) what is this workload costing? and 2) which platform can technically accomplish the goals more easily? Lastly, in a perfect world, the business users should also be able to demonstrate the business value of the workload.
4. Align Your Stakeholders: Many believe that 60 percent of the work in a project should go into planning and 40 percent into execution. To evaluate your big data project appropriately, you must account for as many variables as possible: it is the surprises, and the stakeholders who weren’t aligned, that cause most big cost overruns. Knowing your assets and your stakeholders is key to succeeding, which is why we recommend using the TCOD framework to get stakeholders to weigh in and achieve alignment on the overall plan. Then leverage the results as a project plan you can use toward achieving ROI. With each assumption, each formula, and each cost exposed within a framework like the one Richard discusses during the webinar (his outlines 60 different costs!), you can identify much more easily where the costs differ and, more importantly, why. The TCOD framework brings stakeholders into the decision-making process, forming a committed team instead of bystanders and skeptics.
5. Focus on Data Management: One thing both of our esteemed webinar guests pointed out is the importance of the number of people and applications accessing big data simultaneously. Data is typically the lifeblood of the organization. This includes accessing live information about what is happening now, as well as accurate reporting at the end of the day, month, and quarter. There is a wide spectrum of use cases, and each touches a wide variety of data types. If you haven’t actually built a 100-terabyte database or distributed file system before, be ready for some painful “character building” surprises. Be ready again at 500 terabytes, at a petabyte, and at 5 petabytes. Big data volumes are like the difference between a short weekend hike and making it past base camp on Mount Everest. Your data management skills will be tested.
During the webinar, both experts agreed: Hadoop and the data warehouse can coexist peacefully. Each should be applied to the right workloads, and they should share data as often as possible. Once a workload is defined, it becomes clear that some data belongs in the data warehouse while other types of data may be more appropriate in Hadoop. Once your data is in its enterprise residence, each system will feed its respective applications.
In conclusion, leveraging a framework such as the TCOD framework discussed during the webinar gives you a solid plan for approaching your big data challenges and, ultimately, for solving them.