In recent months, I met with Business Intelligence (BI) teams in several countries to discuss Big Data Analytics. What emerged from those meetings is a clear lack of awareness of what Big Data Analytics can do for the BI team and how it fits within the enterprise data warehouse (EDW). As ambassadors to their business community, BI teams have the opportunity to be at the forefront of new technology trends and to articulate the value of Big Data Analytics to business stakeholders.
The Big Data trend has been with us for a while and there is no shortage of publicly available resources on the subject. However, many of these sources do not help the audience “to see the wood for the trees”! Storage vendors such as Dell and EMC are not helping the situation either: they confuse BI teams by emphasising low-cost storage over the business value of Big Data Analytics. I believe that paying attention to the business value of Big Data Analytics will not only make the BI team look smarter in front of business stakeholders but also make it easier to get funding for Big Data Analytics projects, which many BI teams see as an opportunity to advance their careers.
In the steps below I describe some essentials of Big Data Analytics in technical terms and how they fit into the enterprise data warehousing ecosystem as a unified data architecture (UDA) that supports the next era of analytics and business insights. Many of the examples relate to the airline industry, but the principles apply equally to any industry.
Step 1: Getting to know the essentials of Big Data
The first step to Big Data Analytics is to understand the new technology capabilities such as MapReduce, Hadoop and SQL-MapReduce (SQL-MR), and how they fit within the enterprise ecosystem. It is also important to understand how the design, development and implementation processes differ between a traditional EDW and Big Data Analytics.
For instance, if you are in the airline industry, you would have designed the enterprise data warehouse for transactional reporting and analysis, with a stable, structured schema and a normalised data model.
You probably stored unstructured data such as ticket images, recorded audio conversations with customer service agents and ticketing / fare rules in the database as BLOBs (Binary Large Objects). Furthermore, you may have found it difficult to express complex business rules in declarative SQL, for example financial settlements of inter-line agreements from code share arrangements, open jaw fare rules (say between Zone 1 and Zone 3) and business rules for fuel optimisation. So you may have resorted to procedural code in the form of user-defined functions (UDFs).
But UDFs have numerous limitations that MapReduce, and more specifically SQL-MapReduce (SQL-MR), makes easy to overcome while allowing for high-performance parallel processing.
- What if you could use a MapReduce API (Application Programming Interface) to implement a UDF in the language of your choice?
- What if this approach allowed maximum flexibility through polymorphism, by determining the input and output schemas dynamically at query plan-time based on the available information?
- What if it increased reusability by accepting inputs with many different schemas or with different user-specified parameters?
- Further, what if SQL-MR functions could be used from any of the BI tools that you are familiar with?
As you can guess, SQL-MapReduce (SQL-MR) overcomes the limitations of UDFs by leveraging the power of SQL to enable Big Data Analytics: relational operations are performed efficiently in SQL, while non-relational tasks are left to procedural MapReduce functions.
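To make the idea of polymorphism at query plan-time a little more concrete, here is a minimal Python sketch. It is not the SQL-MR API: the function names `plan_tokenizer` and `run_tokenizer`, the `accumulate` parameter and the type names are all hypothetical and chosen only for illustration. The point is that the output schema is worked out from the input schema and the user's parameters before any row is read, so one function can serve tables of quite different shapes.

```python
def plan_tokenizer(input_schema, text_column, accumulate=()):
    """Plan step (hypothetical): derive the output schema from the input
    schema and the user-specified parameters, before any rows are read."""
    if text_column not in input_schema:
        raise ValueError(f"unknown text column: {text_column}")
    missing = [c for c in accumulate if c not in input_schema]
    if missing:
        raise ValueError(f"unknown pass-through columns: {missing}")
    # Pass-through columns keep their input types; two new columns are added.
    output_schema = {c: input_schema[c] for c in accumulate}
    output_schema["token"] = "varchar"
    output_schema["frequency"] = "integer"
    return output_schema


def run_tokenizer(rows, text_column, accumulate=()):
    """Run step (hypothetical): emit one output row per distinct token
    found in the text column of each input row."""
    for row in rows:
        counts = {}
        for token in str(row[text_column]).lower().split():
            counts[token] = counts.get(token, 0) + 1
        for token, frequency in counts.items():
            out = {c: row[c] for c in accumulate}
            out["token"] = token
            out["frequency"] = frequency
            yield out


if __name__ == "__main__":
    # The same function planned against two differently shaped inputs.
    print(plan_tokenizer({"page": "integer", "body": "varchar"}, "body", ("page",)))
    print(plan_tokenizer({"tweet_id": "bigint", "user": "varchar", "text": "varchar"},
                         "text", ("tweet_id", "user")))
```

The same pair of functions handles a book table and a tweet table without any change to the code, which is the reusability the bullet points above are getting at.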
You will see some examples of this later but, first and foremost, what is MapReduce? MapReduce is a parallel programming framework invented by Google and popularised by Yahoo!. MapReduce enables parallelism for non-relational data. By making parallel programming easier, MapReduce creates a new category of tools that allows BI teams to tackle Big Data problems that were previously challenging to implement. It should be noted that, unlike Teradata’s relational database technology, whose core competency over the last 30 years has been parallelism, MapReduce is not a database technology. Instead, Hadoop MapReduce relies on a file system called the Hadoop Distributed File System (HDFS). Hadoop MapReduce and HDFS are both open source implementations of these Big Data technologies.
Step 2: “Hello World” welcomes you to the world of MapReduce with “Word Count”
Let’s take a look at how Hadoop MapReduce works! When you wrote your first program, you probably tested it by making sure it printed “Hello World” correctly. With MapReduce, the equivalent first test is most likely to be a Word Count program.
A MapReduce (MR) program essentially performs a group-by-aggregation in parallel over a cluster of machines. A programmer provides a map function that dictates how the grouping is performed, and a reduce function that performs the aggregation.
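The group-by-aggregation just described can be sketched in a few lines of plain Python. This is a single-machine illustration of the map, shuffle and reduce steps, not Hadoop code; the function names are my own, and the explicit `shuffle` function stands in for the grouping that the framework performs for you between the two phases.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for raw in line.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word:
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: aggregate all the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hello World", "hello MapReduce world!"]))
```

The map function decides what the grouping key is (here, the lower-cased word) and the reduce function decides how each group is aggregated (here, a sum).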
Let’s say that you want to create a Book Index from Big Data Analytics for Dummies. When writing your MR program, you will provide a map function that dictates how the grouping is performed on paragraphs containing words, and a reduce function that aggregates the words to produce the book index. The MapReduce framework takes responsibility for distributing the map program to the cluster nodes where parts of the book are located; each part is processed there and the results are written to intermediate files. The output of the map phase is a collection of key-value pairs written to intermediate flat files. The output of the reduce phase is a collection of smaller files containing summarised data. In this way the key-value pairs of words are reduced to the aggregates that make up the book index.
Because the MR program runs in parallel, you will notice a tremendous increase in both reading speed (e.g. grouping of paragraphs from Big Data Analytics for Dummies) and processing speed (e.g. summarising and aggregating key-value pairs) that would impress even Johnny 5.
Creating an index of words and counts from Big Data Analytics for Dummies may not be terribly interesting or useful to you. However, the ability to generate such key-value pairs from any multi-structured data source can be put to analytical use: the pairs can be shaped into the dimensions and measures that BI teams are familiar with and integrated with data in the EDW. Instead of creating the Book Index, for example, you may choose to index all the flight numbers, origins and destinations in an airline timetable booklet, which you may find more useful in the airline business.
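As a rough illustration of the book index flow, the Python sketch below splits the paragraphs into partitions, runs the map function on each partition separately, collects the intermediate key-value pairs and reduces them into an index. The partitions stand in for the cluster nodes and an in-memory list stands in for the intermediate files; the stop-word list, the function names and the use of a thread pool are my assumptions, and the threads only illustrate the data flow rather than real cluster parallelism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stop-word list: words too common to be worth indexing.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "for", "to"}


def map_partition(partition):
    """Map step for one partition: emit (word, paragraph_number) pairs."""
    pairs = []
    for number, text in partition:
        for raw in text.lower().split():
            word = raw.strip(".,;:!?\"'()")
            if word and word not in STOP_WORDS:
                pairs.append((word, number))
    return pairs


def reduce_index(pairs):
    """Reduce step: collect the sorted paragraph numbers for every word."""
    index = defaultdict(set)
    for word, number in pairs:
        index[word].add(number)
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}


def build_index(paragraphs, workers=2):
    """Partition the paragraphs, map each partition, then reduce the results."""
    numbered = list(enumerate(paragraphs, start=1))
    partitions = [numbered[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        intermediate = [pair
                        for pairs in pool.map(map_partition, partitions)
                        for pair in pairs]
    return reduce_index(intermediate)


if __name__ == "__main__":
    book = ["Hadoop stores data.", "MapReduce processes data in Hadoop."]
    for word, paragraphs in build_index(book).items():
        print(word, paragraphs)
```

Compared with Word Count, only the value emitted by the map function (a paragraph number instead of 1) and the aggregation in the reduce function (a sorted set instead of a sum) have changed.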
Step 3: Putting MapReduce to work on business problems
Long gone are the days of GSAs (General Sales Agents) enjoying hefty sales commissions from the airlines! The market is highly competitive and organisations are looking to analytics for the best possible decisions. With the ubiquitous availability and convenience of broadband connections, customers’ attitudes and behaviours are changing rapidly. Customers now look online for the best travel and holiday packages. They also listen to the opinions of their friends and to public remarks on social network forums. Interestingly, this is also instrumental in the rapid rate at which huge volumes of data are generated, which opens up the need for Big Data technologies.
What if we could use the multi-structured data from click streams, Facebook and Twitter to improve business performance? What if we could extract the IP Address from the click stream data and correlate it with the customer’s profile in the EDW, along with the best fare for the Round The World Travel deal that the customer is looking for? What if we could extract the sentiment of the customer’s travel experience from Twitter and Facebook data and use that positive / negative experience to provide the Next Best Offer during the customer’s next inbound call to the agent or online visit?
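To give a feel for the sentiment idea, here is a deliberately naive Python sketch. The word lists, the offer names and the scoring rule are all invented for illustration; a real project would use a proper sentiment function rather than a handful of keywords.

```python
# Hypothetical keyword lists, for illustration only.
POSITIVE = {"great", "love", "friendly", "comfortable", "smooth"}
NEGATIVE = {"delayed", "lost", "rude", "cancelled", "cramped"}


def sentiment_score(text):
    """Count positive words minus negative words in one post."""
    words = [w.strip(".,;:!?#@\"'") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def next_best_offer(posts):
    """Pick an (invented) offer from the overall sentiment of a customer's posts."""
    total = sum(sentiment_score(post) for post in posts)
    if total < 0:
        return "service recovery voucher"
    if total > 0:
        return "upgrade offer"
    return "standard fare promotion"


if __name__ == "__main__":
    print(next_best_offer(["Flight delayed and bag lost!"]))
    print(next_best_offer(["Great crew, smooth flight"]))
```

The score per customer is exactly the kind of measure that can be loaded into the EDW and joined to the customer profile before the next call or online visit.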
Step 4: Integrating unstructured and structured data for Big Data Analytics
Here we consider how the integration of multi-structured data in MapReduce and structured data in the EDW can be used to improve business outcomes. Instead of the MapReduce program for Word Count that you wrote previously, you will write a new MapReduce program to extract the key-value pairs for IP Address, flight deals and any other relevant information from the Apache web log files where the customer’s online interaction is recorded. In a later paragraph I will describe how the MapReduce program you wrote is invoked in SQL by means of SQL-MR or, better still, how you can leverage several pre-built functions (without having to write your own MapReduce program) using SQL-MR. For now, let’s assume the data extracted by MapReduce is created as a table in the EDW. The extracted IP Address can then be joined with the Master Reference in the EDW to identify the User ID, which is then used to match the frequency of online visits, the lifetime value of the customer and so on.
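A minimal Python sketch of this step follows. It assumes the web log is in the Apache Common Log Format, that flight deal pages live under a `/deals/` path, and that the Master Reference can be represented as a simple IP-to-User-ID lookup; the path prefix, the lookup and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Apache Common Log Format, e.g.
# 203.0.113.7 - - [10/Oct/2013:13:55:36 +1000] "GET /deals/rtw HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def map_log_line(line):
    """Map step: emit (ip, path) for requests to (assumed) /deals/ pages."""
    match = LOG_PATTERN.match(line)
    if match and match.group("path").startswith("/deals/"):
        yield match.group("ip"), match.group("path")


def visits_by_customer(log_lines, ip_to_user):
    """Join extracted IPs to the master reference and count visits per deal."""
    visits = Counter()
    for line in log_lines:
        for ip, path in map_log_line(line):
            user_id = ip_to_user.get(ip)  # stand-in for the EDW join
            if user_id is not None:
                visits[(user_id, path)] += 1
    return dict(visits)


if __name__ == "__main__":
    lines = [
        '203.0.113.7 - - [10/Oct/2013:13:55:36 +1000] '
        '"GET /deals/round-the-world HTTP/1.1" 200 2326',
        '198.51.100.9 - - [10/Oct/2013:14:05:00 +1000] '
        '"GET /index.html HTTP/1.1" 200 512',
    ]
    print(visits_by_customer(lines, {"203.0.113.7": "U1001"}))
```

In practice the join would of course be done in SQL inside the EDW; the dictionary lookup here only shows where the extracted key meets the reference data.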
Step 5: Flying high with SQL-MR (SQL-MapReduce)!
While MapReduce is good for solving Big Data problems, it can cause a number of bottlenecks, including the requirement to write new software to answer each new business question. Trying to exploit data from HDFS through Apache Hive is another story; let’s not even go there! SQL-MapReduce (SQL-MR), on the other hand, helps to reduce the bottlenecks of MapReduce by allowing maximum flexibility through polymorphism (by dynamically determining the input and output schemas at query plan-time based on the available information). It allows reusability by accepting inputs with many different schemas or with different user-specified parameters. More importantly, you can exploit all types of Big Data using the BI tools that you and your business analysts are familiar with.
Here you will see examples of how you may use the SQL-MR function text_parser (with just a few lines of code) to solve the word count problem / creation of a Book Index for Big Data Analytics for Dummies / extraction of IP Addresses from online clickstream data. You will notice reusability of the SQL-MR function that enables inputs with many different schemas and with different user-specified parameters to create output schema at query time.
You will find that SQL-MapReduce (SQL-MR) provides an excellent framework for jump-starting Big Data Analytics projects with substantial benefits, viz. 3 times faster in development efficiencies, 5 times faster in discovery and 35 times faster with analytics. My colleague, Ross Farrelly, demonstrates with an example how to reduce the pain of MapReduce, which will be of interest to you as well. You can see how SQL-MR provides an excellent framework for customising and developing SQL-MR functions easily with an Integrated Development Environment (IDE).
Exploring and discovering value from Big Data is how you will divide and conquer the volume, velocity, variety and complexity characteristics of Big Data. You will also gain great benefits from seamless integration of the different Big Data technologies as a Unified Data Architecture (UDA) to provide advanced analytics.
Here is another business use case that the SQL-MR functions nPath and GraphGen solve elegantly and efficiently compared to either SQL or MapReduce. Try writing this in SQL or MapReduce and notice the difference! The business problem that we are trying to solve is to identify the most frequent customer activities or sequences of events that lead to disloyalty.
You can see from the chart below that, of all the different channels that customers use to buy airline tickets, it is the online channel that leads to unsuccessful ticket sales. By visualising the sequence of all customer events you will notice that the Online Payment page is where abandonment occurs (noticeable from the thick purple curved line that indicates the strength of the path segment), which provides insight into the issues with the online channel. By taking corrective action ahead of the online payment step you will create customer loyalty and growth in sales.
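For readers who want to see the logic behind this kind of path analysis, here is a small plain-Python sketch. It is not nPath or GraphGen; the event names, the `ticket_issued` goal and the function names are hypothetical. It orders each customer's events by time, counts the transitions between consecutive events (the strength of each path segment) and counts the last event of every path that never reached the goal.

```python
from collections import Counter, defaultdict


def paths_by_customer(events):
    """Order each customer's events by timestamp and return the sequences."""
    grouped = defaultdict(list)
    for customer_id, timestamp, event in events:
        grouped[customer_id].append((timestamp, event))
    return {cid: [event for _, event in sorted(items)]
            for cid, items in grouped.items()}


def transition_counts(paths):
    """Count how often each (from_event, to_event) segment is travelled."""
    counts = Counter()
    for path in paths.values():
        counts.update(zip(path, path[1:]))
    return counts


def abandonment_points(paths, goal="ticket_issued"):
    """Count the final event of every path that never reached the goal."""
    return Counter(path[-1] for path in paths.values() if goal not in path)


if __name__ == "__main__":
    events = [
        ("c1", 1, "search"), ("c1", 2, "select_fare"), ("c1", 3, "online_payment"),
        ("c2", 1, "search"), ("c2", 2, "select_fare"), ("c2", 3, "online_payment"),
        ("c3", 1, "search"), ("c3", 2, "select_fare"),
        ("c3", 3, "online_payment"), ("c3", 4, "ticket_issued"),
    ]
    paths = paths_by_customer(events)
    print(transition_counts(paths).most_common(3))
    print(abandonment_points(paths))
```

The transition counts correspond to the thickness of the lines in a path chart, and the abandonment counts point to the step where corrective action is most needed.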
Here is the SQL-MR code for the above visualisation of ticket purchase path analysis:
If you are all set and ready to go on your first class journey with Big Data Analytics, then check in here. While ‘inflight’, treat yourself to a ‘cocktail’ of analytical functions from a wide-ranging selection of 70+ pre-built SQL-MR functions.
Travel smart, impress your accompanying business stakeholders, double your rewards from analytical outcomes and enjoy your journey with Big Data Analytics! By the way, don’t forget to drop me a note if you found this useful! Bon voyage!
Sundara Raman is a Senior Communications Industry Consultant at Teradata ANZ. He has 30 years of experience in the telecommunications industry that spans fixed line, mobile, broadband and Pay TV sectors. At Teradata, Sundara specialises in Business Value Consulting and business intelligence solutions for communication service providers.