Analytics

Data-Driven Design: Smart Modeling in the Fast Lane

Posted on: February 24th, 2015 by Guest Blogger 2 Comments

 

In this blog, I would like to discuss a different way of modeling data that applies regardless of the method, whether Third Normal Form, dimensional, or analytical datasets. This way of modeling cuts down development cycles by avoiding rework, supports agility, and produces higher-quality solutions. It is a discipline that treats both requirements and data as inputs to the design.

Many organizations have struggled to get the data model right, especially for applications, and that model has a big impact on every phase of the system development lifecycle. Generally, we elicit requirements first: the IT team and business users together create a business requirements document (BRD).

Business users explain the business rules and how source data should be transformed into something they can use and understand. We then create a data model from the BRD and produce technical requirements documentation, which is then used to develop the code. Sometimes it takes us over nine months before we start looking at the source data. This delay in engaging the data almost always causes rework, since the design was based only on requirements. The other extreme is a design based only on the data.

We have almost always based the design solely on requirements or solely on data, but hardly ever on both. We should give business users what they want while remaining mindful of the realities of the data.

Employing both has been almost impossible for several reasons. In the traditional waterfall method, BDUF (Big Design Up Front) happens without ever looking at the data. In other cases we do work with data, but it is data created for a proof of concept or for testing, which is far removed from the realities of production data. To do this correctly, we need JIT (just-in-time), good-enough requirements, and then we need to get into the data quickly and mold our design around both the requirements and the data.

The idea is to get into the data quickly and validate the business rules and assumptions made by business users. Data-driven design is about engaging the data early. It is more than data profiling, because data-driven design inspects and adapts in the context of the target design. As we model our design, we immediately begin loading data into it, often by day one or two of the sprint. That is the key.

Early in the sprint, data-driven design marries the perspective of the source data to the perspective of the business requirements to identify gaps, transformation needs, quality issues, and opportunities to expand the design. End users generally know the day-to-day business but are not aware of the underlying data.

The data-driven design concept can be used whether an organization practices a waterfall or an agile methodology. It obviously fits very nicely with agile methodologies and Scrum principles such as inspect and adapt: we inspect the data and adapt the design accordingly. Using DDD we can test the coverage and fit of the target schema from the analytical user's perspective. By encouraging the design and testing of the target schema with real data in quick, iterative cycles, the development team can ensure that the target schema designed for implementation has been thoroughly reviewed, tested, and approved by end users before the project build begins.

Case Study: On one project with a mega-retailer, I was decomposing business questions. We were working in the promotions and discounts subject area and had two metrics: Promotion Sales Amount and Commercial Sales Amount. Any item sold as part of a promotion counts toward Promotion Sales, and any item sold at the regular price counts toward Commercial Sales. Please note that Discount Amount and Promotion Sales Amount are two very different metrics. While decomposing the questions, the business user explained that each line item within a transaction (header) would have the discount amount evenly apportioned.

For example, let's say there is a promotion where, if you buy 3 bottles of wine, you get 2 bottles free. In this case, according to the business user, the discount amount would be evenly apportioned across the 5 line items, indicating that these 5 line items are on promotion and that we can count their sales toward Promotion Sales Amount.

This wasn't the case when the team validated the scenario against the data. We discovered that the discount amount was present only for the "get" items and not for the "buy" items. In our example, the discount amount was provided for the 2 free bottles (get) and not for the 3 purchased bottles (buy). This makes it hard to calculate Promotion Sales Amount for the 3 "buy" items, since there was no way to know whether the customer had bought just 3 items or 5 items without scanning all the records, which ran into the millions every day.

What if the customer bought 6 bottles of wine, so that ideally 5 lines are on promotion and the 6th line is commercial (regular) sales? Looking at the source data, there was no way of knowing which transaction lines were part of the promotion and which were not.
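To make that validation concrete, here is a minimal sketch (in Python with pandas) of the kind of check the team ran against the line items. The column names and sample values are assumptions for illustration, not the retailer's actual feed; the point is simply to flag transactions where only some lines carry a discount, which is exactly the "buy" versus "get" gap described above.

```python
import pandas as pd

# Hypothetical line-item extract: one row per transaction line (column names assumed).
lines = pd.DataFrame({
    "txn_id":       [1, 1, 1, 1, 1, 1],
    "line_no":      [1, 2, 3, 4, 5, 6],
    "item":         ["wine"] * 6,
    "sales_amt":    [10.0, 10.0, 10.0, 10.0, 10.0, 10.0],
    "discount_amt": [0.0, 0.0, 0.0, 10.0, 10.0, 0.0],  # discount only on the two "get" bottles
})

# Business-user assumption: every promoted line carries a share of the discount.
# Reality check: within each transaction, count discounted vs. undiscounted lines.
check = (lines.assign(discounted=lines["discount_amt"] > 0)
              .groupby("txn_id")["discounted"]
              .agg(total_lines="size", discounted_lines="sum"))

# Transactions where only some lines carry a discount violate the stated rule:
# we cannot tell which undiscounted lines were "buy" items and which were regular sales.
mixed = check[(check["discounted_lines"] > 0) &
              (check["discounted_lines"] < check["total_lines"])]
print(mixed)
```

Run against the wine example, the check surfaces the problem immediately: only the two free bottles carry a discount, so the three "buy" lines and the regular sixth bottle are indistinguishable.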

After this discovery, we had to let the business users know about the inaccuracy in calculating Promotion Sales Amount. Proactively, we designed a new fact to accommodate the reality of the data. The team also discovered more complicated scenarios that the business user hadn't thought of.

In the example above, the "buy" and "get" items were the same: wine. We then found a scenario where a customer bought a six-pack of beer and got a glass free, which adds further complexity. After validating the business rules against the source data, we had to request an additional "buy" and "get" list to properly calculate Promotion Sales Amount.

Imagine finding out nine months into the project that you need additional source data to satisfy the business requirements. Think about the change requests for the data model, development, testing, and so on. With DDD, we found this out within days and adapted to the "data realities" within the same week. The team also discovered that the cashier at the POS could either scan one wine bottle and multiply it by 7 or scan each bottle one by one. This inconsistency makes a big difference: one record versus seven records in the source feed.
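The quantity inconsistency is easy to neutralize once you know it exists. Here is a small sketch, again with assumed column names and sample values, that rolls both POS behaviors up to the same grain (one row per transaction and item) before any promotion logic is applied.

```python
import pandas as pd

# Two ways the same 7-bottle purchase can arrive from the POS (columns assumed):
# transaction 100 is one line with quantity 7; transaction 101 is seven lines of quantity 1.
feed = pd.DataFrame({
    "txn_id":    [100, 101, 101, 101, 101, 101, 101, 101],
    "item":      ["wine"] * 8,
    "quantity":  [7, 1, 1, 1, 1, 1, 1, 1],
    "sales_amt": [70.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0],
})

# Normalize to one row per (transaction, item): total units and total sales.
normalized = (feed.groupby(["txn_id", "item"], as_index=False)
                   .agg(units=("quantity", "sum"), sales_amt=("sales_amt", "sum")))
print(normalized)  # both transactions now show 7 units and 70.00 in sales
```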

We made other discoveries along the way as we got into the data and designed the target schema with the reality of the data in mind. We were also able to confirm that the source system provided the grain the business users required.


Sachin Grover leads the Agile group within Teradata. He has been with Teradata for 5 years, has worked on the development of Solution Modeling Building Blocks, and helped define best practices for semantic data models on Teradata. He has over 10 years of experience in the IT industry as a BI/DW architect, modeler, designer, analyst, developer, and tester.

Lots of Big Data Talk, Little Big Data Action

Posted on: February 11th, 2015 by Manan Goel No Comments

 

 Apps Are One Solution To Big Data Complexity

Offering big data apps is a great way for the analytics industry to put its muscle where its mouth is. Organizations face significant hurdles in trying to benefit from the opportunities of big data, and extracting rapid value from it remains challenging.

Limited skill sets and complexity make it challenging for analytic professionals to rapidly and consistently derive actionable insights that can be easily operationalized. To ease companies into realizing bankable big data benefits, Teradata has developed a collection of big data apps – pre-built templates that act as time-saving shortcuts to value. Teradata is taking the lead in offering advanced analytic apps, powered by the Teradata Aster AppCenter, that deliver sophisticated results from big data analytics.

The big data apps from Teradata are industry-tailored analytical templates that address business challenges specific to each category. Purpose-built apps for retail address path to purchase and shopping cart abandonment. Apps for healthcare map paths to surgery and drug prescription affinity. Financial apps tackle omni-channel customer experiences and fraud. The industries covered include consumer finance, entertainment and gaming, healthcare, manufacturing, retail, communications, and travel and hospitality.

Big data apps are pre-built templates that can be further configured, with help from Teradata professional services, to address specific customer needs or goals. Organizations have found that specialized big data analytic skills such as Python, R, Java, and MapReduce take time to build and require highly specialized staff. Apps, by contrast, deliver fast time to value with self-service analytics. The purpose-built apps can be deployed quickly and configured or customized with minimal effort to deliver swift analytic value.

For app distribution, consumption and custom app development, the AppCenter makes big data analytics secure, scalable and repeatable by providing common services to build, deploy and consume apps.

With the apps and related solutions like AppCenter from Teradata, analytic professionals spend less time preparing data and more time doing discovery and iteration to find new insights and value.

Get more big data insights now!

 

 

Teradata Aster AppCenter: Reduce the Chasm of Data Science

Posted on: February 11th, 2015 by John Thuma No Comments

 

Data scientists are doing amazing things with data and analytics. The data surface area is exploding, with new data sources being invented and exploited almost daily. The Internet of Things is being realized; it is no longer just theory, it is in practice. Tools and technology are making it easier for data scientists to develop solutions that impact organizations. Rapid-fire methods for predicting churn, providing a personalized next best offer, or predicting part failures are just some of the new insights being developed across a variety of industries.

But challenges remain. Data science has a language and technique all its own. Strange terms like machine learning, Naïve Bayes, and support vector machines are creeping into our organizations. These topics can be very difficult to understand if you are not trained in them or have not spent time mastering them.

There is a chasm between business and data science. Closing this gap and operationalizing big data analytics is paramount to the success of all data science efforts. We must democratize big data discovery and enable anyone to participate. The Teradata Aster AppCenter is a big step forward in bridging the gap between data science and the rest of us: it makes big data analytics consumable by the masses.

Over the past two years I have personally worked on projects with organizations spanning various vertical industries.  I have engaged with hundreds of people across retail, insurance, government, pharmaceuticals, manufacturing, and others.  The one question that they all ask is: “John, I have people that can develop solutions with Aster; how do I integrate these solutions into my organization?  How can other people use these insights?”  Great questions!

I didn't have an easy answer, but now I do. The Teradata Aster AppCenter provides a simple-to-use, point-and-click web interface for consuming big data insights. It wraps the complexity of the great work that data scientists do in a simple interface that anyone can use. It allows business people to have a conversation with their data like never before. Data scientists love it because it gives them a tool to showcase their solutions and their hard work.

Just the other day I deployed my first application in the Teradata Aster AppCenter. I had never built one before, nor did I have any training or a phone-a-friend option. I also didn't want training, because I am a technology skeptic: technology has to be easy to use. So I put it to the test, and here is what I found.

The interface is intuitive, and I had a simple application deployed in 20 minutes. Another 20 minutes went by and I had three visualization options embedded in my app. I then constructed a custom user interface with drop-down menus to make the application more flexible and interactive. In that hour I built an application that anyone can use without writing a single line of code or being a technical unicorn. I was blown away by the simplicity and power. I am now able to deploy Teradata Aster solutions in minutes and publish them out to the masses. The Teradata Aster AppCenter reduces the chasm between data science and the rest of us.

In conclusion, the Teradata Aster AppCenter passed my tests. Please don't take my word for it; try it out. We also have an abundance of videos, training materials, and templates on the way to guide your experience. I am really looking forward to seeing new solutions developed and watching the evolution of the platform. The Teradata Aster AppCenter gives data science a voice and a platform for next-generation analytic consumption.

Business Highlights in Big Data History

Posted on: January 22nd, 2015 by Chris Twogood No Comments

 

If you’re relatively new to Big Data, you might find this snapshot of the last 20 years of big data history helpful. Hopefully, you can build your understanding and figure out where you reside in the journey of Big Data development, adoption and optimization.

Gentlemen, Start Your Spreadsheets (1995) The World Wide Web explodes and business intelligence data begins piling up – in Excel documents.

Data Storage and BI (1996) The influx of huge quantities of information brought about a new challenge. Digital storage quickly became more cost-effective for storing data than paper – and BI platforms began to emerge.

Houston, We Have a Problem (1997) The term Big Data was used for the first time when researchers at NASA wrote an article identifying that the rise of data was becoming an issue for current computer systems.

Yes, Big Data was first considered a problem.

Ask Nicely (1998) By the time enough data could be stored, IT departments were responsible for 80% of business intelligence access. At this time, "predictive analysis" forecasting was also starting to change how organizations do business.

A Lotta Data (2000) The quantification of new information creation begins to be studied on an annual basis. In 2000, 1.5 exabytes of unique information are documented for the year.

Control Freaks (2001) Papers were being written about controlling the big data problem. To describe it, they had to define it, and they did so with the three V's – data volume, velocity, and variety – as coined by Doug Laney, now a Gartner analyst. Work begins on capabilities like language processing, predictive modeling, and data gathering.

It Was A BIG Year (2003) The amount of digital information created by computers and other data systems in 2003 blows past the amount of information created in all of human (or big data) history prior to 2003.

Problem Child Becomes Prodigy (2005) Web 2.0 companies are assessed by their database management abilities; the issue becomes a given, a core competency. Big Data begins to emerge as an opportunity. Apache Hadoop, soon to become a foundation of government big data efforts, is created.

Not Your Dad's Oracle (2005) Alternatives to Oracle that focus more on end-user usability emerge. Big Data solutions that work the way people work – collaboratively, on the fly, and in real time – become the gold standard.

Taming the Big Data Explosion (2006) A solution for handling the explosion of big data from the web becomes more prevalent: Hadoop. Free to download, use, enhance, and improve – like Java in the '90s. Hadoop is a 100% open-source way of storing and processing data that enables distributed parallel processing of huge amounts of data.

Can I Interest You In A Flood? (2008) The BIG part of big data starts to show itself. The number of devices connected to the Internet exceeds the world’s population.

Real questions asked at the time: by 2015, will the internet be 500x larger than it is now? Will IP traffic reach one zettabyte?

How Big Is Big? (2008) The term "Big Data" begins catching on among techies. Wired magazine mentions the "data deluge." The "petabyte age" is coined; too technical to be widely understood, it hardly matters, as it is soon replaced by bigger measures like exabytes, zettabytes, and yottabytes.

No They Didn’t (2008) Yes, they said it. Big Data computing is perhaps the biggest innovation in computing in the last decade. We have only begun to see its potential.

Business Intelligence became a top priority for CIOs in 2009.

BI this... BI that (2010) Recognition and use of Business Intelligence (BI) becomes common as 35% of rank-and-file enterprises begin to employ "pervasive" business intelligence. Look at best-in-class organizations and you find adoption of 67% – and it's moving to self-service.

Moving On Up (2011) Business Intelligence matures, with trends emerging in cloud computing, data visualization, and predictive analytics, and big data is on the horizon.

Big Government and Big Data (2012) The Obama administration announces the Big Data Research and Development Initiative – 84 separate programs. The National Science Foundation publishes “Core Techniques and Technologies for Advancing Big Data Science & Engineering.”

Even Better Than a Rewards Program (2013) (Big) Data as “a real business asset used to gain competitive advantage in the market” becomes accepted. The widespread drive to understand and make use of big data – to remain relevant – is well underway.

Want to leverage big data analytics for better and more efficient business? Learn more about Teradata’s big data solutions.

 

Real-Time SAP® Analytics: a look back & ahead

Posted on: August 18th, 2014 by Patrick Teunissen 1 Comment

 

On April 8, I hosted a webinar with my guest Neil Raden, an independent data warehouse analyst. The topic was accessing SAP ERP data for business analytics purposes, building on Neil's findings in his recent white paper about the complexities of integrating SAP data into the enterprise data warehouse. The attendance and participation in the webinar clearly showed that there is a lot of interest and expertise in this space. As I think back on the questions we received, both Neil and I were surprised by how many related to "real-time analytics on SAP."

Something has drastically changed in the SAP community!

Note: the topic of real-time analytics is not new! I won't forget Neil's reaction when the questions came up; it was as if he were in a time warp back to the early 2000s, when he first wrote about the topic. Interestingly, Neil's work is still very relevant today.

This made me wonder: why is this so prominent in the SAP space now? What has changed in the SAP community? What has changed in the needs of the business?

My hypothesis is that when Neil originally wrote his paper (in 2003), R/3 was SAP (or SAP was R/3, whichever order you prefer) and integration with other applications or databases was not yet on SAP's radar. This began to change when SAP BW became more popular, and it gained even more traction with the release of SAP's suite of tools and modules (CRM, SRM, BPC, MDM, etc.) -- although these solutions still clearly had the true SAP "Made in Germany" DNA. Then came SAP's planning tool APO, Netweaver XI (later PI), and the 2007 acquisition of Business Objects (including BODS), all of which accelerated SAP's application integration techniques.

With Netweaver XI/PI and Business Objects Data Services, it became possible to integrate SAP R/3 in real time, making use of advanced messaging techniques like IDocs, RFCs, and BAPIs. These techniques work very well for transaction system integration (EAI); however, they do not have what it takes to provide real-time data feeds to the integrated data warehouse. At best a hybrid approach is possible. Back in 2000 my team worked on such a hybrid project at Hunter Douglas (Luxaflex), combining classical ABAP-driven batch loads for managerial reports with real-time capabilities (BAPI calls) for more operational reporting needs. That was state of the art in those days!

Finally, in 2010 SAP acquired Sybase and added a best-of-breed data replication tool to the portfolio. With this integration technique, changed data is captured directly from the database, taking the load off the R/3 application servers. This offers huge advantages, so it makes sense that this is now the recommended technique for loading data into the SAP HANA appliance.
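As a rough, purely conceptual illustration of why log-based replication beats periodic batch extraction for freshness and load, the Python sketch below applies a captured change log to a warehouse copy instead of re-reading the whole source table. The operation codes, keys, and field names are invented for the example.

```python
# Conceptual sketch only: contrasts a full batch reload with applying a change log
# (the essence of database-level replication). Table and field names are invented.

warehouse = {1: {"doc": "INV-1", "amount": 100.0},
             2: {"doc": "INV-2", "amount": 250.0}}

# Change data capture delivers just the deltas recorded since the last sync.
change_log = [
    {"op": "U", "key": 2, "row": {"doc": "INV-2", "amount": 275.0}},  # update
    {"op": "I", "key": 3, "row": {"doc": "INV-3", "amount": 50.0}},   # insert
    {"op": "D", "key": 1, "row": None},                               # delete
]

def apply_changes(target, log):
    """Apply inserts, updates, and deletes captured from the source database log."""
    for change in log:
        if change["op"] in ("I", "U"):
            target[change["key"]] = change["row"]
        elif change["op"] == "D":
            target.pop(change["key"], None)
    return target

apply_changes(warehouse, change_log)
print(warehouse)  # reflects the source without touching the R/3 application servers
```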

“What has changed is that SAP has put the need for real-time data integration with R/3 on the (road) map!”

The main feature of our upcoming release of Teradata Analytics for SAP Solutions version 2.2 is a new data replication technique. Almost as if to prove my case: 10 years ago I was in the middle of a project for a large multinational company when one of my lead engineers, Arno Luijten, came to me with a proposal to try out a data replication tool to address the latencies introduced by extracting large volumes of changed data from SAP. We didn't get very far at the time, because the technology and the business expectations were not ready for it. Fast forward to 2014 and we're re-engaged with this same customer. Luckily, this time the business needs and the technology capabilities are ready to deliver!

In the coming months my team and I would like to take you on our SAP analytics journey.

In my next blogs we will dive into the definition (and relativity) of real-time analytics and discuss the technical complexities of dealing with SAP, including the pool and cluster tables. I hope I've got you hooked for the rest of the series!

Garbage In-Memory, Expensive Garbage

Posted on: July 7th, 2014 by Patrick Teunissen 2 Comments

 

A first anniversary is always special, and in May I marked my first with Teradata. In my previous lives I spent almost ten years with Shell and seventeen years building my own businesses focused on data warehousing and business intelligence solutions for SAP. With my last business, NewFrontiers, I leveraged all twenty-seven years of ERP experience to develop a shrink-wrapped solution for SAP analytics.

In all that time, the logical design of SAP has stayed the same. To be clear, when I say SAP, I mean R/3, or "R/2 with a mouse" if you're old enough to remember. Today R/3 is also known as the SAP Business Suite, ERP, or whatever. Anyway, when I talk about SAP I mean the application that made the company rightfully world famous and that is used for transaction processing by almost all large multinational businesses.

My core responsibility at Teradata is the engineering of the analytical solution for SAP. My first order of business was focusing my team on delivering an end-to-end business analytic product suite to analyze ERP data that is optimized for Teradata. Since completing our first release, my attention turned to adding new features to help companies take their SAP analytics to the next level. To this end, my team is just putting the finishing touches on a near real-time capability based on data replication technology. This will definitely be the topic of upcoming blogs.

Over the past year, the integration and optimization process has greatly expanded my understanding of Teradata's differentiated capabilities. The one capability that draws the attention of types like me, "SAP guys and girls," is Teradata Intelligent Memory. In-memory computing has become a popular topic in the SAP community, and the computer's main memory is an important part of Teradata Intelligent Memory. However, Intelligent Memory is more than "in-memory": the database addresses the fact that not all memory is created equal and delivers a solution that uses the "right memory for the right purpose." In this solution, the most frequently used data, the hottest, is stored in memory; warm data is processed from solid state drives (SSD); and colder, less frequently accessed data from hard disk drives (HDD). This allows your business to make decisions on all of your SAP and non-SAP data while coupling in-memory performance with spinning-disk economics.
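To make the "right memory for the right purpose" idea tangible, here is a toy Python sketch of temperature-based placement. It is not how Teradata Intelligent Memory is implemented (the real feature manages placement automatically inside the database); the thresholds and statistics below are arbitrary assumptions used only to show the kind of decision being made.

```python
from datetime import datetime, timedelta

# Toy model of multi-temperature placement; thresholds and access stats are invented.
def choose_tier(accesses_last_30_days: int, last_access: datetime) -> str:
    """Return a storage tier based on how hot the data is."""
    age = datetime.now() - last_access
    if accesses_last_30_days > 100 and age < timedelta(days=1):
        return "DRAM (in-memory)"   # hottest data
    if accesses_last_30_days > 10 and age < timedelta(days=30):
        return "SSD"                # warm data
    return "HDD"                    # cold, rarely touched history

# Example: current sales data vs. five-year-old closed transactions.
print(choose_tier(500, datetime.now() - timedelta(hours=2)))  # DRAM (in-memory)
print(choose_tier(2, datetime.now() - timedelta(days=900)))   # HDD
```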

This concept of using the "right memory for the right purpose" is very compelling for our Teradata Analytics for SAP Solutions. Often when I explain what Teradata Analytics for SAP Solutions does, I draw a line between DATA and CONTEXT. Computers need DATA like cars need fuel, and the CONTEXT is where you drive the car. Most people do not go to the same place every time, but they do go to some places more frequently than others (e.g., work, freeways, coffee shops) and under more time pressure (e.g., traffic).

In this analogy, organizations almost always start building an "SAP data warehouse" by loading all the DATA kept in the production database of the ERP system. We call that process the initial load. In the Teradata world we often have to do this multiple times, because building an integrated data warehouse usually involves sourcing from multiple SAP ERPs. Typically, these ERPs vary in age, history, version, governance, MDM, and so on. Archival is a non-trivial process in the SAP world, and the majority of SAP systems I have seen carry many years of old data. Loading all of this SAP data in memory is an expensive and reckless thing to do.

Teradata Intelligent Memory provides CONTEXT by storing the hot SAP data in memory, guaranteeing lightning-fast response times. It then automatically moves less frequently accessed data to lower-cost, lower-performance storage across the SSD and HDD media spectrum. The resulting combination of Teradata Analytics for SAP and Teradata Intelligent Memory delivers in-memory performance with very high memory hit rates at a fraction of the cost of pure in-memory solutions. And in this business, cost is a huge priority.

The title of this blog is a variation on the good old "garbage in, garbage out" (GIGO) phrase. In-memory is a great feature, but not all data needs to go there! Use it intelligently and don't use it as a garbage dump, because for that it is too expensive.

Patrick Teunissen is the Engineering Director at Teradata responsible for the Research & Development of the Teradata Analytics for SAP® Solutions at Teradata Labs in the Netherlands. He is the founder of NewFrontiers which was acquired by Teradata in May 2013.

Endnotes:
1 Needless to say I am referring to SAP’s HANA database developments.

2 Data that is older than 2 years can be classified as old. Transactions such as sales and costs are often compared with a budget/plan and the previous year, sometimes with the year before that, but hardly ever with data older than that.

MongoDB and Teradata QueryGrid – Even Better Together

Posted on: June 19th, 2014 by Dan Graham 3 Comments

 

It wasn't so long ago that NoSQL products were considered competitors to relational databases (RDBMS). Well, for some workloads they still are. But Teradata is an analytic RDBMS, which is quite different from and complementary to MongoDB. Hence, we are teaming up for the benefit of mutual customers.

The collaboration of MongoDB with Teradata represents a virtuous cycle, a symbiotic exchange of value. This virtuous cycle starts when data is exported from MongoDB to Teradata’s Data Warehouse where it is analyzed and enriched, then sent back to MongoDB to be exploited further. Let me give an example.

An eCommerce retailer builds a website to sell clothing, toys, etc. They use MongoDB because of the flexibility to manage constantly changing web pages, product offers, and marketing campaigns. This front office application exports JSON data to the back-office data warehouse throughout the business day. Automated processes analyze the data and enrich it, calculating next best offers, buyer propensities, consumer profitability scores, inventory depletions, dynamic discounts, and fraud detection. Managers and data scientists also sift through sales results looking for trends and opportunities using dashboards, predictive analytics, visualization, and OLAP. Throughout the day, the data warehouse sends analysis results back to MongoDB where they are used to enhance the visitor experience and improve sales. Then we do it again. It’s a cycle with positive benefits for the front and back office.

Teradata Data Warehouses have been used in this scenario many times with telecommunications companies, banks, retailers, and others. But several things are different when working with MongoDB in this scenario. First, MongoDB uses JSON data. This is crucial for frequently changing data formats where new fields are added on a daily basis. Historically, RDBMSs did not support semi-structured JSON data. Furthermore, the process of changing a database schema to support frequently changing JSON formats took weeks to get through governance committees.

Nowadays, the Teradata Data Warehouse ingests native JSON and accesses it through simple SQL commands. Furthermore, once a field in a table is defined as JSON, the frequently changing JSON structures flow right into the data warehouse without spending weeks in governance committees. Cool! This is a necessary big step forward for the data warehouse. Teradata Data Warehouses can ingest and analyze JSON data easily using any BI tool or ETL tool our customers prefer.

Another difference is that MongoDB is a scale-out system, growing to tens or hundreds of server nodes in a cluster. Hmmm. Teradata systems are also scale-out systems. So how would you exchange data between Teradata Data Warehouse server nodes and MongoDB server nodes? The simple answer is to export JSON to flat files and import them into the other system. Mutual customers are already doing this. Can we do better than import/export? Can we add an interactive, dynamic data exchange? Yes, and this is the near-term goal of our partnership: connecting Teradata QueryGrid to MongoDB clusters.
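Until the QueryGrid connection is available, the flat-file exchange described above can be scripted. The sketch below uses the standard pymongo driver to dump documents as newline-delimited JSON for a warehouse load, and to write warehouse-computed offers back into MongoDB. The connection string, database, collection, field names, and file formats are all placeholders, and the warehouse-side load step itself is omitted.

```python
import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
orders = client["retail"]["orders"]                # placeholder db/collection names

# 1) Export today's order documents as newline-delimited JSON for the warehouse load.
with open("orders_export.json", "w") as out:
    for doc in orders.find({"order_date": "2014-06-19"}, {"_id": 0}):
        out.write(json.dumps(doc) + "\n")

# 2) After the warehouse computes next-best-offer scores, push them back to MongoDB
#    so the website can use them on the next visit (scores file format assumed).
with open("scores_from_warehouse.json") as scores:
    for line in scores:
        score = json.loads(line)
        orders.update_one({"customer_id": score["customer_id"]},
                          {"$set": {"next_best_offer": score["offer"]}})
```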

Teradata QueryGrid and MongoDB

Teradata QueryGrid is a capability in the data warehouse that allows a business user to issue requests via popular business intelligence tools such as SAS®, Tableau®, or MicroStrategy®. The user issues a query which runs inside the Teradata Data Warehouse. This query reaches across the network to the MongoDB cluster. JSON data is brought back, joined to relational tables, sorted, summarized, analyzed, and displayed to the business user. All of this is done exceptionally fast and completely invisible to the business user. It’s easy! We like easy.

QueryGrid can also be bi-directional, putting the results of an analysis back into the MongoDB server nodes. The two companies are working on hooking up Teradata QueryGrid right now and we expect to have the solution early in 2015.

The business benefit of connecting Teradata QueryGrid to MongoDB is that data can be exchanged in near real time. That is, a business user can run a query that exchanges data with MongoDB in seconds (or a few minutes if the data volume is huge). This means new promotions and pricing can be deployed from the data warehouse to MongoDB with a few mouse clicks. It means Marketing people can analyze consumer behavior on the retail website throughout the day, making adjustments to increase sales minutes later. And of course, applications with mobile phones, sensors, banking, telecommunications, healthcare and others will get value from this partnership too.

So why does the leading NoSQL vendor partner with the best in class analytic RDBMS? Because they are highly complementary solutions that together provide a virtuous cycle of value to each other. MongoDB and Teradata are already working together well in some sites. And soon we will do even better.

Come visit our Booth at MongoDB World and attend the session “The Top 5 Things to Know About Integrating MongoDB into Your Data Warehouse” Riverside Suite, 3:10 p.m., June 24. You can read more about the partnership between Teradata and MongoDB in this news release issued earlier today. Also, check out the MongoDB blog.

PS: The MongoDB people have been outstanding to work with on all levels. Kudos to Edouard, Max, Sandeep, Rebecca, and others. Great people!

 

It happens every few years and it’s happening again. A new technology comes along and a significant segment of the IT and business community want to toss out everything we’ve learned over the past 60 years and start fresh. We “discover” that we’ve been wasting time applying unnecessary rigor and bureaucracy to our projects. No longer should we have to wait three to six months or longer to deliver technical solutions to the business. We can turn these things around in three to six days or even less.

In the mid-1990s, I was part of a team that developed a "pilot" object-oriented, client-server (remember when these were the hot buzzwords?) application to replenish raw materials for a manufacturing function. We were upending the traditional mainframe world by delivering a solution quickly and iteratively with a small team. When the end users started using the application in real life, it was clear they were going to rely on it to do their jobs every day. Wait, was this a pilot or…? I would come into work in the morning, walk into a special room that housed the application and database servers, check the logs, note any errors, make whatever fixes needed to be made, re-run jobs, and so on.

It wasn’t long before this work began to interfere with my next project, and the end users became frustrated when I wasn’t available to fix problems quickly. It took us a while and several conversations with operations to determine that “production” didn’t just mean “the mainframe”. “Production” meant that people were relying on the solution on a regular basis to do their jobs. So we backtracked and started talking about what kind of availability guarantees we could make, how backup and recovery should work, how we could transition monitoring and maintenance to operations, and so on. In other words, we realized what we needed was a traditional IT project that just happened to leverage newer technologies.

This same scenario is happening today with Hadoop and related tools. When I visit client organizations, a frightening number will have at least one serious person saying something like, “I really don’t think ‘data warehousing’ makes sense any more. It takes too long. We should put all our data in Hadoop and let our end users access whatever they want.” It is indeed a great idea to establish an environment that enables exploration and quick-turnaround analysis against raw data and production data. But to position this approach as a core data and analytics strategy is nothing short of professional malpractice.

The problem is that people are confusing experimentation with IT projects. There is a place for both, and there always has been. Experimentation (or discovery, research, ad-hoc analysis, or whatever term you wish to use) should have lightweight processes and data management practices – it requires prioritization of analysis activity, security and privacy policies and implementation, some understanding of available data, and so on, but it should not be overburdened with the typical rigor required of projects that are building solutions destined for production. Once a prototype is ready to be used on a regular basis for important business functions, that solution should be built through a rigorous IT project leveraging an appropriate – dare I say it – solution development life cycle (SDLC), along with a comprehensive enterprise architecture plan including, yes, a data warehouse that provides integrated, shared, and trusted production data.

An experimental prototype should never be “promoted” to a production environment. That’s what a project is for. Experimentation can be accomplished with Hadoop, relational technology, Microsoft Office, and many other technologies. These same technologies can also be used for production solutions. So, it’s not that “things are done differently and more quickly in Hadoop”. Instead, it’s more appropriate to say that experimentation is different than an IT project, regardless of technology.

Yes, we should do everything we can to reduce unnecessary paperwork and to speed up delivery using proper objective setting, scoping, and agile development techniques. But that is different than abandoning rigor altogether. In fact, using newer technologies in IT projects requires more attention to detail, not less, because we have to take the maturity of the technology into consideration. Can it meet the service level needs of a particular solution? This needs to be asked and examined formally within the project.

Attempting to build production solutions using ad-hoc, experimental data preparation and analysis techniques is like building a modern skyscraper with a grass hut mentality. It just doesn’t make any sense.

Guest Blogger Kevin Lewis is responsible for Teradata’s Strategy and Governance practice. Prior to joining Teradata in 2007, he was responsible for initiating and leading enterprise data management at Publix Super Markets. Since joining Teradata, he has advised dozens of clients in all major industries. 

Take a Giant Step with Teradata QueryGrid

Posted on: April 29th, 2014 by Dan Graham No Comments

 

Teradata 15.0 has generated tremendous interest from customers and the press because it enables SQL access to native JSON data. This heralds the end of the belief that data warehouses can't handle unstructured data. But there's an equally momentous innovation in this release called Teradata QueryGrid.

What is Teradata QueryGrid?
In Teradata's Unified Data Architecture (UDA), there are three primary platforms: the data warehouse, the discovery platform, and the data platform. Data flows continuously between these systems; a year or two ago, those flows were extract files moved in batch mode.

Teradata QueryGrid is both a vision and a technology. The vision, simply said, is that a business person connected to the Teradata Database or Aster Database can submit a single SQL query that joins data from a second system for analysis. There's no need to plead with the programmers to extract data and load it into another machine. The business person doesn't have to care where the data is; they can simply combine relational tables in Teradata with tables or flat files found in Hadoop, on demand. Imagine a data scientist working on an Aster discovery problem and needing data from Hadoop. By simply adjusting the queries she is already using, Hadoop data is fetched and combined with tables in the Aster Database. That should be a huge "WOW" all by itself, but let's look further.

You might be saying “That’s not new. We’ve had data virtualization queries for many years.” Teradata QueryGrid is indeed a form of data virtualization. But Teradata QueryGrid doesn’t suffer from the normal limitations of data virtualization such as slow performance, clogged networks, and security concerns.

Today, the vision is translated into reality as connections between the Teradata Database and Hadoop, and between the Aster Database and Hadoop. Teradata QueryGrid also connects the Teradata Data Warehouse to Oracle databases. In the near future, it will extend to all combinations of UDA servers: Teradata to Aster, Aster to Aster, Teradata to Teradata, and so on.

Seven League Boots for SQL
With QueryGrid, you can add a clause in a SQL statement that says "Call up Hadoop, pass Hive a SQL request, receive the Hive results, and join them to the data warehouse tables." Running a single SQL statement spanning Hadoop and Teradata is amazing in itself – a giant step forward. Notice too that all the database security, advanced SQL functions, and system management in the Teradata or Aster system support these queries. The only effort required is for the database administrator to set up a "view" that connects the systems. It's self-service for the business user after that. Score: complexity zero, business users one.

Parallel Performance, Performance, Performance
Historically, data virtualization tools have lacked the ability to move data between systems in parallel. Such tools send a request to a remote database, and the data comes back serially through an Ethernet wire. Teradata QueryGrid is built to connect to remote systems in parallel and exchange data through many network connections simultaneously. Wanna move a terabyte per minute? With the right configurations it can be done. Parallel processing by both systems makes this incredibly fast. I know of no data virtualization system that does this today.

Inevitably, the Hadoop cluster will have a different number of servers than the Teradata or Aster MPP system. The Teradata and Aster systems start the parallel data exchange by matching up units of parallelism between the two systems. That is, every Teradata parallel worker (called an AMP) connects to a buddy Hadoop worker node for maximum throughput. Anytime the configuration changes, the worker match-up changes. This is non-trivial, rocket-science-class technology. Trust me – you don't want to do this yourself, and the worst situation would be to expose it to the business users. But Teradata QueryGrid does it all for you, completely invisible to the user.
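Purely as a conceptual illustration (this is not the actual QueryGrid matching logic), the sketch below pairs a set of warehouse workers with a smaller Hadoop cluster so that every worker has a buddy node and every node is used.

```python
from itertools import cycle

# Conceptual only: pair N warehouse workers (AMPs) with M Hadoop data nodes, N != M.
amps = [f"AMP-{i}" for i in range(8)]               # e.g. 8 parallel workers
hadoop_nodes = [f"hdp-node-{j}" for j in range(3)]  # e.g. a 3-node Hadoop cluster

# Round-robin assignment: every AMP gets a buddy node; nodes are reused as needed.
# If either configuration changes, re-running this produces a new match-up.
pairing = dict(zip(amps, cycle(hadoop_nodes)))
for amp, node in pairing.items():
    print(f"{amp} exchanges data with {node}")
```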

Put Data in the Data Lake FAST
Imagine complex predictive analytics using R® or SAS® run inside the Teradata data warehouse as part of a merger and acquisition project. In this case, we want to pass the results to the Hadoop data lake, where they are combined with temporary data from the company being acquired. With moderately simple SQL stuffed in a database view, the answers calculated by the Teradata Database can be sent to Hadoop to help finish up some reports. Bi-directional data exchange is another breakthrough in Teradata QueryGrid, new in release 15.0. The common thread in all these innovations is that the data moves from the memory of one system to the memory of the other. No extracts, no landing the data on disk until the final processing step – and sometimes not even then.

Push-down Processing
What we don’t want to do is transfer terabytes of data from Hadoop and throw away 90% of it since it’s not relevant. To minimize data movement, Teradata QueryGrid sends the remote system SQL filters that eliminate records and columns that aren’t needed. An example constraint could be “We only want records for single women age 30-40 with an average account balance over $5000. Oh, and only send us the account number, account type, and address.” This way, the Hadoop system discards unnecessary data so it doesn’t flood the network with data that will be thrown away. After all the processing is done in Hadoop, data is joined in the data warehouse, summarized, and delivered to the user’s favorite business intelligence tool.
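The sketch below mimics that push-down idea in plain Python; it is not QueryGrid code, just an illustration that applying the predicate and the column list on the remote side means only slim, qualifying records cross the network. The account fields and values echo the example constraint above and are otherwise invented.

```python
# Conceptual sketch of predicate and projection push-down; data and fields invented.
remote_rows = [
    {"acct_no": 1, "acct_type": "savings",  "address": "10 Elm St", "status": "single",
     "gender": "F", "age": 34, "avg_balance": 8200.0, "notes": "large unused text"},
    {"acct_no": 2, "acct_type": "checking", "address": "22 Oak Ave", "status": "married",
     "gender": "F", "age": 29, "avg_balance": 9100.0, "notes": "large unused text"},
]

def remote_scan(rows, predicate, columns):
    """Runs on the remote system: discard non-matching rows and unwanted columns
    before anything is sent over the network."""
    for row in rows:
        if predicate(row):
            yield {col: row[col] for col in columns}

wanted = remote_scan(
    remote_rows,
    predicate=lambda r: (r["status"] == "single" and r["gender"] == "F"
                         and 30 <= r["age"] <= 40 and r["avg_balance"] > 5000),
    columns=["acct_no", "acct_type", "address"],
)
print(list(wanted))  # only one slim record travels back to the warehouse
```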

Teradata QueryGrid delivers some important benefits:
• It’s easy to use: any user with any BI tool can do it
• Low DBA labor: it’s mostly setting up views and testing them once
• High performance: reducing hours to minutes means more accuracy and faster turnaround for demanding users
• Cross-system data on demand: don't get stuck in the programmers' work queue
• Teradata/Aster strengths: security, workload management, system management
• Minimum data movement improves performance and reduces network use
• Move the processing to the data

Big data is now taking giant steps through your analytic architecture --frictionless, invisible, and in parallel. Nice boots!

Change and “Ah-Ha Moments”

Posted on: March 31st, 2014 by Ray Wilson No Comments

 

This is the first in a series of articles discussing the inherent nature of change and some useful suggestions for helping operationalize those “ah-ha moments."

Nobody has ever said that change is easy. It is a journey full of obstacles. But those obstacles are not insurmountable, and with the right planning and communication many of them can be cleared away, leaving a more defined path for change to follow.

So why do we so often see failures that could have been avoided if obvious changes had been made before the problem occurred? The data was analyzed, and yet nobody acted on the insights. Why does the organization fail to do what I call operationalizing the ah-ha moment? Was it a conscious decision?

From the outside looking in, it is easy to criticize organizations for not implementing obvious changes. But from the inside, there are many issues that cripple change efforts, and it usually boils down to time, people, process, technology, or financial challenges.

Companies make significant investments in business intelligence capabilities because they realize that hidden within the vast amounts of information they generate daily are jewels that can provide valuable insights for the entire organization. For example, with today's analytic platforms, business analysts in the marketing department have access to sophisticated tools that can mine information and uncover the reasons for the high rate of churn in their customer base. They might do this by analyzing all interactions and conversations taking place across the enterprise and across the channels where customers engage the company. Using this data, analysts then begin to see various paths and patterns emerging from these interactions that ultimately lead to customer churn.

These analysts have just discovered the leading causes of churn within their organization and are at the apex of the ah-ha moment. They now have the insights to stop the mass exodus of valuable customers and positively impact the bottom line. It would seem obvious that these insights would be acted upon and operationalized immediately, but that may not be the case. Perhaps the recently discovered patterns leading to customer churn touch so many internal systems, processes, and organizations that getting buy-in for the necessary changes becomes mired in an endless series of internal meetings.

So what can be done given these realities? Here’s a quick list of tips that will help you enable change in your organization:

  • Someone needs to own the change and then lead rather than letting change lead him or her.
  • Make sure the reasons for change are well documented including measurable impacts and benefits for the organization.
  • When building a change management plan, identify the obstacles in the organization and make sure to build a mitigation plan for each.
  • Communicate the needed changes through several channels.
  • Be clear when communicating change. Rumors can quickly derail or stall well thought out and planned change efforts.
  • When implementing changes make sure that the change is ready to be implemented and is fully tested.
  • Communicate the impact of the changes that have been deployed.  
  • Have enthusiastic people on the team and train them to be agents of change.
  • Establish credibility by building a proven track record that will give management the confidence that the team has the skills, creativity and discipline to implement these complex changes. 

Once implemented, monitor the changes closely and anticipate that some will require further refinement. Remember that operationalizing the ah-ha moment is a journey, a journey that can bring many valuable and rewarding benefits along the way.

So, what’s your experience with operationalizing your "ah-ha moment"?