Author Archives: Martin Willcox

Talking about real-time analytics? Be clear about what’s on offer

Tuesday January 17th, 2017

The inexorable increase in competition around the globe has led to an explosion of interest in real-time and near real-time systems.

Yet despite all this understandable attention, many businesses still struggle to define what “real-time” actually means.

A merchandiser at a big box retailer, for example, may want a sales dashboard that is updated several times a day, whereas a marketing manager at a mobile telco may want the capability to automatically send offers to customers within seconds of them tripping a geo-fence. Her friend in capital markets trading, meanwhile, may have expectations of “real-time” systems that are measured in microseconds.

Since appropriate solutions to these different problems typically require very different architectures, technologies and implementation patterns, knowing which “real-time” we are dealing with really matters.

Before you start, think about your goals

Real-time systems are usually about detecting an event – and then making a smart decision about how to react to it.  The Observe-Orient-Decide-Act or “OODA loop” gives us a useful model for the decision-making process.  Here are some tips for business leaders on how to minimise confusion when engaging with IT at the start of a real-time project:

  1. Understand how the event that we wish to respond to will be detected. Bear in mind that this can be tough – especially if the “event” we care about is one in which something that should happen does not, or one that represents the conjunction of multiple events from across the business.
  2. Clarify who will be making the decision – man, or machine? Humans have powers of discretion that machines sometimes lack, but are much slower than a silicon-based system, and only able to make decisions one-at-a-time, one-after-another.  If we choose to put a human in the loop, we are normally in “please-update-my-dashboard-faster-and-more-often” territory.
  3. Be clear about decision-latency. Think about how soon after a business event you need to take a decision and then implement it. You also need to understand whether decision-latency and data-latency are the same. Sometimes a good decision can be made now on the basis of older data. But sometimes you need the latest, greatest and most up-to-date information to make the right choices.
  4. Balance decision-sophistication with data-availability. Do you need to use more, potentially older, data to take a good decision, or can you make a “good enough” decision with less data? Think that through.

Can you win at both ends?

Let’s consider what is required if you want to send a customer an offer in near real-time when she is within half-a-mile of a particular store or outlet.  It can be done solely on the basis that she has tripped a geo-fence, which means the only information required is where the customer is right now.

But you will certainly need access to other data if you want to know whether the same offer has been made to her before and how she responded – or which offers customers with similar patterns of behaviour have responded to in the last six months. That additional data is likely to be stored outside the streaming system.

Providing a more sophisticated and personalised offer to this customer will cost the time it takes to fetch and process that data, so “good”, here, may be the enemy of “fast”. We might need to choose between “OK right now” or “great, a little later”.  That trade-off is normally very dependent on use-case, channel and application.
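To make that trade-off concrete, here is a minimal Python sketch of the fork in the road – the latency budget, offer names and `offer_history` store are all illustrative assumptions of mine, not a reference design:

```python
import time

LATENCY_BUDGET_MS = 200  # illustrative decision-latency budget

def handle_geofence_event(customer_id, store_id, offer_history=None):
    """Decide between 'OK right now' and 'great, a little later'."""
    start = time.monotonic()

    # Fast path: all we know is that the customer tripped the geo-fence.
    offer = {"customer": customer_id, "store": store_id, "offer": "GENERIC_10_PCT"}

    # Slower path: enrich the decision with historical data, if it can be
    # fetched within the latency budget. offer_history stands in for whatever
    # system of record holds past offers and responses.
    if offer_history is not None:
        history = offer_history.get(customer_id, [])
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms < LATENCY_BUDGET_MS:
            already_made = {h["offer"] for h in history}
            if "GENERIC_10_PCT" in already_made:
                offer["offer"] = "PERSONALISED_ALTERNATIVE"

    return offer

# Usage: the second call avoids repeating an offer the customer has already seen.
history = {"c42": [{"offer": "GENERIC_10_PCT", "responded": False}]}
print(handle_geofence_event("c42", "store_7"))                          # fast, generic
print(handle_geofence_event("c42", "store_7", offer_history=history))   # slower, personalised
```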

Rigging the game in your favour

Of course, I can try to manipulate the system – by working out beforehand the next-best actions for a variety of different scenarios that I can foresee, instead of retrieving the underlying data and crunching it in response to events that I have just detected. With this kind of preparation, I can at least try to be fast and good.

But then the price I pay is reduced flexibility and increased complexity. And the decision is based on the data from our previous interactions, not the latest data.

All these options come with different costs and benefits and there is no wrong answer – they are all more or less appropriate in different scenarios.  But make sure that you understand your requirements before IT starts evaluating streaming and in-memory technologies for a real-time system.

What Enterprise Information Management Can Learn From Facebook And Wikipedia

Wednesday August 31st, 2016

The business of ensuring that corporate data assets are reliable and re-useable – Enterprise Information Management (EIM) – has never mattered more. Because we are increasingly leveraging more data, more often, to measure and optimise more business processes. And because “garbage in, garbage out” is as true today as it was when it was first coined, way back when.

Unfortunately, traditional EIM methods and processes are either at breaking point in most organisations – or are, in fact, already broken.


Traditional approaches to EIM leverage a set of well-known, well-understood and interconnected methods: Meta-Data Management; Data Quality Management; Data Integration; Master Data Management; and Data Access and Security.

These are woven together through constructs like Data and Architecture Governance Boards, which define policies, principles, rules and standards that enable the design of end-to-end solutions with EIM capabilities built-in, not bolted-on after the fact.

But here’s a dirty little secret. Very few companies have either the discipline or the resources to rigorously apply all of these methods and processes to the data that are already captured in their existing Data Warehouses.

If you assume that your bank is protecting the copy of your Personally Identifiable Information (PII) that it stores in its Data Warehouse with strong encryption, for example, you are likely being generous.

Pushing the Limits

The situation gets worse when you consider that organisations today need to deal not only with data from within, but also with data that originate outside the corporation, like social media data. These types of data typically need to be interpreted in multiple different ways depending on the context of the analysis.

That’s why more-and-more “Logical Data Warehouses” are being deployed to extend and enhance the capability of existing decision support systems so that they can capture and analyse much more data.

But we have to acknowledge that something is going to have to give. Not because EIM doesn’t matter in the brave new world of the Logical Data Warehouse, but rather because the processes and organisational models that aren’t quite good enough today just won’t scale to “100x” volumes and complexities, which is where the “Sentient Enterprise” is going to be tomorrow.

Evolution is Inevitable

All of which means that Information Management and associated models of governance are going to have to evolve – indeed, are already evolving at leading organisations that are adapting to big data.

If we want to know what the future of EIM looks like, we need only look to Wikipedia and Facebook. Because what Wikipedia and Facebook teach us is that social models of content curation and collaboration do scale.

The Future of EIM

Now before the veteran EIM practitioners throw up their hands in horror, I am not suggesting the wholesale, laissez-faire abandonment of data access rules and policies. Or that organisations give up on integrating data that are frequently re-used, shared and compared across different departments. Or that those same organisations stop worrying about the accuracy of the financial metrics that they report to Wall Street.

But what I’m saying is that organisations will increasingly need to crowd-source a lot of their meta-data. They’ll need to know when to live with “good enough” quality for some, less critical data. They will need to galvanise the entire organisation – and indeed partners and suppliers outside it – to the task of figuring out when and how different data can be leveraged for different purposes. And to find ways of making that knowledge not merely available, but easily accessible.

In other words, they will need to build a Corporate Data Catalogue that looks and feels a lot like Wikipedia, but which borrows the “like” and “share” concepts from Facebook.
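Purely as an illustration of the idea – not a design for any particular product – a crowd-curated catalogue entry might look something like the following sketch, in which every field name is an assumption of mine:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    """A Wikipedia-style, crowd-curated description of a data set."""
    name: str
    owner: str
    description: str                                      # editable by anyone, like a wiki page
    quality_notes: list = field(default_factory=list)     # "good enough for X, not for Y"
    tags: list = field(default_factory=list)              # crowd-sourced business meta-data
    likes: int = 0                                         # Facebook-style endorsement of usefulness
    shared_with: list = field(default_factory=list)        # teams that re-use the data

    def like(self):
        self.likes += 1

    def share(self, team):
        if team not in self.shared_with:
            self.shared_with.append(team)

entry = CatalogueEntry(
    name="web_clickstream_raw",
    owner="Digital Channels",
    description="Raw page-view events; good enough for journey analysis, not for revenue reporting.",
)
entry.tags.append("clickstream")
entry.like()
entry.share("Marketing Analytics")
```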

This post first appeared on Forbes TeradataVoice on 11/03/2016.

Is A Picture Worth A Thousand Words? The Truth About Big Data And Visualisation

Wednesday August 24th, 2016

Data visualisation has always been a vital weapon in the arsenal of an effective analyst, enabling complex data sets to be represented efficiently and complex ideas to be communicated with clarity and brevity. And as data volumes and analytic complexity continue to increase in the era of big data and data science, visualisation has come to be regarded as an even more vital technique – with a vast and growing array of new visualisation technologies and products coming to market.

Whilst preparing for an upcoming presentation on the Art of Analytics recently, I had reason to re-visit Charles Minard’s visualisation depicting Napoleon’s disastrous Russian campaign of 1812. In case you aren’t familiar with this seminal work, it is shown below.


This visualisation has been described as “the best statistical graphic ever drawn”. And by no less an authority than Edward Tufte, author of “The Visual Display of Quantitative Information”, the standard reference on the subject for statisticians, analysts and graphic designers.

There are many reasons why Minard’s work is so revered. One reason is that he manages to represent six types of data – geography, time, temperature (more on this in a moment), the course and direction of the movement of the Grande Armée and the number of troops in the field – in only two dimensions.

A second is the clarity and economy that enables the visualisation to speak for itself with almost no additional annotation or elaboration. We can see clearly and at a glance that the Grande Armée set off from Poland with 422,000 men, but returned with only 10,000 – and this only after the “main force” was re-joined by 6,000 men who had feinted northwards, instead of joining the advance on Moscow.


And yet a third reason is that the visualisation was ground-breaking; though flow diagrams like these are named for Irish Engineer Matthew Sankey, he actually used this approach for the first time very nearly 30 years after the Minard visualisation was published. Today, Sankey diagrams are used to understand a wide variety of business phenomena where sequence is important. For example, we can use them to map how customers interact with websites so that we can learn the “golden path” most likely to lead to a high-value purchase – and equally to understand which customer journeys are likely to lead to the abandonment of purchases before checkout.
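As an aside, the data behind a customer-journey Sankey are simple to assemble: count the transitions between consecutive steps in each journey and use the counts as link weights. A minimal Python sketch, using invented session data:

```python
from collections import Counter

# Each session is the ordered list of pages a (hypothetical) customer visited.
sessions = [
    ["home", "search", "product", "basket", "checkout"],
    ["home", "search", "product", "abandon"],
    ["home", "offers", "product", "basket", "abandon"],
]

# A Sankey diagram is defined by weighted links: (source step, target step) -> count.
links = Counter()
for path in sessions:
    for source, target in zip(path, path[1:]):
        links[(source, target)] += 1

for (source, target), weight in links.most_common():
    print(f"{source} -> {target}: {weight}")

# Feeding these links into any Sankey-capable charting tool reproduces the flow diagram;
# the heaviest links leading to "checkout" trace the "golden path".
```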

But even Minard’s model visualisation is arguably partial. Minard shows us the temperature that the Grande Armée endured during the winter retreat from Moscow – inviting us to conclude that this was a significant reason for the terrible losses incurred as the army fell back, as indeed it was.

However, the Russians themselves regarded the winter of 1812 / 1813 as unexceptional – and the conditions certainly did not stop the Cossack cavalry from harrying Napoleon’s retreating forces at every turn. Napoleon’s army was equipped only for a summer campaign – because Napoleon had believed that he could force the war to a successful conclusion before the winter began. As the explorer Sir Ranulph Fiennes has said, “There is no such thing as bad weather, only inappropriate clothing.”

Exceptional weather also affected the campaign’s advance, with a combination of torrential rain followed by extremely hot conditions killing many men from dysentery and heatstroke. But Minard either cannot find a way to represent this information, or chooses not to. In fact, he gives us few clues as to why the main body of Napoleon’s attacking force was reduced by a third during the first eight weeks of the invasion and before the major battle of the campaign – even though, numerically at least, this loss was greater than that suffered during the retreat the following winter.

Terrible casualties also arose from many other sources – with starvation as a result of the Russian scorched earth policy and inadequate supplies playing key roles. The state of the Lithuanian roads is regarded by historians as a key factor in this latter issue, impassable as they were to Napoleon’s heavy wagon trains both after the summer rains and during the winter. But again, Minard either cannot find a way to represent the critical issue (the tonnage of supplies reaching the front line) or its principal cause (the state of the roads) – or chooses not to.

Minard produced this work 50 years after the events it describes, at a time when many in France yearned for former Imperial glories and certainties. His purpose – at least if the author of his obituary is to be believed – seems to have been to highlight the waste of war and the futility of overweening Imperial ambition. It arguably would not have suited his narrative to articulate that Napoleon’s chances of success might have been greater had the Russia of 1812 been a more modern nation with a more modern transport infrastructure – or had Napoleon’s strategy made due allowance for the fact that it was not.

With the benefit of 20th century hindsight, today we might still conclude that the vastness of the Russian interior and the obduracy of Russian resistance would anyway have doomed a better planned and executed campaign; but that hindsight was not available in 1869, either to Minard – or to the contemporaries he sought to influence.

Did Minard’s politics affect his choice of which data to include? Or were the other data simply not available to him? Or beyond his ability to represent in a single figure? From our vantage point 150 years after the fact, it is difficult to answer these questions with certainty.

But when you are looking at a data visualisation, you certainly should attempt to understand the author’s agenda, preconceptions and bias. What is it that the author wants you to see in the data? Which data have been included? Which omitted? And why? Precisely because good data visualisations are so powerful, you should make sure that you can answer these questions before you make a decision based on a data visualisation. Because whilst a good data visualisation is worth a thousand words, it does not automatically follow that it tells the whole truth.

This post first appeared on Forbes TeradataVoice on 31/03/2016.

How Soon Is Now? What Real-Time Analytics Mean For Your Business

Wednesday August 10th, 2016

As customer attention spans get ever shorter, and marketplaces ever more crowded and competitive, real-time and near real-time systems are hot button issues for businesses.

Unfortunately, the question of what actually constitutes “real-time” is rather more vexing than it first appears.

When a merchandiser at a big box retailer talks about “real-time analytics”, for example, he may actually want a sales dashboard that is updated several times a day.


But when a marketing manager at a mobile telco talks about real-time analytics, she may want the capability to automatically send offers to customers within seconds of them tripping a geo-fence.

And her friend in capital markets trading may have expectations of “real-time” systems that are measured in microseconds.

Since appropriate solutions to these different problems typically require very different architectures, technologies and implementation patterns, knowing which “real-time” we are dealing with really matters.

Before You Get Started, Pause to Consider

Real-time systems are often about detecting an event – and then making a smart decision about how to react to it. The Observe-Orient-Decide-Act or “OODA loop” gives us a useful way to model the decision-making process. So what can a business leader do to minimise confusion when engaging with I.T. at the start of a real-time project?

  1. Understand how we will detect the event that we wish to respond to. Sometimes this is trivial. Other times, rather tougher – especially if the “event” we care about is one in which something that should happen does not, or one that represents the conjunction of multiple events from across the business.
  2. Clarify who will be making the decision – man, or machine? The Mark 1 eyeball has powers of discretion that machines sometimes lack. But its carbon-based owner is not only much slower than a silicon-based system, but is only able to make decisions one-at-a-time, one-after-another. If we choose to put a human in the loop, we are normally in “please-update-my-dashboard-faster-and-more-often” territory.
  3. Being clear about decision latency is also important – how soon after a business event do we need to take a decision? And implement it? We will also need to understand whether decision latency and data latency are the same. Sometimes I can make a good decision now on the basis of older data. But sometimes I need the latest, greatest and most up-to-date information to make the right choices.
  4. Balance the often competing requirements of decision sophistication and data availability. Do we need to leverage more – and potentially older – data to take a good decision? Or can we make a “good enough” decision with less data?

Can You Have Your Cake and Eat it?

Consider this – I want to send you an offer in near real-time when you are within half-a-mile of a particular store or outlet. I can do so solely on the basis of the fact that you have tripped a geo-fence – which means that the only information I need is your location – where you are right now.

But what if I want first to understand whether I have made the same offer to you before, how you did or didn’t respond, which offers other customers who have previously exhibited similar behaviours to yours have or haven’t responded to in the last six months, and so on? Then I also need to access other data in addition to your current location – data that may be stored elsewhere, outwith the streaming system.

In this case, the cost of choosing to give you a more sophisticated and personalised offer is the time it takes to fetch and process that data, so “good”, here, may be the enemy of “fast”. We might need to choose between “OK right now” and “great, a little later”. That trade-off is normally very dependent on use-case, channel and application.

Playing the Game

Of course, I can try and game the system – by pre-computing next-best actions for a variety of different scenarios. This way, I can try to be fast and good, by merely fetching the result of a complex calculation made with lots of data in response to an event that I have just detected, instead of actually getting the underlying data and running the numbers.

But then the price I pay is reduced flexibility and increased complexity. And by definition, decision latency and data latency are different where we “cheat” like this, because I’m making the decision based on the data from our previous interactions, not the latest data.
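A minimal sketch of that pattern in Python – a slow, off-line job pre-computes next-best actions for the scenarios I can foresee, and the real-time path collapses to a lookup. The scenario names and scoring logic are placeholders of mine:

```python
# Batch side (run overnight): score every foreseeable scenario per customer
# using as much historical data as we like -- slow, but off the critical path.
def precompute_next_best_actions(customers, scenarios, score):
    nba = {}
    for customer in customers:
        for scenario in scenarios:
            nba[(customer, scenario)] = score(customer, scenario)
    return nba

# Real-time side: when the event arrives, we only do a lookup -- fast, but the
# answer reflects yesterday's data, so decision latency and data latency diverge.
def decide(nba, customer, scenario, default="NO_OFFER"):
    return nba.get((customer, scenario), default)

# Illustrative usage with a toy scoring function.
def toy_score(customer, scenario):
    return f"OFFER_FOR_{scenario.upper()}"

nba_cache = precompute_next_best_actions(
    customers=["c42"], scenarios=["near_store", "dropped_basket"], score=toy_score
)
print(decide(nba_cache, "c42", "near_store"))   # OFFER_FOR_NEAR_STORE
print(decide(nba_cache, "c42", "unforeseen"))   # NO_OFFER -- the flexibility we gave up
```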

There are different costs and benefits associated with all these options. There is no wrong answer – they are all more or less appropriate in different scenarios. But make sure that you understand your requirements before IT starts evaluating streaming and in-memory technologies.

This post first appeared on Forbes TeradataVoice on 16/03/2016.

The Real Reason Why Google Flu Trends Got Big Data Analytics So Wrong

Wednesday June 8th, 2016

Unless you have just returned to Earth after a short break on Mars, you will have noted that some of the shine has come off the big data bandwagon lately.

Two academic papers that may have escaped your attention can help us to understand why – but also demonstrate that the naysayers are as misguided in their cynicism as the zealots are in their naïvety. Google Flu Trends (GFT) was once held up as the prototypical example of the power of big data.

By leveraging search term data – apparently worthless “data exhaust” – a group of Data Scientists with little relevant expertise were able to predict the spread of flu across the continental United States.  In near real-time. At a marginal cost. And more accurately than the “experts” at the Centers for Disease Control with their models built from expensive survey data, available only after the fact.


Except that they weren’t.

We now know that GFT systematically over-estimated cases – and was likely predicting winter, not flu.  The first paper attempts to be even-handed and magnanimous in its analysis of what went wrong – and even succeeds, for the most part – but the label that the authors give to one of the mistakes made by the Google team (“Big Data Hubris”) rather gives the game away.

If revenge is a dish best served cold, then perhaps the statisticians and social scientists can be forgiven their moment of schadenfreude at the expense of the geeks who dared to try and steal their collective lunch. Revenge aside, this matters. Because it goes to the heart of a debate about how we should go about the business of extracting insight and understanding from big data.

Traditional approaches to analytics – what you might call the “correlation is not causality” school – have emphasised the importance of rigorous statistical method and understanding of the problem space.

By contrast, some of what we might characterise as the “unreasonable effectiveness of data” crowd have gone so far as to claim that understanding is over-rated – and that with a big enough bucket of data, there is no question that they can’t answer, even if it is only “what” that is known, not “why”.

All of which is what makes Lynn Wu and Erik Brynjolfsson’s 2013 revision of a paper they first wrote in 2009 so important.  Wu and Brynjolfsson also set themselves the task of leveraging search term data – this time to predict U.S. house prices – but instead of discarding the pre-existing transaction data, they used the data exhaust to create new features to enhance an existing model.

This is big data as extend-and-enhance, not rip-and-replace.  And it works – Wu and Brynjolfsson succeeded in building a predictive model for real estate pricing that out-performed the experts of the National Association of Realtors by a wide margin.

All of which might sound interesting, but also a little worthy and academic.  What can we learn from all of this about the business of extracting insight and understanding from data in business?

Plenty.  If you are a bank that wants to build a propensity-to-buy model to understand which products and services to offer to digital natives, then leverage clickstream data.  But use it to extend a traditional recency / frequency / spend / demography-based model, not to replace it.
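As a hedged illustration of “extend, not replace” – the column names and toy data below are mine, and a real pipeline would be rather more involved – clickstream-derived features are simply joined onto the existing RFM table before the propensity model is fitted:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Existing, "traditional" features: recency / frequency / spend.
rfm = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "recency_days": [5, 40, 12, 90],
    "frequency": [12, 2, 7, 1],
    "spend": [900.0, 120.0, 450.0, 60.0],
    "bought_product": [1, 0, 1, 0],          # target: responded to the offer
})

# New clickstream-derived features -- the "data exhaust".
clicks = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "product_page_views_30d": [14, 1, 6, 0],
    "searches_30d": [9, 0, 4, 1],
})

# Extend, don't replace: the clickstream columns are appended to the RFM model.
features = rfm.merge(clicks, on="customer_id", how="left").fillna(0)
X = features.drop(columns=["customer_id", "bought_product"])
y = features["bought_product"]

model = LogisticRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_[0].round(2))))
```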

If you are an equipment maker seeking to predict device failure using “Internet of Things” sensor data that describe current operating conditions and are streamed in near real-time, you can bet that a model that also accounts for equipment maintenance and manufacture data will out-perform one that does not.

And if you are leading a big data initiative, you should prioritise integrating any new technologies that you deploy to build a Data Lake with your existing Data Warehouse, so that you can connect your “transaction” data with your “interaction” data.

Because if we are not to make the same mistakes as Google Flu Trends, then we need to face up to the fact that big data is about “both and”, not “either / or”.

This blog first appeared on Forbes Teradata Voice on 03/04/2016.

What is a “Data Lake” Anyway?

Monday February 23rd, 2015

One of the consequences of the hype and exaggeration that surrounds Big Data is that the labels and definitions that we use to describe the field quickly become overloaded. One of the Big Data concepts that presently we risk over-loading to the point of complete abstraction is the “Data Lake”.

Data Lake discussions are everywhere right now; to read some of these commentaries, the Data Lake is almost the prototypical use-case for the Hadoop technology stack. But there are far fewer actual, reference-able Data Lake implementations than there are Hadoop deployments – and even less documented best-practice that will tell you how you might actually go about building one.

So if the Data Lake is more architectural concept than physical reality in most organisations today, now seems like a good time to ask: What is a Data Lake anyway? What do we want it to be? And what do we want it not to be?

When you cut through the hype, most proponents of the Data Lake concept are promoting three big ideas:

1) That it should capture all data in a centralised, Hadoop-based repository (whatever “all” means)

2) That it should store those data in a raw, un-modelled format

3) And that doing so will enable you to break down the barriers that still inhibit end-to-end, cross-functional Analytics in too many organisations

Now those are lofty and worthwhile ambitions, but at this point many of you could be forgiven a certain sense of déjà vu – because improving data accessibility and integration are what many of you thought you were building the Data Warehouse for.

In fact, many production Hadoop applications are built according to an application-specific design pattern (in technical jargon, a “star schema”), rather than an application-neutral one that allows multiple applications to be brought to a single copy of the data. And whilst there is a legitimate place in most organisations for at least some application-specific data stores, far from breaking down barriers to Enterprise-wide Analytics, many of these solutions risk creating a new generation of data silos.

A few short years after starting their Hadoop journey, a leading Teradata customer has already deployed more than twenty sizeable application-specific Hadoop clusters. That is not a sustainable trajectory – and we’ve seen this movie before. In the 90s and the 00s, many organisations deployed multiple data mart solutions that ultimately had to be consolidated into data warehouses. These consolidation projects cut costs and added value, but they also sucked up resources – human, financial and organisational – which delayed the delivery of net new Analytic applications. That same scenario is likely to play out for the organisations deploying tens of Hadoop clusters today.

We can do better than this – and we should be much more ambitious for the Hadoop technology stack, too.

Put simply, we need to decide whether we are building data lakes to try and deliver existing functionality more cost effectively or whether we are building them to deliver net new analytics and insights for our organisations. While, of course, we should always try to optimise the cost of processing information in our organisations, the bigger prize is to leapfrog the competition.

I recently attended a big data conference where a leading European bank was discussing the applications it had built on its Hadoop-based data lake. Whilst some of these applications were clearly interesting and adding value to the organisation, I was left with the clear impression that they could easily have been delivered from infrastructure and solutions that the bank had already deployed.

The kind of advanced analytics that Hadoop and related technologies make possible are already enabling some leading banks to address some of the well-publicised difficulties that large European banks find themselves faced with. Text analytics can help the bank understand which customers are complaining and what they are complaining about. Graph analytics can pinpoint fraudulent trading patterns and collusion between rogue traders. Path analytics can highlight whether employees are correctly complying with regulatory processes. So I have to conclude that this organisation’s use of the technology to re-invent the wheel was a wasted opportunity.

The Data Lake isn’t quite yet the prototypical use-case for Hadoop that some of the hype would have you believe. But it will be. Application-specific star schema-based data stores be damned; this is what Google and Doug Cutting gave us Hadoop for.

This post first appeared on TeradataVoice on Forbes on 11 Dec 2014.

Big Data: not unprecedented but not bunk either – part IV

Thursday October 30th, 2014

In the course of the Big Data blog series, I have tried to identify what Big Data really means, the challenges that organisations which have been successful in exploiting it have overcome – and how the consequences of these challenges are re-shaping Enterprise Analytic Architectures. Now I want to take a look at two of the key questions that we at Teradata – and indeed the Industry at large – will have to address in the coming months and years as distributed architectures become the norm.

The rise of the “Logical Data Warehouse” architectural pattern is a direct consequence of the five key Big Data challenges that I discussed in part 3 of this series of blogs. Precisely because there is no single Information Management strategy – never mind a single Information Technology – that addresses all five challenges equally well, it is increasingly clear that the future of Enterprise Analytical Architecture is plural and that organisations will need to deploy and integrate multiple Analytic platforms.

The first key question, then, that the Industry has to answer is: what types of platforms – and how many of them?

Actually, whilst that formulation is seductively simple, it’s also flawed. So much of the Big Data conversation is driven by technology right now that the whole industry defaults to talking about platforms when we should really discuss capabilities. Good Enterprise Architecture, after all, is always, always, always business requirements driven. To put the question of platforms before the discussion of capabilities is to get things the wrong way around.

So let’s re-cast that first question: how many and which type of Analytic capabilities?

At Teradata, we observe that the leading companies we work with that have been most successful in exploiting Big Data increasingly use manufacturing analogies to discuss how they manage information.

In manufacturing, raw materials are acquired and are subsequently transformed into a finished product by a well-defined manufacturing process and according to a design that has generally been arrived at through a rather less well-defined and iterative Research and Development (R&D) process.

Listen carefully to a presentation by a representative of a data-driven industry leader – the likes, for example, of Apple, eBay, Facebook, Google, Netflix or Spotify – and time-and-again you will hear them talk about three key capabilities: the acquisition of raw data from inside and outside the company; research or “exploration” that allows these data to be understood so that they can be exploited; and the transformation of the raw data into a product that business users can understand and interact with to improve business process. When you get all done, the companies that compete on Analytics focus on doing three things well: data acquisition; data R&D; and data manufacturing. Conceptually at least, 21st century organisations need three Analytic capabilities to address the five challenges that we discussed in part 3 of this blog, as represented in the “Unified Data Architecture” model reproduced below.

Figure: the Unified Data Architecture model

It is important to note at this point that it doesn’t necessarily follow that a particular organisation should automatically deploy three (and no more) Analytical platforms to support these three capabilities. The “staging layers” and “data labs” in many pre-existing Data Warehouses (a.k.a.: Data Manufacturing), for example, are conceptually similar to the “data platform” (a.k.a.: Data Acquisition) and “exploration and discovery platform” (a.k.a.: Data R&D) in the Unified Data Architecture model shown above – and plenty of organisations will find that they can provide one or more of the three capabilities via some sort of virtualised solution. And plenty more will be driven to deploy multiple platforms where conceptually one would do, by, for example, political concerns or regulatory and compliance issues that place restrictions on where sensitive data can be stored or processed. As is always the case, mapping a conceptual architecture to a target physical architecture requires a detailed understanding of functional and non-functional requirements and also of constraints. A detailed discussion of that process is not only beyond the scope of this blog – it also hasn’t changed very much in the last several years, so that we can safely de-couple it from the broader questions about what is new and different about Big Data. Functional and non-functional requirements continue to evolve very rapidly – and some of the constraints that have traditionally limited how much data we can store, for how long and what we can do with it have been eliminated or mitigated by some of the new Big Data technologies. But the guiding principles of Enterprise Architecture are more than flexible enough to accommodate these changes.

So much for the first key question; what of the second? Alas, “the second question” is also a seductive over-simplification – because rather than answer a single second question, the Industry actually needs to answer four related questions.

Deploying multiple Analytical platforms is easy. Too easy, in fact – anticipate a raft of Big Data repository consolidation projects during the next decade in exactly the same way that stovepipe Data Mart consolidation projects have characterized the last two decades. It is the integration of those multiple Analytical platforms that is the tricky part. Wherever we deploy multiple Analytical systems, we need to ask ourselves:

a) How will multiple, overlapping and redundant data sets be synchronised across the multiple platforms? For example, if I want to store 2 years’ history of Call Detail Records (CDRs) on the Data Warehouse and 10 years’ history of the same data on a lower unit-cost online archive technology, how do I ensure that the overlapping data remain consistent with one another?

b) How do I provide transparent access to data for end-users? For example and in the same scenario: if a user has a query that needs to access 5 years of CDR history, how do I ensure that the query is either routed or federated correctly so that the right answer is returned – without the user having to understand either the underlying structure or distribution of the data? (A simple sketch of this routing decision follows these four questions.)

c) How do I manage end-to-end lineage and meta-data? To return to the manufacturing analogy: if I want to sell a safety critical component – the shielding vessel of a nuclear reactor, for example – I need to be able to demonstrate that I understand both the provenance and quality of the raw material from which it was constructed and how it was handled at every stage of the manufacturing process. Not all of the data that we manage are “mission-critical”; but many are – and many more are effectively worthless if we don’t have at least a basic understanding of where they came from, what they represent and how they should be interpreted. Governance and meta-data – already the neglected “ugly sisters” of Information Management – are even more challenging in a distributed systems environment.

d) How do I manage the multiple physical platforms as if they were a single, logical platform? Maximising availability and performance of distributed systems requires that we understand the dependencies between the multiple moving parts of the end-to-end solution. And common integrated administration and management tools are necessary to minimize the cost of IT operations if we are going to “square the circle” of deploying multiple Analytical platforms even as IT budgets are flat – or falling.
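To make question (b) a little more concrete, here is a minimal Python sketch of the routing decision for the CDR example; the two-year cut-off and platform names are illustrative assumptions, and in practice this logic would live inside query-routing or federation middleware rather than application code:

```python
from datetime import date, timedelta

WAREHOUSE_HORIZON = timedelta(days=2 * 365)   # warehouse keeps ~2 years of CDRs
TODAY = date(2014, 10, 30)                    # illustrative "now"

def route_cdr_query(query_start, query_end):
    """Decide which platform(s) must serve a CDR query for [query_start, query_end]."""
    warehouse_floor = TODAY - WAREHOUSE_HORIZON
    if query_start >= warehouse_floor:
        return ["warehouse"]                  # entirely within hot history
    if query_end < warehouse_floor:
        return ["archive"]                    # entirely in the cold archive
    return ["warehouse", "archive"]           # federate and stitch the results

# A user asking for 5 years of history never needs to know where the data live.
print(route_cdr_query(date(2009, 10, 30), TODAY))   # ['warehouse', 'archive']
print(route_cdr_query(date(2014, 1, 1), TODAY))     # ['warehouse']
```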

At Teradata, our objective is to lead the industry in this evolution as our more than 2,500 customers adapt to the new realities of the Big Data era. That means continuing to invest in Engineering R&D to ensure that we have the best Data, Exploration and Discovery and Data Warehouse platforms in the Hadoop, Teradata-Aster and Teradata technologies, respectively; witness, for example, the native JSON type that we have added to the Teradata RDBMS and the BSP-based Graph Engine and Analytic Functions that we have added to the Teradata-Aster platform already this year. It means developing and acquiring existing and new middleware and management technologies like Teradata Unity, Teradata QueryGrid, Revelytix and Teradata Viewpoint to address the integration questions discussed in this blog. And it means growing still further our already extensive Professional Services delivery capabilities, so that our customers can concentrate on running their businesses, whilst we provide soup-to-nuts design-build-manage-maintain services for them. Taken together, our objective is to provide support for any Analytic on any data, with virtual computing to provide transparent orchestration services, seamless data synchronization – and simplified systems management and administration.

If our continued leadership of the Gartner Magic Quadrant for Analytic Database Management Systems is any guide, our Unified Data Architecture strategy is working. More importantly, more and more of our customers are now deploying Logical Data Warehouses of their own using our technology. Big Data is neither unprecedented, nor is it bunk; to paraphrase William Gibson “it’s just not very evenly distributed”. By making it easier to deploy and exploit a Unified Data Architecture, Teradata is helping more-and-more of our customers to compete effectively on Analytics; to be Big Data-Driven.

Big Data: not unprecedented but not bunk either – Part III

Tuesday September 9th, 2014

In my last post in this series, I explained the five big challenges that organisations must address in order to work successfully with Big Data. Between them, these five challenges are combining to drive the most significant evolution in Enterprise Analytical Architecture since Devlin, Inmon, Kimball et al. gave the world the Enterprise Data Warehouse. Contrary to some of the more breathless industry hype, thirty years of Information Management best-practice has not been rendered obsolete overnight. But we should increasingly regard the Data Warehouse as necessary, but no longer sufficient by itself.

Where data are re-used we need to minimize the Total Cost of Ownership by amortising the (considerable) acquisition and integration costs over multiple business processes, by bringing multiple Analytical Applications to one copy of the data, rather than the other way around. Where data supports mission-critical business processes, it needs to be accurate, reliable and certified (and one copy is better than two, because a man with one watch knows the time – but a man with two watches is never quite sure). And where we want to optimise end-to-end business processes (rather than merely spin the plates faster in a particular department), we need to integrate data to support cross-functional Analytics. These considerations – in large part the motivation for the original Data Warehouse concept in the first place – are dominant when we seek to operationalise Analytics (the final challenge of the five that I identified in my last post) by sharing actionable insights across the organisation and across functional, organisational and geographical boundaries. Because deploying an Integrated Data Warehouse is still the most rational way to address them, rumours of its demise have been very seriously exaggerated. And because parallel RDBMS platforms are still the only technologies with the proven elastic and multi-dimensional scalability required to support a complex mix of workloads, they are still the only game in town when it comes to bringing multiple Analytical Applications to one copy of the organisation’s (structured) data assets.

Big Data challenges one-through-four, however, increasingly require that we augment the Data Warehouse with new architectural constructs that in many cases are best deployed on new technologies. A “data platform” or “data lake”, for example – built on a technology with a lower unit cost of storage than a data warehousing platform, which is designed and optimised for high-performance sharing of data – can enable organisations to address the economic challenge of capturing large and noisy data sets of unproven value. Distributed filesystem technologies may be a more natural fit for capturing complex, multi-structured data – and “late binding” multiple, different schemas to them – than a Relational Database Management System (RDBMS). And technologies designed from the ground-up to support time series, path and graph Analytics can offer important ease-of-use and performance advantages for the complex analysis of interaction data modelled as a network or a graph.
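As a small illustration of “late binding”, the following Python sketch applies two different interpretations to the same raw machine-log records at read time; the log format and field names are invented for the example:

```python
import json

# Raw, un-modelled records as they might land in a file system -- no schema applied on load.
raw_log = [
    '{"ts": "2014-09-09T10:01:00", "device": "pump-7", "temp_c": 81, "msg": "temp high"}',
    '{"ts": "2014-09-09T10:02:00", "device": "pump-7", "vibration": 0.9}',
    '{"ts": "2014-09-09T10:03:00", "device": "pump-9", "temp_c": 40}',
]

def read_as_alerts(lines):
    """One interpretation: only records that look like temperature alerts."""
    for line in lines:
        record = json.loads(line)
        if record.get("temp_c", 0) > 80:
            yield {"device": record["device"], "when": record["ts"]}

def read_as_sensor_series(lines, field):
    """Another interpretation: a per-device time series for any chosen measurement."""
    for line in lines:
        record = json.loads(line)
        if field in record:
            yield (record["device"], record["ts"], record[field])

print(list(read_as_alerts(raw_log)))
print(list(read_as_sensor_series(raw_log, "vibration")))
```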

Leading analyst firm Gartner has coined the term “Logical Data Warehouse” to describe the evolution from what we might term “monolithic” to more distributed Data Warehouse architectures. Whatever label we apply to this evolution – and at Teradata, we prefer “Unified Data Architecture” – it is clear that the future of enterprise Analytical Architecture is plural. We will increasingly need to deploy and integrate multiple analytic platforms, each optimized to address different combinations of Big Data challenges one-through-five which I outlined in my last post, and are laid out in the figure below.

Figure: the five Big Data challenges

Some Analysts and commentators predict that all of this means trouble for Teradata. Their logic goes something like this: Teradata led the industry when the dominant architectural pattern was the Integrated Data Warehouse; increasingly it won’t be the dominant – or, at any rate, the only – architectural pattern; and so Teradata will no longer be the 500-pound gorilla in the Analytics jungle.

You wouldn’t expect me to agree with that particular assessment. And I don’t, for two reasons.

The first flaw in this argument is that it pre-supposes that the Integrated Data Warehouse architectural pattern is going away. And as we have already discussed, the new technologies and architectures are extending it, not replacing it.

The second flaw in this argument is that it ignores the fact that Teradata is leading the industry’s adaptation to the realities of the three “new waves” of Big Data, with new platform and integration technologies that are enabling leading organisations to actually deploy Logical Data Warehouse architectures – to “walk the walk” whilst our competitors merely “talk the talk”.

Before we get too caught up on technology, however, we should remember that Enterprise Architecture – good Enterprise Architecture, anyway – is conceptual, rather than physical. That being the case, just what does a “Logical Data Warehouse” architecture look like, what are its key components – and how does it address the five challenges that we have already described? I will try and tackle these questions in my next post.

Big Data: not unprecedented; but not bunk, either – Part II

Wednesday July 30th, 2014

In the first post of this series, I tried to describe the Big Data phenomenon and to explain how effectively exploiting the three “new waves” of Big Data has enabled organisations like Amazon, eBay, LinkedIn and Netflix to prosper.  I also made the case that whilst a lot of what we have learned over the course of the last 35 years about Information Management and Analytics is as relevant as ever, Big Data are also associated with some new tests.  In this second instalment, I will try and define these issues.

The organisations that we at Teradata have worked with that have succeeded in moving beyond the analysis of transactions-and-events to interactions-and-observations have all mastered five key challenges, which I discuss in detail below.

(1)     The multi-structured data challenge. The transaction and event data that we have captured, integrated and analysed in traditional Data Warehouses and Business Intelligence applications for the last three decades is largely well-formed, record-oriented – and defined in terms of an explicit schema.  The same is not always true of the new sources of Big Data.  Clickstream, social and machine log data are often characterised by their volatility: the schema or information model that we use to understand them may be implicit rather than explicit; it may be “document-oriented”, meaning that it may (or may not) include some level of hierarchical organisation; it may change continuously; or we may want to apply multiple different interpretations to the data at run-time (“schema on read”), depending on the use-case and application.  Generations of budding young Business Systems Analysts – I was once one of them, although it seems a long time ago now! – were taught that business processes change continuously, but that data and their relationships do not, so model the data.  Many of the new Big Data test this maxim to destruction and make traditional approaches to data integration (which require that we apply a relatively rigid and inflexible “schema on load” to data as it enters the Analytic environment) unproductive.

(2)     The iterative Analytics challenge. Interactions – whether between people and things, people and people, or things and things – describe networks or graphs.  Many – arguably even most – useful analyses of interaction data are characterised by operations in which record order is important.  Time-series, path and graph analytics are all problematical, in varying degrees, for ANSI-standard SQL technologies, as they are based on the relational model and set theory, in which the order of records has no meaning.  The various extensions to ANSI-standard SQL that have been proposed over the years to address these limitations – among them User Defined Functions (UDFs) and Ordered Analytical OLAP functions – are only a partial solution, especially because the requirement to process multi-structured data means that we will not always know when a function is written the precise schema of the data that it will need to process.  The net-net is that these queries are often difficult to express in ANSI standard SQL – and may be computationally expensive to run on platforms optimised for set-based processing, even if we are successful in doing so.

(3)     The noisy data challenge. Some of the new Big Data sets are large and noisy; getting larger quickly; infrequently accessed to support processing which is – at least today – associated with relatively relaxed Service Level Goals; and of unproven value. Organisations faced with capturing large and growing volumes of data in which the useful signal is accompanied by an even larger volume of data that represent extraneous noise to most of the organisation – but that may represent gold dust to a small and select group of Data Scientists – are naturally highly incentivised to look for cost-effective models for storing and processing these data.

(4)     The “there might be a needle in this haystack – but if it takes 12 months and 500,000 EURO to find out, I don’t have the time and money to even go look” challenge.  Many organisations instinctively understand there is value in the new Big Data sets, but aren’t yet sure where to look for it.  The same traditional approaches to Data Integration (model the source systems; develop a new, integrated target data model; map the source models to the target model; develop ETL processes that reliably and accurately capture and transform source system data to the target model) that are often already problematical where the capture of multi-structured data are concerned are doubly problematical in these scenarios, because of the time and cost that they place between Data Scientists and access to the new data.  The costs of acquiring, cleansing, normalising and integrating data have been estimated to represent up to 70% of the total cost of deploying an Analytical database – and Extract-Transform-Load (ETL) project timescales are often measured in months and even years.  Where the data concerned will be re-used widely throughout the organisation (and possibly even beyond it) and will support mission critical business processes, or regulatory reporting, or both, this is a cost that is worth paying – and that is anyway cheaper than the alternatives.   When we want, however, not to reliably ask and answer questions but to explore new data sets to understand if they will enable us to pose new questions that are worth answering, we may need a different approach to acquiring data that provides “good enough” data quality – and that values speed and flexibility over ceremony and repeatability.  In these “exploration and discovery” scenarios, we experiment continuously on data to identify hypotheses worth testing and to identify new sources of data that drive insight.  Since many – even most – of these experiments will fail, productivity and cycle time are critical considerations; if a particular analysis on a new data set is to prove unproductive, we need to establish this early in the process so that we can “fail fast” and move on.  Only for those experiments that succeed – and that will form the basis of Analytics that need to be repeated and shared – will we consider bringing the data concerned through the traditional data integration process.

(5)     The getting past “so what?” and delivering value challenge.  I attend a lot of “Big Data” conferences and events and am constantly amazed at how many vendors – and Analysts who probably should know better – intone from the stage that “the objective of a Big Data project is to gain new insight about the business”.  Life would be simpler and even more beautiful if that were the case – but of course, those vendors and Analysts are only half right, because our objective must be to use that insight to change the business and so drive return on investment (ROI).  As one of my former bosses once memorably put it: “old business process + expensive new technology = expensive, old business process.” Operationalising the insights gleaned from Analytic experiments will often require that we “productionise” the data and the Analytics concerned, so that we can reliably and accurately share new KPIs, measures, events and alerts throughout the business. Important as they increasingly are to any business, Data Scientists don’t run the business – managers, clerks, customer service representatives, logistics supervisors, etc., etc. do – and insight that is not made actionable and shared outside the rarefied atmosphere of a dusty Data Lab somewhere inside the walls of the Corporate Ivory Tower will not enable them to do their jobs any better than they did before.

These five key challenges and their consequences are driving the most far-reaching evolution in Enterprise Analytical Architecture since Devlin, Inmon, Kimball et al. gave the world the Enterprise Data Warehouse.  And that is the subject of next week’s post.

Big Data: not unprecedented; but not bunk, either

Wednesday July 23rd, 2014

Larry Ellison, Oracle’s flamboyant CEO, once remarked that “the computer industry is the only industry that is more fashion-driven than women’s apparel”.  The industry’s current favourite buzzword – “Big Data” – is so hyped that it has crossed over from the technology lexicon and entered the public consciousness via mainstream media.  In the process, it has variously been described as both “unprecedented” and “bunk”.

So is this all just marketing hype, intended to help vendors ship more product?  Or is there something interesting going on here?

To understand why the current Big Data phenomenon is not unprecedented, recall that Retailers, to take just one example, have lived through not one but two step-changes in the amount of information that their operations produce in less than three decades, as first EPoS systems and later RFID technology transformed their ability to analyse, understand and manage their operations.

As a simple example, Teradata shipped the world’s first commercial Massively Parallel Processing (MPP) system with a Terabyte of storage to Kmart in 1986.  By the standards of the day this was an enormous system (it filled an entire truck when shipped) that enabled Kmart to capture sales data at the store / SKU / day level – and to revolutionise the Retail industry in the process.  Today the laptop that I am writing this blog on has a Terabyte of storage – and store / SKU / transaction level data is table-stakes for a modern Retailer trying to compete with Walmart’s demand-driven supply chain and Amazon’s sophisticated customer behavioural segmentation.  Similar analogies can be drawn for the impact of billing systems and modern network switches in telecommunications, branch automation and online banking systems in retail finance etc., etc., etc.

The reality is that we have been living with exponential growth in data volumes since the invention of the modern digital computer, as the inexorable progress of Moore’s law has enabled more and more business processes to be digitized.  And anxiety about how to cope with perceived “information overload” predates even the invention of the modern digital computer.  The eight years that it took hard-pressed human calculators to process the data collected for the 1880 U.S. census was the motivation for the invention of the “Hollerith cards” by Herman Hollerith, founder of the Tabulating Machine Company – which later became International Business Machines (IBM).

Equally I would argue that it is a mistake to dismiss Big Data as “bunk”, because significant forces are currently re-shaping the way organisations think about Information and Analytics. These forces were unleashed, beginning in the late 1990s, by three disruptive technological innovations that have produced seismic shocks in business and society; three new waves of Big Data have been the result.

The first of these shocks was the rise (and rise, and rise) of the World Wide Web, which enabled Internet champions like Amazon, eBay and Google to emerge.  These Internet champions soon began to dominate their respective marketplaces by leveraging low-level “clickstream” data to enable “mass customisation” of their websites, based on sophisticated Analytics that enabled them to understand user preferences and behaviour.  If you were worried that my use of “seismic shock” in the previous paragraph smacked of hyperbole, know that some commentators are already predicting that Amazon – a company that did not exist prior to 1995 – may soon be the largest retailer in the world.

Social Media technologies – amplified and accelerated by the impact of increasingly sophisticated and increasingly ubiquitous mobile technologies – represent the second of these great disruptive forces.  The data they generate as a result are increasingly enabling organisations to understand not just what we do, but where we do it, how we think, and who we share our thoughts with.  LinkedIn’s “people you might know” feature is a classic example of this second wave of Big Data, but in fact even understanding indirect customer interactions can be a huge source of value to B2C organisations – witness the “collaborative filtering” graph Analytics techniques that underpin the increasingly sophisticated recommendation engines that have underpinned much of the success of the next-generation Internet champions, like Netflix.

The “Internet of Things” – networks of interconnected smart devices that are able to communicate with one another and the world around them – is the third disruptive technology-led force to emerge in only the last two decades.  Its ramifications are only now beginning to become apparent.  A consequence of the corollary of Moore’s Law – simple computing devices are now incredibly inexpensive and fast becoming more so – the Internet of Things is leading to the instrumentation of more and more everyday objects and processes. The old saw that “what gets measured gets managed” is increasingly redundant as we enter an era in which rugged, smart, – and above all, cheap – sensors will effectively make it possible to measure anything and everything.

We can crudely characterize the three “new waves” of Big Data that have accompanied these seismic shocks as enabling us to understand, respectively: how people interact with things; how people interact with people; and how complex systems of things interact with one another.  Collectively, the three new waves make it possible for Analytics to evolve from the study of transactions to the study of interactions and observations; where once we collected and integrated data that described transactions and events and then inferred behaviour indirectly, we can increasingly measure and analyse the behaviour – of systems as well as of people – directly. In an era of hyper-competition – itself a product of both globalisation and digitisation – effectively analysing these new sources of data and then taking action on the resulting insight to change the way we do business can provide organisations with an important competitive advantage, as the current enthusiasm for Data Science also testifies.

Contrary to some of the more breathless industry hype, much of what we have learnt about Information Management and Analytics during the last three decades is still relevant – but effectively exploiting the three “new waves” of Big Data also requires that we master some new challenges.  And these are the subject of part 2 of this blog, coming soon.