In the course of the Big Data blog series, I have tried to identify what Big Data really means, the challenges that successful organisations have had to overcome in order to exploit it – and how the consequences of those challenges are re-shaping Enterprise Analytic Architectures. Now I want to take a look at two of the key questions that we at Teradata – and indeed the Industry at large – will have to address in the coming months and years as distributed architectures become the norm.
The rise of the “Logical Data Warehouse” architectural pattern is a direct consequence of the five key Big Data challenges that I discussed in part 3 of this series of blogs. Precisely because there is no single Information Management strategy – never mind a single Information Technology – that addresses all five challenges equally well, it is increasingly clear that the future of Enterprise Analytical Architecture is plural and that organisations will need to deploy and integrate multiple Analytic platforms.
The first key question, then, that the Industry has to answer is: what types of platforms – and how many of them?
Actually, whilst that formulation is seductively simple, it’s also flawed. So much of the Big Data conversation is driven by technology right now that the whole industry defaults to talking about platforms when we should really discuss capabilities. Good Enterprise Architecture, after all, is always, always, always business requirements driven. To put the question of platforms before the discussion of capabilities is to get things the wrong way around.
So let’s re-cast that first question: how many and which type of Analytic capabilities?
At Teradata, we observe that the leading companies that we work with which have been most successful in exploiting Big Data increasingly use manufacturing analogies to discuss how they manage information.
In manufacturing, raw materials are acquired and are subsequently transformed into a finished product by a well-defined manufacturing process and according to a design that has generally been arrived at through a rather less well-defined and iterative Research and Development (R&D) process.
Listen carefully to a presentation by a representative of a data-driven industry leader – the likes, for example, of Apple, eBay, Facebook, Google, Netflix or Spotify – and time and again you will hear them talk about three key capabilities: the acquisition of raw data from inside and outside the company; research or “exploration” that allows these data to be understood so that they can be exploited; and the transformation of the raw data into a product that business users can understand and interact with to improve business processes. When all is said and done, the companies that compete on Analytics focus on doing three things well: data acquisition; data R&D; and data manufacturing. Conceptually at least, 21st century organisations need three Analytic capabilities to address the five challenges that we discussed in part 3 of this blog, as represented in the “Unified Data Architecture” model reproduced below.
It is important to note at this point that it doesn’t necessarily follow that a particular organisation should automatically deploy three (and no more) Analytical platforms to support these three capabilities. The “staging layers” and “data labs” in many pre-existing Data Warehouses (a.k.a. Data Manufacturing), for example, are conceptually similar to the “data platform” (a.k.a. Data Acquisition) and “exploration and discovery platform” (a.k.a. Data R&D) in the Unified Data Architecture model shown above – and plenty of organisations will find that they can provide one or more of the three capabilities via some sort of virtualised solution. And plenty more will be driven to deploy multiple platforms where conceptually one would do, by, for example, political concerns or regulatory and compliance issues that place restrictions on where sensitive data can be stored or processed.

As is always the case, mapping a conceptual architecture to a target physical architecture requires a detailed understanding of functional and non-functional requirements and also of constraints. A detailed discussion of that process is not only beyond the scope of this blog – it also hasn’t changed very much in the last several years, so we can safely de-couple it from the broader questions about what is new and different about Big Data. Functional and non-functional requirements continue to evolve very rapidly – and some of the constraints that have traditionally limited how much data we can store, for how long and what we can do with it have been eliminated or mitigated by some of the new Big Data technologies. But the guiding principles of Enterprise Architecture are more than flexible enough to accommodate these changes.
So much for the first key question; what of the second? Alas, “the second question” is also a seductive over-simplification – because rather than answer a single second question, the Industry actually needs to answer four related questions.
Deploying multiple Analytical platforms is easy. Too easy, in fact – anticipate a raft of Big Data repository consolidation projects during the next decade, in exactly the same way that stovepipe Data Mart consolidation projects have characterised the last two decades. It is the integration of those multiple Analytical platforms that is the tricky part. Wherever we deploy multiple Analytical systems, we need to ask ourselves:
a) How will multiple, overlapping and redundant data sets be synchronised across the multiple platforms? For example, if I want to store two years’ history of Call Detail Records (CDRs) on the Data Warehouse and ten years’ history of the same data on a lower unit-cost online archive technology, how do I ensure that the overlapping data remain consistent with one another?
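One common safeguard for this kind of overlap is to periodically checksum the shared window on both platforms and compare the results. The sketch below illustrates the idea in Python using hypothetical in-memory CDR tuples; the record layout, platform names and checksum scheme are illustrative assumptions, not any real platform API:

```python
import hashlib
from datetime import date

def window_checksum(records, start, end):
    """Checksum the records whose call date falls inside [start, end)."""
    subset = sorted(r for r in records if start <= r[0] < end)
    return hashlib.sha256(repr(subset).encode()).hexdigest()

def windows_consistent(warehouse, archive, start, end):
    """True if both platforms hold identical data for the overlapping window."""
    return (window_checksum(warehouse, start, end)
            == window_checksum(archive, start, end))

# Hypothetical CDRs as (call_date, caller, duration_seconds) tuples.
warehouse = [(date(2014, 1, 1), "+44700900001", 120),
             (date(2014, 1, 2), "+44700900002", 45)]
# The archive holds the same recent data plus much older history.
archive = warehouse + [(date(2006, 5, 1), "+44700900003", 300)]

print(windows_consistent(warehouse, archive,
                         date(2013, 1, 1), date(2015, 1, 1)))  # True
```

In practice the comparison would run per partition (per day or per month) so that a mismatch pinpoints which slice of the overlap has drifted, rather than merely signalling that something, somewhere, is inconsistent.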
b) How do I provide transparent access to data for end-users? For example and in the same scenario: if a user has a query that needs to access 5 years of CDR history, how do I ensure that the query is either routed or federated correctly so that the right answer is returned – without the user having to understand either the underlying structure or distribution of the data?
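The routing half of that question can be reduced to a retention-window decision: if the query’s history fits entirely on the hot platform, route it there; if it reaches further back, federate across both. A minimal sketch, assuming hypothetical two-year and ten-year retention windows and invented platform names:

```python
from datetime import date

# Hypothetical retention windows for the two platforms.
WAREHOUSE_DAYS = 2 * 365    # hot, high-performance platform
ARCHIVE_DAYS = 10 * 365     # cold, low unit-cost online archive

def route_query(history_start, today):
    """Return the platform(s) a query must touch, given how far back it reaches."""
    age_days = (today - history_start).days
    if age_days <= WAREHOUSE_DAYS:
        return ["warehouse"]                 # answerable from hot data alone
    if age_days <= ARCHIVE_DAYS:
        return ["warehouse", "archive"]      # federate across both platforms
    raise ValueError("requested history exceeds all retained data")

# A query reaching back five years must be federated:
print(route_query(date(2009, 6, 1), today=date(2014, 6, 1)))
# ['warehouse', 'archive']
```

The point of the abstraction is that the routing decision lives in one place – middleware, not the user’s head – so the user simply asks for five years of CDRs and gets the right answer.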
c) How do I manage end-to-end lineage and meta-data? To return to the manufacturing analogy: if I want to sell a safety critical component – the shielding vessel of a nuclear reactor, for example – I need to be able to demonstrate that I understand both the provenance and quality of the raw material from which it was constructed and how it was handled at every stage of the manufacturing process. Not all of the data that we manage are “mission-critical”; but many are – and many more are effectively worthless if we don’t have at least a basic understanding of where they came from, what they represent and how they should be interpreted. Governance and meta-data – already the neglected “ugly sisters” of Information Management – are even more challenging in a distributed systems environment.
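Carrying the manufacturing analogy into code: end-to-end lineage amounts to each derived data set carrying forward its parent’s provenance, extended by one step. The sketch below is a deliberately simplified illustration with invented data set and step names, not a real metadata catalogue:

```python
def derive(parent, name, step):
    """Record a new data set whose lineage extends its parent's provenance."""
    return {"name": name,
            "lineage": parent["lineage"] + [(step, name)]}

# Hypothetical pipeline: raw CDRs -> cleansed CDRs -> daily usage summary.
raw = {"name": "raw_cdrs", "lineage": [("acquired", "raw_cdrs")]}
cleansed = derive(raw, "cleansed_cdrs", "deduplicate")
daily = derive(cleansed, "daily_usage", "aggregate")

# The finished "product" can account for every stage of its manufacture:
for step, name in daily["lineage"]:
    print(f"{step:12s} -> {name}")
```

The hard part in a distributed environment is not the data structure – it is ensuring that every platform that touches the data appends to the same lineage record, rather than keeping its own partial view.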
d) How do I manage the multiple physical platforms as if they were a single, logical platform? Maximising availability and performance of distributed systems requires that we understand the dependencies between the multiple moving parts of the end-to-end solution. And common, integrated administration and management tools are necessary to minimise the cost of IT operations if we are going to “square the circle” of deploying multiple Analytical platforms even as IT budgets are flat – or falling.
At Teradata, our objective is to lead the industry in this evolution as our more than 2,500 customers adapt to the new realities of the Big Data era. That means continuing to invest in Engineering R&D to ensure that we have the best Data, Exploration and Discovery, and Data Warehouse platforms in the Hadoop, Teradata-Aster and Teradata technologies, respectively; witness, for example, the native JSON type that we have added to the Teradata RDBMS and the BSP-based Graph Engine and Analytic Functions that we have added to the Teradata-Aster platform already this year. It means developing and acquiring existing and new middleware and management technologies like Teradata Unity, Teradata QueryGrid, Revelytix and Teradata Viewpoint to address the integration questions discussed in this blog. And it means growing still further our already extensive Professional Services delivery capabilities, so that our customers can concentrate on running their businesses, whilst we provide soup-to-nuts design-build-manage-maintain services for them. Taken together, these investments are intended to provide support for any Analytic on any data, with virtual computing to provide transparent orchestration services, seamless data synchronisation – and simplified systems management and administration.
If our continued leadership of the Gartner Magic Quadrant for Analytic Database Management Systems is any guide, our Unified Data Architecture strategy is working. More importantly, more and more of our customers are now deploying Logical Data Warehouses of their own using our technology. Big Data is neither unprecedented, nor is it bunk; to paraphrase William Gibson, “it’s just not very evenly distributed”. By making it easier to deploy and exploit a Unified Data Architecture, Teradata is helping more and more of our customers to compete effectively on Analytics; to be Big Data-Driven.