This is the third instalment of a post examining the rather gushing and uncritical – not to mention recycled – coverage of the SAP HANA "in-memory" database technology in Scott M Fulton's recent article entitled "SAP's HANA: Accelerating Your Apps by 6 Orders of Magnitude". In the first and second instalments I reviewed some of the challenges involved in developing in-memory database technology and briefly discussed the economics of storing all data in memory. In this third and final instalment I will discuss SAP's apparent lack of commitment to Enterprise – rather than departmental – analytics, and the likely consequences for the way that HANA will be deployed in practice.
How do you like your redundancy? Macro – or micro?
Fulton quotes SAP's Global Solutions President, Sanjay Poonen, as claiming that data warehouses built on Teradata database technology result in the storage and movement of large volumes of redundant data. By contrast, SAP's claim is that because memory accesses are (a) fast and (b) consistent – and because HANA will sit at the centre of a vertically integrated SAP application stack – there will be no need for indexes or summary / aggregate tables, so that redundancy will be eliminated. However, in the very few benchmark situations where we have seen HANA tested head-to-head, our customers have reported to us that SAP consultants extensively optimised the physical data model – often deploying between five and ten indexes on each table – and in some cases had to pre-join tables just to get queries to run. Despite this extensive tuning, those same customers have reported that the performance of complex, ad-hoc queries in particular was (a) poor and (b) variable, with HANA unable to meet the Service Level Goals (SLGs) specified in the benchmark tests (in all cases, we have run those same tests on Teradata 2690 Data Warehouse Appliance systems and met or bettered those same SLG requirements). It turns out that faster storage access cannot compensate for weak cost-based optimisation and poor query planning.
(This is why, incidentally, the very selective "benchmark" details that SAP has put into the public domain all refer to star schemas – because a simple physical data model in which data are pre-joined and complex relationships are simplified is kinder to an unsophisticated DBMS optimiser. Of course, this pre-joining of data can also make more complex queries less efficient, or even impossible, which is why SAP architected its own flagship Data Warehouse solution – Business Warehouse (BW) – to include an integrated data layer as well as the extended "InfoCube" star schemas. Star schemas are great for providing a simplified view of complex data to end-users doing basic reporting, and for optimising the performance of basic reports where access paths are known in advance – but they have serious limitations for other important forms of analysis.)
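To make that trade-off concrete, here is a minimal sketch – using SQLite purely as a stand-in for any relational DBMS, with an invented schema – of how a pre-joined structure flatters the report it was built for while quietly making other questions unanswerable:

```python
# Hypothetical sketch (SQLite as a stand-in; invented schema) of why a
# pre-joined structure helps simple reports but hurts other questions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                            cust_id  INTEGER REFERENCES customers,
                            revenue  REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC'), (3, 'EMEA');
    INSERT INTO orders    VALUES (10, 1, 100.0), (11, 2, 250.0);

    -- The pre-joined structure, built for one known report (revenue by
    -- region). The inner join silently drops customer 3, who has no orders.
    CREATE TABLE sales_wide AS
        SELECT c.region, o.revenue
        FROM   orders o JOIN customers c ON c.cust_id = o.cust_id;
""")

# The report the structure was designed for: trivial, no join required.
print(con.execute(
    "SELECT region, SUM(revenue) FROM sales_wide GROUP BY region ORDER BY region"
).fetchall())                              # [('APAC', 250.0), ('EMEA', 100.0)]

# An ad-hoc question -- which customers have never ordered? -- cannot be
# answered from sales_wide at all; it needs the normalised tables back.
print(con.execute("""
    SELECT c.cust_id
    FROM   customers c LEFT JOIN orders o ON o.cust_id = c.cust_id
    WHERE  o.order_id IS NULL
""").fetchall())                           # [(3,)]
```

The inner join that built the wide table has already discarded the customer with no orders; no amount of raw speed against that structure will bring the row back.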
In fact, Poonen's assertion is essentially inaccurate anyway, precisely because we have worked very hard over the last 10–15 years – with the support of partners like SAP BusinessObjects – to move both data integration and data manipulation processing into the database wherever possible.
This was not always the case. Back in 1999, when I was a jobbing Data Warehouse manager trying to deliver a new report that required the derivation of a moderately complex metric from data stored in three separate tables, BusinessObjects would copy all three tables to the client and attempt to join them there – ignoring the fact that the Teradata data warehouse platform where the data lived was a sophisticated parallel computing platform designed for the very purpose of joining and combining data, and that the client was a Windows 98 PC at the end of a mediocre Local Area Network. These days, any BI tool worth its salt – including BusinessObjects – instead creates a derived temporary table in the database to join the data from the three tables and calculate the derived metric "in database". The only things that move across the LAN or WAN, mediocre or otherwise, are an SQL request in one direction and an answer set in the other. There may be some "micro redundancy" of data in the Teradata database in the form of indexes and "semantic layer" summary and aggregate tables (we call these "raw data extensions"), but these constructs are typically used only sparingly in Teradata systems. And because they make the Data Warehouse data structures more comprehensible – so that more end-users are able to make sense of and explore the data – they are a price well worth paying.
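For the curious, the difference between the two approaches looks roughly like this – a hedged sketch using SQLite as a stand-in and invented table names, not any particular BusinessObjects-generated SQL:

```python
# Minimal sketch (SQLite stand-in; invented tables) contrasting the 1999
# client-side join with today's "in database" processing.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE custs (cust_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (cust_id INTEGER, amount REAL);
    CREATE TABLE costs (cust_id INTEGER, amount REAL);
    INSERT INTO custs VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO sales VALUES (1, 500.0), (2, 800.0);
    INSERT INTO costs VALUES (1, 300.0), (2, 900.0);
""")

# Then: drag every row of all three tables across the network and join
# them on the client, however large the tables and however weak the PC.
rows = {t: con.execute(f"SELECT * FROM {t}").fetchall()
        for t in ("custs", "sales", "costs")}
# ... client-side dictionary lookups and arithmetic elided ...

# Now: one SQL request goes in, one answer set comes out. The join and
# the derived metric are computed in a temporary table in the database.
con.executescript("""
    CREATE TEMP TABLE margin AS
        SELECT cu.name, s.amount - co.amount AS margin
        FROM   custs cu
        JOIN   sales s  ON s.cust_id  = cu.cust_id
        JOIN   costs co ON co.cust_id = cu.cust_id;
""")
print(con.execute("SELECT * FROM margin").fetchall())
# [('Acme', 200.0), ('Globex', -100.0)]
```

The point is not the syntax but where the work happens: in the second form, only the request and the answer set ever leave the database.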
By contrast, I would contend that SAP has arguably the industry's worst record on analytic data redundancy. Not only is data stored redundantly within each Business Warehouse instance in the form of multiple, overlapping InfoCube objects – an attempt to alleviate BW's performance and scalability issues which, ironically, results in the load performance issues that I discussed in the first instalment of this post – but SAP customers are typically also forced to deploy multiple, overlapping BW systems. In some cases this is because they are driven to partition their analytic data in an attempt to address the aforementioned BW performance and scalability issues; in other cases it is a consequence of the fact that some SAP applications require the deployment of specific BW instances. The reality is that BW is not so much an Enterprise Data Warehouse solution as a platform for the deployment of multiple, application-specific Data Marts.
As an example of how BW is actually deployed in practice: I recently spoke to an analyst from a leading research and consultancy firm who told me about an organization that had found it necessary to deploy seventy-five separate, customised BW instances. Yes, you read that correctly. Seventy-five. This is "macro redundancy". On steroids. And next to it, a 25% raw data extension overhead is nothing, zip, nada.
In the same article, Poonen is quoted as claiming that SAP can migrate a 15,000-object BW physical data model from Oracle to HANA in only a few weeks – but what he doesn't say is how many of those objects are redundant, the legacy of SAP's attempts to mitigate BW's historic scalability and performance problems. And since in the next breath he apparently talks about migrating all of those objects to HANA without modification, it is clear that migrating a BW installation from an Oracle DBMS platform to HANA will mean migrating, maintaining – and continuing to pay for – all of that complex legacy.
We can all agree that data redundancy is the problem: the redundant infrastructure and Systems Integration costs that result dramatically increase TCO; multiple, overlapping data sets, stored in multiple target data models on multiple analytic platforms and maintained by multiple ETL processes, are invariably fatal to data quality and consistency; and without integration at the data model level, the cross-functional analysis that enables end-to-end optimisation (and re-engineering) of business processes simply isn't possible. But it's much harder to see any of SAP's current products as part of the solution – and it's not at all clear that SAP even recognises or acknowledges the extent of the micro- and macro-redundancy problems that it has.
It's the culture, stupid
Why would SAP require its customers to deploy multiple, redundant, overlapping BW instances just to run multiple analytical applications? After all, isn't the whole point of building an Enterprise Data Warehouse (EDW) – a repository of integrated data to support Business Intelligence and Analytics – to store one copy of the data, or close to it, and to bring the applications to the data, not the other way around?
The answer, I think, is that SAP is an applications company. Enterprise applications are the company's heritage and culture; they are its DNA. SAP – despite its acquisition of BusinessObjects – simply isn't an information management company focussed on analytics, and it has never really understood Enterprise Data Warehousing in the way that the leaders in the industry do.
And apparently it still doesn't. Even before it has demonstrated that HANA is a credible platform for large-scale Enterprise Analytics, SAP appears to be claiming that HANA will be The One Database To Rule Them All, able simultaneously to support all of an organization's operational and analytical applications – despite the fact that analysing data as captured, without any transformation or modification to address data quality issues, would require organizations to revolutionize their approach to information management and governance.
From a technology perspective, all of this would require mixed-workload management even more sophisticated than that found in Teradata's industry-leading implementation – and HANA appears to lack any such capability. Technology aside, this level of Enterprise – rather than departmental – integration is simply not possible without the sort of very robust industry logical data models that SAP apparently lacks. Instead, SAP will almost certainly default to the model it knows best – distributed deployment of multiple, overlapping analytical databases – and HANA will probably be deployed principally as an analytic application acceleration technology. This market segment – Data Mart, and principally BW Data Mart, performance acceleration – is the one that SAP's management is really targeting with HANA, both because simple, small-scale deployments of in-memory database technology will be only very – rather than extremely – expensive for its customers to deploy, and because they will not expose HANA's technological shortcomings. SAP's rhetoric about creating an affordable, scalable, high-performance platform for Enterprise Analytics is, at least for now, just that: rhetoric.
The song remains the same
So, as I explained in the first instalment of this post, building a high-performance, scalable in-memory DBMS as a platform for Enterprise Data Warehousing presents SAP with some significant technological challenges – and, as we discussed in the second instalment, significant economic ones too. And as we have seen in this third instalment, if SAP is serious about enabling Enterprise, rather than just departmental, analytics then it will need to fundamentally re-architect Business Warehouse so that it becomes a genuinely application-neutral Enterprise Data Warehouse.
Against that backdrop, Fulton's article might have been interesting and timely had he examined SAP's approach to overcoming these issues; had he assessed their progress in doing so; and had he identified any production references for HANA, complete with corroborated details.
Sadly, all Fulton has to share with us is a string of platitudes ("dramatic", "revolutionary", "turbocharger") – and the uncorroborated claim that HANA has improved performance by a factor of 400,000 at one customer, which, we are told, represents an improvement of "six orders of magnitude". Quite apart from the fact that statistics of this sort are utterly meaningless when quoted without context – indeed, as I have explained here before, they are deliberately disingenuous and intended to obfuscate – a factor of 400,000 represents, of course, five orders of magnitude, not six.
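For the record, taking an order of magnitude to mean a full factor of ten:

$$400{,}000 \;=\; 4 \times 10^{5} \;\approx\; 10^{5.6}, \qquad 10^{6} \;=\; 1{,}000{,}000 \;>\; 400{,}000$$

A factor of 400,000 clears five full powers of ten but falls well short of a sixth.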
SAP's marketing department is to be congratulated for generating so much hype around HANA that some journalists have got into the habit of repeating it verbatim, apparently without doing even basic fact-checking. Whether SAP's engineers deserve as much credit remains to be seen – but on the evidence available thus far, HANA looks, to me at least, like another Business Warehouse acceleration technology rather than a credible Enterprise Data Warehouse platform. And given SAP's historic approach to Business Intelligence – simple, report-oriented and based on functional data silos – that should probably surprise no one.
Director of Platform & Solutions Marketing