The Big Data movement has given rise to runaway growth of new file systems and databases with various data access mechanisms that attempt to associate themselves with SQL but differ widely in their architecture, usage (application), and development and implementation approaches. As a result, there is confusion in the business intelligence and analytics communities, with assumptions that NOSQL (Not Only SQL) and NewSQL databases are a replacement for relational databases in BI and analytics.
In fact, many of the NOSQL ‘databases’ simply serve as a replacement for the RDBMSs traditionally used in some OLTP applications that needed a refresh to cater for the agility required by Web 2.0 and smartphone mobile applications, and they promote the use of Hadoop / Hive for reporting.
I have been curious myself to see how the NOSQL databases would fit into the BI and analytics ecosystem. In this blog, I attempt to look at data design aspects in different eras of application development, viz. the Pre-SQL Era, the SQL Era and the Post-SQL Era.
The Pre-SQL Era was characterised by mainframe systems, procedural programming and file-based processing with serial / indexed sequential access, in the period before and well after E. F. Codd conceived the relational algebra. Data processing costs were very high, which led to a heavy focus on efficiency in design and coding, much of which involved capturing documents efficiently in file systems, as there were no database management systems. Schema-free design allowed for variability in record descriptions within the same file. The file section of the COBOL programs that I often wrote would look something like the design below. Notice the repeating groups marked in red:
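As a rough sketch of the idea in modern terms (all field names here are my own illustrative assumptions, not the original COBOL file section), such a file holds records with nested repeating groups, and two records in the same file need not share a layout:

```python
# Hypothetical sketch of a Pre-SQL passenger-ticket file, expressed as
# Python dicts. The repeating group (flight segments) is nested inside
# each record -- an OCCURS clause in COBOL terms -- and the two records
# in the same file have different field layouts.
record_a = {
    "ticket_no": "081-4412345678",
    "passenger_name": "SMITH/J",
    "segments": [                      # repeating group within the record
        {"flight": "QF401", "origin": "SYD", "dest": "MEL"},
        {"flight": "QF402", "origin": "MEL", "dest": "SYD"},
    ],
}
record_b = {                           # differently shaped record, same file
    "ticket_no": "081-4498765432",
    "surname": "JONES",
    "initials": "A B",
    "segments": [{"flight": "QF1", "origin": "SYD", "dest": "LHR"}],
}
flat_file = [record_a, record_b]       # serial access: records read in order
```

A downstream program had to know each record's layout to walk it; there was no declarative way to ask questions of the file.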
There was no BI. Computer programs had to be written to produce reports and/or any downstream processing, barring a few limited software packages, such as FIND2, XMERGE and XSORT from computer manufacturers (IBM, ICL et al.), that allowed declarative reporting and further processing with simple aggregation based on key-value matches. Essentially, it was an era dominated by NoSQL (No SQL).
Reusability was facilitated through the creation of ‘libraries’ for file descriptions (as above) that could be included in downstream consuming programs, which helped retain consistency and ensure quality of development. However, redundancy was unavoidable due to technology constraints and the lack of best-practice data management principles.
The SQL Era saw the evolution of Information Engineering (IE), which enabled the development of strategic enterprise data, function and process models independently of each other but nevertheless in an integrated fashion. It resulted in reusable processes for rapid delivery into production as integrated databases and reusable systems. A key aspect of IE planning is the top-down design of a logical data model that represents data and their relationships to the business. It is based on entity-relationship mapping that conforms to the principles of the relational model and the Normal Forms, which helps to reduce redundancy and ensures the integrity of the database with ACID (Atomicity, Consistency, Isolation, Durability) properties.
In the SQL Era, a redesign of the Passenger Ticket file from the Pre-SQL Era looks like the data model below.
The logical data model serves as a blueprint for the enterprise data warehouse (EDW), which is implemented in an RDBMS that collects the cross-enterprise data and makes it usable in a declarative style. Because of the upfront schema design and the robustness involved in architecting the solution, delivery usually takes longer, but it provides the benefits of a stable design, reliability, data integrity, ease of management and a lower total cost of ownership (TCO).
However, business managers can also supplement the EDW with data from external sources, such as market surveys, by employing an agile analytics methodology alongside structured development. This allows them to enhance and enrich the data already familiar to the enterprise.
A defining characteristic of the Post-SQL Era, viz. the NOSQL (Not Only SQL) movement, is agility: schema-less design, non-relational storage and no joins. In fact, one tutorial advises, “Duplicate the data, because disk space is cheap compared to compute time; do joins on write, not on read”. The methodology for software design is agile and bottom-up. Another hallmark of NOSQL is that it is designed for fault tolerance and based on the principles of BASE (Basically Available, Soft State, Eventually Consistent). Also, data / documents in NOSQL databases are mostly consumed by writing Java programs that make use of published APIs, with less focus on declarative style.
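The "join on write" advice can be sketched as follows (a minimal illustration with hypothetical field names, not any particular product's API): instead of storing a customer reference and joining at read time, the relevant customer attributes are copied into each ticket document when it is written.

```python
# Minimal sketch of write-time denormalization ("join on write"):
# customer attributes are duplicated into every ticket document so that a
# single key lookup answers the query -- at the cost of redundancy.
customer = {"customer_id": "C42", "name": "SMITH/J", "tier": "Gold"}

def write_ticket(store, ticket_no, flight, customer):
    # The "join" happens here, once, at write time.
    store[ticket_no] = {
        "ticket_no": ticket_no,
        "flight": flight,
        "customer_name": customer["name"],   # duplicated on write
        "customer_tier": customer["tier"],   # duplicated on write
    }

store = {}
write_ticket(store, "081-4412345678", "QF401", customer)

# Reading is a single lookup; no join is needed.
doc = store["081-4412345678"]
```

The trade-off is exactly the one the tutorial names: cheap disk absorbs the duplication, but every copy must be updated if the customer's details change.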
They also recommend using Hadoop HDFS for storage and Hive for reporting. All of this implies data redundancy, reduced reliability, and increased governance overhead and costs.
I wanted to see what a NOSQL design would look like for the same application that I designed above. I picked MongoDB, which comes up frequently in discussions of document stores. In MongoDB, my design starts to look like the one below.
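As a hedged sketch (field names are my own assumptions, written as the Python dicts a MongoDB driver would accept), the PassengerTicket collection might contain documents like these:

```python
# Hypothetical documents in a MongoDB 'PassengerTicket' collection.
# Note the dynamic schema: the first document has no baggage and splits the
# passenger name into two fields; the second uses a single name field, a
# differently named frequent-flyer field, and a repeating baggage group.
doc1 = {
    "ticketNo": "081-4412345678",
    "firstName": "John",
    "lastName": "Smith",
    "frequentFlyerId": "FF1234",
}
doc2 = {
    "ticketNo": "081-4498765432",
    "name": "Anne Jones",
    "ffId": "QF-9876",                 # same concept, different field name
    "baggage": [                       # repeating group within one document
        {"tag": "QF123456", "weightKg": 23},
        {"tag": "QF123457", "weightKg": 18},
    ],
}
passenger_ticket = [doc1, doc2]        # both live in the same collection
```

Nothing in the database obliges these two documents to agree on fields or structure; that contract moves into the consuming application code.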
In MongoDB, documents have a dynamic schema and are stored in collections in the NOSQL database. A document is a set of key-value pairs, represented in JSON (JavaScript Object Notation). Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection’s documents may hold different types of data.
This allows for the insertion of data without a predefined schema. In fact, it looks like the file design of the Pre-SQL Era and does not even pass the test for 1NF (First Normal Form)! In the example above, notice that there are two documents in the PassengerTicket collection, with one customer having no baggage while the other shows a repeating group. Also notice the differences in the fields for name and frequent flyer ID.
In short, NOSQL databases, be they document databases, key-value stores or column-family stores, are built to allow significant OLTP application changes in real time without worrying about service interruptions – which means development is faster and less database administrator time is needed.
Because of their agility and lack of a schema design requirement, NOSQL databases can place a huge burden on the ETL processes feeding the EDW, requiring increased emphasis and spend on data governance to ensure the organisation’s KPIs are correctly reported for reliable decision-making.
The good news is that if you are looking for agility in analytics without sacrificing the data quality, stability and reliability of the EDW, then there are a number of options available, viz. agile analytics with an integrated data lab, and Teradata’s Unified Data Architecture, optionally with in-database JSON integration for NOSQL data sources. These provide the lowest total cost of ownership (TCO), better information governance and the best business value while leveraging the agility and flexibility of NOSQL databases in the enterprise.
Teradata has long provided an agile analytics data warehouse with an integrated data lab: a self-provisioning, self-service environment for swift prototyping and analysis of new, external and uncleansed data.
Teradata’s Unified Data Architecture (UDA) provides the best of all worlds, enabling integration of the data source types of the Pre-SQL, SQL and Post-SQL eras within a seamless architecture. Teradata’s UDA is further enhanced with native JSON integration, which is particularly relevant to an “Active” data warehousing strategy, affording developers the flexibility and agility to quickly add new fields to databases without changing the schema. Best of all, it is analytics-ready for the ‘Internet of Things’ (IoT) and provides a time-to-market advantage.
Sundara Raman is a Senior Communications Industry Consultant at Teradata. He has 30 years of experience in the telecommunications industry that spans the fixed line, mobile, broadband and Pay TV sectors. He specialises in Business Value Consulting, business intelligence, Big Data and Customer Experience Management solutions for communication service providers. Connect with Sundara on LinkedIn.