Part 1: Let's keep it logical
When I look at the logical layers of Lambda Architecture, I always try to match them to the customer requirements.
Having talked to numerous international customers over the past years, I have come to realize that we are slowly moving beyond the Lambda approach (which has nicely described certain technological choices). In today's BI world, customers no longer think in the categories of "real-time" and "batch". We are moving towards events and interactions. While we could say: this is where the Lambda real-time layer will work, we need to ask: is this enough?
My view on that is:
– We need events that require actions and interactions without much analytics
– We need events that require actions but also need to be enriched by analytics in the ecosystem (based on other information sources)
– We need events that will be handled later or that support the above-mentioned cases.
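To make the three cases concrete, here is a minimal routing sketch. The event kinds, category names, and routing rules are my own illustrative assumptions, not part of any product or the architecture itself:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Handling(Enum):
    """The three event cases described above (names are mine)."""
    ACT_DIRECTLY = auto()     # act/interact without much analytics
    ENRICH_THEN_ACT = auto()  # act, but enrich with analytics first
    DEFER = auto()            # handle later / support the other cases

@dataclass
class Event:
    kind: str
    payload: dict = field(default_factory=dict)

def classify(event: Event) -> Handling:
    """Route an event to one of the three handling paths.
    The event kinds used here are purely illustrative."""
    if event.kind == "fraud_alert":
        return Handling.ENRICH_THEN_ACT  # needs context from other sources
    if event.kind == "click":
        return Handling.ACT_DIRECTLY     # immediate interaction, no analytics
    return Handling.DEFER                # archive for later (batch) processing

print(classify(Event("click")).name)        # ACT_DIRECTLY
print(classify(Event("fraud_alert")).name)  # ENRICH_THEN_ACT
print(classify(Event("audit_log")).name)    # DEFER
```

The point of the sketch is that all three cases are decided per event, not per layer.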
So, do we really need to make a logical distinction between real-time and batch, considering that real-time streams can be built to work as mini-batches, and that a batch load can define and execute actions while the data is being loaded?
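One way to see why the distinction blurs: the same action can run over mini-batches cut from a stream or over a full batch load. A minimal sketch, with function names of my own choosing:

```python
from itertools import islice
from typing import Iterable, Iterator, List

def mini_batches(stream: Iterable[int], size: int) -> Iterator[List[int]]:
    """Cut a (potentially endless) stream into fixed-size mini-batches."""
    it = iter(stream)
    while batch := list(islice(it, size)):
        yield batch

def process(records: List[int]) -> int:
    """One action that works on a mini-batch or a full batch load alike."""
    return sum(records)

stream = range(10)  # stand-in for a real-time feed
print([process(b) for b in mini_batches(stream, 4)])  # [6, 22, 17]
print(process(list(range(10))))                       # 45 – same action, batch load
```

The same `process` function serves both paths; only the slicing differs, which is a physical rather than a logical concern.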
Keeping this in mind, I believe we should start evolving the Lambda Architecture into something like an "Omega Architecture". The name is not the relevant part here; what matters is that we see how customers want to evolve based on their business needs, not their technical ones.
Having said that, I believe it would be a poor solution to simply replace the real-time and batch layers with an event layer laid out across the whole BI landscape, as we do not need special layers just to query streams; that can come as part of the physical implementation. What would happen if I needed to enrich such a query with data coming from the batch views of the Lambda definition?
Instead, I suggest combining both layers. This would make it possible to provide results not just at the end, but also in the middle or even at the beginning of a query, depending on how the business requires interaction. Based on that, I have taken the liberty of defining an architecture which, in my opinion, is an evolution of Lambda.
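The idea of delivering results in the middle of a query rather than only at the end can be sketched as an incremental aggregation; this is my own simplification, not a feature of any particular engine:

```python
from typing import Iterable, Iterator

def incremental_sum(stream: Iterable[float]) -> Iterator[float]:
    """Emit a partial result after every record instead of only at the end."""
    total = 0.0
    for value in stream:
        total += value
        yield total  # consumers can already act on this mid-query

partials = list(incremental_sum([10, 5, 25]))
print(partials)      # [10.0, 15.0, 40.0]
print(partials[-1])  # 40.0 – the classic end-of-query answer
```

A consumer that only wants the final answer reads the last value; an interactive consumer can react to each intermediate one.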
Why do I call this an evolution? Because when you take a closer look at it, I have made a distinction between real-time (streams) and other feeds only at the initial stage. With this approach I wanted to highlight that Hadoop offers several options that share the same infrastructure for different applications; however, these will not always support streams and other data feeds at the same time. Data are initially ingested into what I call an "All data" layer, which can be seen as an early stage of the data lake where I store all the raw data for later use. This way, information can be pulled from the raw data, which of course must follow security and privacy rules depending on the industry or customer-specific requirements. Once data are stored and governed, users can start with data discovery. If I want to enable business users to run analytics in an agile way, the data must be stored in the data lake layer.
Data binding represents an approach that allows applying different integration techniques to the information: I can store the data and parse the schema later, or I can parse the schema right away so that the data are already refined based on customer-specific business rules. In modern BI, the question of how to bind data is key. But there is no single answer to it, which is why I need to make sure our evolved architecture can support both. The rest of the architecture defines different access patterns for the users, from reporting via analytics to more of a data science approach.
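The two binding styles – storing raw data and parsing the schema later versus parsing it right away – are commonly known as schema-on-read and schema-on-write. A minimal sketch of both, with invented sample data and field names:

```python
import json

RAW = ['{"id": 1, "amount": "42.5"}', '{"id": 2, "amount": "7"}']

def query_late(raw_records):
    """Late binding (schema-on-read): keep raw strings, parse at query time."""
    for line in raw_records:
        rec = json.loads(line)           # schema applied only now
        yield float(rec["amount"])

def load_early(raw_records):
    """Early binding (schema-on-write): refine into a fixed schema at load time."""
    return [
        {"id": int(r["id"]), "amount": float(r["amount"])}
        for r in map(json.loads, raw_records)
    ]

print(sum(query_late(RAW)))                       # 49.5
print(sum(r["amount"] for r in load_early(RAW)))  # 49.5
```

Both paths answer the same question; they differ in when the business rules are applied, which is exactly why the architecture must support both.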
As we can see, at every step of the architecture there is constant interaction with the consumers of the information – irrespective of whether it is an application, a message broker or a business user – which makes it possible to meet any customer-specific requirements.