Why modern stream processing needs a new architecture: real-time analytics for the explosion of fast-moving device data whose high value lasts only a short time
There is a rapid shift happening at the data level as we speak. More and more data is being created by devices, and this data is fast-moving and carries very high value but perishable insights. This shift demands a new architecture for stream processing platforms for real-time data analytics. Data has been growing exponentially; more data now streams through the wire than we can keep on disk, from both a value and a volume perspective. This data is being created by everything we deal with on a daily basis. When humans were the dominant creators of data, we naturally had less data to deal with, and its value persisted for a longer period. This still holds true today wherever humans are the sole creators of the data.
However, humans are no longer the dominant creators of data. Machines, sensors, devices, and the like took over long ago. Data is now predominantly created by machines at humongous speed, so much so that 90% of all data created since the dawn of civilization was generated in the last two years. This data tends to have a limited shelf life as far as value is concerned: the value of data decreases rapidly with time. If data is not processed as soon as possible, it may not be very useful for ongoing business and operations. Naturally, we need a different thought process and approach to deal with this data.
Since more of this data is streaming in from many different sources, huge value could be created for users and businesses if it were combined and analyzed. At the same time, given the perishable nature of the data, it is imperative that it be analyzed and used as soon as it is created.
The value of data is at its maximum when the data is created; streaming data is perishable in nature, so insights must be extracted immediately.
More and more use cases are emerging that push the boundaries of what analytics must achieve. These use cases demand collecting data from different sources, joining it across different layers, and correlating and processing it across different domains, all in real time. The future of analysis is less about understanding “what happened” and more about “what’s happening, or what may happen”.
Let’s analyze some of the use cases. Consider an e-commerce platform integrated with a real-time stream processing platform. Using this integrated streaming analysis, it could combine and process different data in real time to figure out the intent or behavior of a user and present a personalized offer or piece of content. This could increase the conversion rate significantly, or arrest eroding customer engagement. It could also enable better campaign management, yielding better results for the same spend.
Think of a small or mid-size data center (DC), which typically has many different kinds of devices and machines, each generating volumes of data every moment. DCs typically use many different static tools for different kinds of data, in different silos. These tools not only prevent the DC from having a single view of the entire facility but also function as little more than BI tools. Because of this, issues are not identified in a predictive or real-time manner, and firefighting becomes the norm of the day. With a converged, integrated stream processing platform, a DC could have a single view of the entire facility along with real-time monitoring of events and data, ensuring issues are caught before they create bigger problems. A security breach could be seen or predicted much earlier, before the damage is done. Better resource planning and provisioning could be done by analyzing bandwidth usage and forecasting in near real time.
The entire IoT is based on the premise that everything can generate data and interact with other things in real time to achieve larger goals. This requires a real-time streaming analytics framework that can ingest all sorts of unstructured data from disparate sources, monitor it in real time, and take action as required after identifying either known patterns or anomalies.
AI and predictive analytics require that data be collected and processed in real time; otherwise, the impact of AI is limited to understanding what happened. With the growth in data volume and variety, it would be imprudent to rely solely on what has been learned in hindsight; the demand will be for reacting to new things as they are seen. We have also learned from experience that a model trained on older data often struggles to handle newer data with acceptable accuracy. Here too, the real-time stream processing platform becomes a required component rather than a nice-to-have.
There are two broad categories into which we can slot the options available in the market: the appliance model, and a series of open-source tools that must be assembled into a platform. While the former costs several million dollars upfront, the latter requires dozens of consultants for several months to stitch together a platform. Time to market, cost, ease of use, and the lack of unified options are a few major drawbacks. However, there are bigger issues that neither of these options addresses when it comes to stream processing, and here we require a new approach. We can’t apply older tools to newer, future-looking problems; otherwise, the result will remain a patchwork that does not scale to the needs of the hour.
Here are the basic high-level challenges when it comes to dealing with streams of data and processing them in real time.
Most of the options in the market suffer from these bottlenecks. Let’s take a few examples.
Spark follows the map-reduce model philosophically, although in a much more efficient manner. However, it still deals in batches: Spark processes micro-batches of a given size at a given batch interval. This creates several problems when it comes to stream processing; in fact, its approach is the antithesis of stream processing.
This model typically uses five or more distributed verticals, each containing many different nodes. This increases network hops and data copies considerably, which in turn increases latency. Scaling such a system is not trivial, as there are different dynamic requirements at different levels. Further, the cost of adding new processing logic is significantly higher than with a simple BI tool, where things can be handled from a dashboard. Finally, it requires a large team and significant resources, which increases cost. Such a system can hardly be deployed where sub-second latency is desirable.
AWS Kinesis is at best equivalent to Kafka: a distributed, partitioned messaging layer. Users still must assemble the processing, storage, visualization, and other layers themselves.
We need a platform that is designed and implemented in a homogeneous manner, toward the single goal of processing stream data in real time, that avoids all the above pitfalls, stays resilient to future requirements, and scales well for higher loads and volumes of data.
BangDB has tried to address most of the above-mentioned problems by designing and building the entire stack from the ground up. Here is a brief introduction to the BangDB platform in light of the issues identified above.
BangDB has built a high-performance NoSQL database that scales well for large amounts of data, and it follows a convergence model to scale linearly. Scaling a “single thing” rather than “many things” addresses the problem to a large extent. BangDB also follows an FSM model and implements SEDA (staged event-driven architecture) to provide a cushion against sudden surges in data.
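The SEDA idea can be illustrated with a small sketch (plain Python for illustration only, not BangDB's actual internals): each stage owns a bounded event queue and a handler, so a sudden burst fills the queue and signals back-pressure to the producer instead of overwhelming the handler.

```python
from collections import deque

class Stage:
    """One SEDA-style stage: a bounded event queue plus a handler.

    The bounded queue is the 'cushion': it absorbs a burst, and when
    full it reports back-pressure instead of silently dropping work.
    """
    def __init__(self, name, handler, capacity=1024):
        self.name = name
        self.handler = handler
        self.queue = deque()
        self.capacity = capacity

    def offer(self, event):
        # Returns False when the stage is saturated (back-pressure signal).
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(event)
        return True

    def drain(self, downstream=None):
        # Process queued events; forward each result to the next stage.
        while self.queue:
            result = self.handler(self.queue.popleft())
            if downstream is not None:
                downstream.offer(result)

# Wire two hypothetical stages: parse -> aggregate
totals = []
parsed = Stage("parse", lambda raw: int(raw))
summed = Stage("sum", lambda n: totals.append(n) or n)

for raw in ["1", "2", "3"]:
    parsed.offer(raw)
parsed.drain(downstream=summed)
summed.drain()
print(sum(totals))  # -> 6
```

In a real engine each stage would run on its own thread pool; here the stages are drained sequentially to keep the sketch deterministic.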
BangDB follows a true convergence model for higher performance, ease of management, and massive linear scale
BangDB removes all silos. Silos not only add latency but also force data to be copied across different verticals.
BangDB processes data before it reaches the disk, the opposite of most systems in the market. Further, BangDB avoids post-processing as much as possible, keeping it almost negligible. All of this happens on the node where the data arrives, so there are no extra network hops for the data.
Convergence allows BangDB to partition the application, the data, and all other resources along a single dimension. This means partitioning one space rather than several different spaces, which naturally enforces optimal use of added capacity and resources, something that is otherwise difficult to predict and provision for.
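Single-dimension partitioning can be sketched generically (this is an illustration of the concept, not BangDB's implementation): a stable hash assigns each key to exactly one partition, where its data and its processing are colocated, so adding a partition adds capacity along one dimension only.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical cluster size for the sketch

def partition_for(key: str) -> int:
    # A stable hash ensures the same key always lands on the same
    # partition, so the data and the compute for that key stay together.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return digest[0] % NUM_PARTITIONS

# Events for one device always route to one partition, where both
# its stored state and its processing logic live.
p1 = partition_for("device-42")
p2 = partition_for("device-42")
print(p1 == p2)  # -> True
```

Contrast this with a siloed stack, where the messaging tier, processing tier, and storage tier each partition independently and must each be re-provisioned when load grows.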
BangDB processes every single event rather than micro- or macro-batches. Hence, data is updated in real time, patterns are identified in real time, and insights are served to the application in real time. Most real-time streaming use cases emphasize the need to process data in a sliding window; BangDB provides a configurable sliding window within which most of the processing happens.
True stream processing with continuous sliding window. Most of the operations happen within the sliding window
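A continuous sliding window can be sketched as follows (a minimal illustration of the concept, not BangDB's API): each arriving event is aggregated immediately, and events older than the window span are evicted as time advances.

```python
from collections import deque

class SlidingWindow:
    """Time-based sliding window: keeps only events from the last
    `span_seconds` seconds and aggregates over them continuously."""
    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()  # (timestamp, value), in arrival order

    def add(self, ts, value):
        self.events.append((ts, value))
        self._evict(ts)  # per-event processing: no batch boundary

    def _evict(self, now):
        # Drop events that have slid out of the window.
        while self.events and self.events[0][0] < now - self.span:
            self.events.popleft()

    def aggregate(self):
        return sum(v for _, v in self.events)

w = SlidingWindow(span_seconds=60)
w.add(0, 10)
w.add(30, 20)
w.add(90, 5)   # the event at t=0 slides out of the 60s window
print(w.aggregate())  # -> 25
```

The key contrast with micro-batching is that the window state is correct after every single event, not only at batch-interval boundaries.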
BangDB follows the reverse of MapReduce to achieve very high read performance. This is done by avoiding all post-processing of data and keeping the data in the format needed by the user.
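The "reverse of MapReduce" idea, materializing results at write time so reads need no reduce step, can be sketched like this (a generic illustration, not BangDB's storage format):

```python
# Materialize aggregates at ingest time so a read is an O(1) lookup,
# instead of scanning raw events and reducing them at query time.
raw_events = []   # full raw stream, retained for deeper offline analysis
views = {}        # pre-computed per-key aggregates, in read-ready format

def ingest(key, value):
    raw_events.append((key, value))
    agg = views.setdefault(key, {"count": 0, "sum": 0})
    agg["count"] += 1
    agg["sum"] += value

def read(key):
    # No scan and no reduce step: the answer already exists
    # in the format the user asked for.
    return views.get(key)

ingest("sensor-a", 5)
ingest("sensor-a", 7)
print(read("sensor-a"))  # -> {'count': 2, 'sum': 12}
```

MapReduce pays the aggregation cost at read (job) time; this approach pays it incrementally at write time, which suits workloads where the same insight is read far more often than it changes.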
BangDB stores both the raw data and the extracted insights or aggregated data within the system. It is a persistent platform, so it can store as much data as required. However, it is often critical to process data in real time and then push it to an offline system for deeper analysis; therefore, BangDB also connects with Hadoop and other long-term storage frameworks.
BangDB has an IO layer that uses SSDs as an extension of RAM rather than as a replacement for the file system, thereby allowing out-of-memory computation and data handling without severe degradation in performance.
BangDB also uses SSDs in a totally different way to achieve cost-effectiveness and elasticity. SSDs are typically used as a replacement for file systems (or HDDs), where the gain is limited, and if they are not used properly, both their lifespan and their performance degrade. BangDB has written software that treats the SSD as an extension of memory, which can increase performance multifold while achieving cost-effectiveness to a great extent.
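The "SSD as an extension of memory" idea can be sketched as a two-tier store (a simplified illustration, not BangDB's IO layer; the SSD tier is simulated with a dict): hot data stays in a bounded RAM tier, the coldest entries spill to the SSD tier, and SSD hits are promoted back into RAM.

```python
from collections import OrderedDict

class TieredStore:
    """RAM tier with LRU eviction; overflow spills to an SSD tier.

    A real engine would batch and align SSD writes to respect
    erase-block behaviour; here the SSD is just a second dict.
    """
    def __init__(self, ram_capacity):
        self.ram = OrderedDict()   # LRU order: oldest first
        self.ssd = {}              # simulated SSD tier
        self.ram_capacity = ram_capacity

    def put(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        if len(self.ram) > self.ram_capacity:
            cold_key, cold_val = self.ram.popitem(last=False)
            self.ssd[cold_key] = cold_val   # spill the coldest entry

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)       # keep hot data hot
            return self.ram[key]
        if key in self.ssd:
            value = self.ssd.pop(key)
            self.put(key, value)            # promote back into RAM
            return value
        return None

store = TieredStore(ram_capacity=2)
store.put("a", 1); store.put("b", 2); store.put("c", 3)  # "a" spills to SSD
print(store.get("a"))  # -> 1, served after promotion from the SSD tier
```

Treating the SSD as a slower memory tier, rather than as a disk behind a file system, keeps the working set larger than RAM without the random small writes that shorten SSD life.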
BangDB aims to be predictive. It processes and analyzes data in both an absolute and a predictive manner: it uses complex event processing (CEP) for absolute pattern recognition, and supervised and unsupervised machine learning for pattern and anomaly detection. The BangDB platform provides simple ways to upload and train models, and it also integrates with “R” for data science requirements.
Both the appliance and open-source models provide a technology platform on which new analytic applications or processing code must be developed and deployed to the production system. This requires a typical test-to-production DevOps and release-management cycle. BangDB provides an integrated dashboard that makes the platform fully extensible: users can perform all actions from the dashboard without ever developing code or applications. Further, BangDB has developed pre-baked apps in different domains and uploaded them to its AppStore, so users can simply take those apps, configure them, and start working with real-time insights.
The BangDB platform is hosted in the cloud as a SaaS model, along with an AppStore offering several solutions. This allows users to get started within an hour or even less. There is no stitching time, deployment time, or even development time; everything is ready to go with a few clicks.
BangDB can be deployed within a device for state-based computation, including CEP and ML processing. Further, BangDB can also run on the LAN and in the cloud, and all of these deployments can be interconnected for richer orchestration. BangDB has a subscription model; users can start within a minute using BangDB SaaS and then grow as needed. Get started with BangDB by simply downloading it.
Further reading: check out two other blogs on similar topics which might be useful.