Part 1: Alerting for the Open-Core Enterprise Data Stack
“In software systems, it is often the early bird that makes the worm.” — Alan Perlis
Enterprise data infrastructure continues to grow in size, complexity and business value. Open-source and open-core software is firmly entrenched in the Enterprise Data Stack used to build Data Intensive Applications. Bottom-up selection of software and architecture gives complex production deployments tremendous momentum. The result is a densely interconnected system in which an ever-increasing number of data pipelines carry business-critical data, often guarded by a single trip-wire and at risk of operational failure.
At Acceldata we believe that platform reliability is the key to running great data teams. The Enterprise Data Stack, whether built on open-source or open-core software, is missing an alerting mechanism that can surface cross-sectional, correlated insights.
Acceldata’s alerting platform is built for Data Intensive Applications that perform stream, real-time and batch processing.
The Acceldata alerting engine sends advanced notifications across various channels for a wide range of situations, including:
- Lack of capacity on a YARN queue
- Ever-increasing size of Hive data partitions
- Ever-increasing number of small HDFS files
- SLA violations on critical Hive business processes
- Stuck jobs due to straggling Spark SQL queries
- Lagging consumers of a Kafka topic
- Poorly written Spark code resulting in excessive garbage collection
- PySpark ML algorithms running slower than their standard SLAs
- Hardware issues such as CPU utilisation, slow disks and I/O issues
and many more.
Cluster admins can act on these advanced notifications to guide the system back to its normal state. A unique feature of this alerting mechanism is that notifications can also be acted on automatically through the Automated Actions Framework, which will be covered in a separate post. DevOps, which is morphing into DataOps, needs all the assistance it can get.
The design considerations of this alerting system are as follows:
- Clean separation from the core data systems, with absolutely no interference in their operation
- A unified DSL for creating alerts across all kinds of data systems
- Robust evaluation of comparative, mathematical, statistical and ML rules
- An evaluation engine that works across different kinds of metrics data sources, such as document stores, time-series stores and in-stream data
- A unified, intuitive system for configuring infrastructure and application alerts alike
- Notifications through channels appropriate to each process, ranging from email to pagers
- The ability to trigger pre-configured auto-corrective workflows
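To make the first few considerations concrete, here is a minimal Python sketch of how a single rule shape could cover both a comparative check (a fixed threshold) and a statistical check (a z-score against recent history). The post does not show Acceldata's DSL, so the `Rule` fields, the `breaches` function and the metric name are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Sequence


@dataclass(frozen=True)
class Rule:
    """Hypothetical alert rule; the field names are illustrative, not Acceldata's DSL."""
    metric: str   # name of the metric being watched
    kind: str     # "threshold" (comparative) or "zscore" (statistical)
    limit: float  # threshold value, or maximum allowed z-score


def breaches(rule: Rule, history: Sequence[float], latest: float) -> bool:
    """Return True when the latest sample violates the rule."""
    if rule.kind == "threshold":
        # Comparative rule: fire when the value exceeds a fixed limit.
        return latest > rule.limit
    if rule.kind == "zscore":
        # Statistical rule: fire when the value is far from the recent mean.
        if len(history) < 2:
            return False  # not enough history to judge
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            return latest != mu  # flat history: any change is a deviation
        return abs(latest - mu) / sigma > rule.limit
    raise ValueError(f"unknown rule kind: {rule.kind}")
```

The same `breaches` call works whether the samples come from a document store, a time-series database or a stream, because the rule only sees numbers. That is one simple way to keep alert definitions independent of the underlying data system.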
The following are the core components of the Acceldata alerting system:
- Alert Service: the glue component for the rest of the system. It runs the evaluators that correspond to the configured actions.
- Evaluators: convert alert definitions written in the DSL into the appropriate database queries and execute them on the configured schedule.
- Notification system: when an evaluator triggers an incident, the notification system sends the incident across the various channels.
- REST Server: provides APIs for alert CRUD, incidents and executions.
- Administration Interface: a single-page application that runs in the browser and is used to configure the system.
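The flow between the evaluator and the notification system can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Acceldata's implementation: `AlertDefinition`, `Incident`, `to_query`, `evaluate` and `notify` are hypothetical names, the SQL-like query text is made up, and scheduling is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical alert definition; not the actual DSL."""
    name: str
    metric: str
    operator: str     # ">" or "<"
    threshold: float


@dataclass(frozen=True)
class Incident:
    alert: str
    value: float


def to_query(alert: AlertDefinition) -> str:
    # Evaluator, step 1: translate the definition into a query for the
    # metrics store. The SQL-like syntax here is purely illustrative.
    return f"SELECT last(value) FROM {alert.metric}"


def evaluate(alert: AlertDefinition,
             run_query: Callable[[str], float]) -> Optional[Incident]:
    # Evaluator, step 2: execute the query and compare against the threshold.
    if alert.operator not in (">", "<"):
        raise ValueError(f"unsupported operator: {alert.operator}")
    value = run_query(to_query(alert))
    fired = value > alert.threshold if alert.operator == ">" else value < alert.threshold
    return Incident(alert.name, value) if fired else None


def notify(incident: Incident,
           channels: Dict[str, Callable[[Incident], None]]) -> List[str]:
    # Notification system: fan the incident out to every configured channel
    # and report which channels were used.
    delivered = []
    for channel_name, send in channels.items():
        send(incident)
        delivered.append(channel_name)
    return delivered
```

In this sketch the role of the Alert Service would be to call `evaluate` for each configured alert on its schedule and pass any resulting `Incident` to `notify`. Passing `run_query` and the channel senders in as plain callables keeps the evaluator independent of any particular data store or delivery mechanism.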
In the next parts of this series, we will cover real-world scenarios of alert creation, incident management and auto-corrective workflows. We will put these in context with examples from Infrastructure, Storage, Streaming and Alerting systems.