Data Engineering Needs Data Observability
by Rohit Choudhary, CEO & Co-Founder, Acceldata
Modern data pipelines are loosely assembled sets of processes and technologies that move data from its point of origin to its eventual consumption. Before data is consumed through interfaces such as SQL, custom applications, ML and AI, it undergoes several transformations that turn messy input data, files and events into consumable datasets.
Data engineering is full of complex logic that transforms data over its lifetime, from its origin all the way to the point of consumption.
This complexity leads to cascading failures as logic changes rapidly and data volumes grow. Changes can come from the way data arrives, how much data arrives, seemingly small logic changes made to satisfy a new business group, or support for a new business scenario.
Just like code, the logic associated with data can change frequently, and the data itself changes equally fast. As data volume changes, resource requirements change too. Many of these abrupt changes lead to disruptions.
Programmatic consumption of data is built on assumptions of consistent structure, availability of data fields, conformance to formats, frequency of arrival and the reliability of underlying compute infrastructure within certain time intervals.
If these assumptions do not hold, data pipelines break. Data pipelines can therefore be fragile. Data teams keep an eye out for breaking changes and try to preempt issues that create havoc: unhappy teams, long hours of data reprocessing, and lag in dashboards.
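To make those assumptions concrete, here is a minimal sketch of what checking a few of them might look like for a single batch of records. Everything in it is hypothetical and chosen only for illustration: the field names (`order_id`, `amount`, `created_at`), the one-hour arrival threshold, and the `check_batch` helper are assumptions, not part of any particular product or pipeline.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expectations for one feed; these names and limits are assumptions.
REQUIRED_FIELDS = {"order_id": int, "amount": float, "created_at": str}
MAX_BATCH_AGE = timedelta(hours=1)


def check_batch(records, arrived_at, now):
    """Return a list of human-readable violations for one batch of records."""
    problems = []

    # Frequency of arrival: the batch should have landed recently enough.
    age = now - arrived_at
    if age > MAX_BATCH_AGE:
        problems.append(f"stale batch: arrived {age} ago")

    for i, rec in enumerate(records):
        # Structure and field availability: every required field, with the right type.
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in rec:
                problems.append(f"record {i}: missing field '{field}'")
            elif not isinstance(rec[field], expected_type):
                problems.append(
                    f"record {i}: field '{field}' is not {expected_type.__name__}"
                )

        # Conformance to format: timestamps should parse as ISO 8601.
        created = rec.get("created_at")
        if isinstance(created, str):
            try:
                datetime.fromisoformat(created)
            except ValueError:
                problems.append(f"record {i}: 'created_at' is not ISO 8601")

    return problems


# Example: a late batch whose second record has a wrong type and a missing field.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
batch = [
    {"order_id": 1, "amount": 9.99, "created_at": "2024-01-01T10:00:00"},
    {"order_id": "2", "amount": 5.0},
]
issues = check_batch(batch, arrived_at=now - timedelta(hours=3), now=now)
for issue in issues:
    print(issue)
```

A check like this only reports what it finds. Whether a violation should stop the pipeline, raise an alert, or just be logged is a decision each data team has to make for its own workloads.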
I had the opportunity to learn about the above in raw, angry production environments where data was critical to business outcomes.
Let me tell you — observability is not just for applications and micro-services.
It’s for the most popular persona of our times — the data engineer.