How Data Observability Goes Far Beyond Data Quality Monitoring and Alerts

  • Accuracy
  • Completeness
  • Consistency
  • Freshness
  • Validity
  • Uniqueness
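Each of these dimensions can be expressed as a measurable check over a dataset. A minimal sketch in Python of four of them, completeness, uniqueness, freshness, and validity (the record fields, thresholds, and predicates here are illustrative assumptions, not from the article):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical records; field names are illustrative assumptions.
records = [
    {"id": 1, "email": "a@example.com", "updated_at": datetime.now(timezone.utc)},
    {"id": 2, "email": None, "updated_at": datetime.now(timezone.utc) - timedelta(days=2)},
    {"id": 2, "email": "b@example", "updated_at": datetime.now(timezone.utc)},
]

def completeness(records, field):
    """Fraction of records with a non-null value for `field`."""
    return sum(r[field] is not None for r in records) / len(records)

def uniqueness(records, field):
    """Fraction of distinct values for `field` across all records."""
    values = [r[field] for r in records]
    return len(set(values)) / len(values)

def freshness(records, field, max_age=timedelta(hours=24)):
    """Fraction of records updated within `max_age`."""
    cutoff = datetime.now(timezone.utc) - max_age
    return sum(r[field] >= cutoff for r in records) / len(records)

def validity(records, field, predicate):
    """Fraction of non-null values that satisfy `predicate`."""
    values = [r[field] for r in records if r[field] is not None]
    return sum(predicate(v) for v in values) / len(values)
```

Scoring each dimension as a ratio makes it easy to track over time and to alert when a score drops below a target, which is the bridge from one-off testing to continuous monitoring.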

Data Quality Monitoring: Defensive and Dumb

Why Bad Data is an Epidemic

  1. Data pipeline networks are more sprawling and more complicated than ever, due to the rapid uptake of easy-to-use cloud data stores and tools by businesses over the past decade. The higher velocity of data creates more chances for data quality to degrade: every time data travels through a pipeline, it can be aggregated, transformed, reorganized, and corrupted.
  2. Data pipelines are more fragile than ever, due to their complexity, business criticality, and the real-time operations they support. For instance, a simple metadata change to a data source, such as adding or removing fields, columns, or data types, can create schema drift that invisibly disrupts downstream analytics.
  3. Data lineages are longer, too, while their documentation — the metadata that tracks where the data originated and how it has subsequently been used, transformed, and combined — has not kept pace. That makes it harder for users to trust data, and harder for data engineers to hunt down data quality problems when they inevitably emerge.
  4. Traditional data quality testing no longer suffices. Profiling data in depth when it is first ingested into a data warehouse is not enough: there are many more data pipelines feeding many more data repositories. Without continuous data discovery and data quality profiling, those repositories become data silos and pools of dark data, hiding in various clouds, their data quality problems festering.
  5. Data democracy worsens data quality and reliability. As much as I applaud the rise of low-ops cloud data tools and the resulting emergence of citizen data scientists and self-service BI analysts, I also believe they have inadvertently made data quality problems worse: by and large, these users lack the training and historical knowledge to consistently handle data well.
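The schema-drift failure mode in point 2 above can be caught mechanically by diffing a pipeline's live schema against an expected baseline on every run. A minimal sketch, where the table and column names are hypothetical:

```python
# Baseline schema the pipeline was built against (hypothetical example).
expected_schema = {"order_id": "bigint", "amount": "numeric", "created_at": "timestamp"}

def detect_schema_drift(expected, observed):
    """Compare two column-name -> type mappings and report drift."""
    added = sorted(set(observed) - set(expected))       # new columns upstream
    removed = sorted(set(expected) - set(observed))     # dropped columns
    changed = sorted(                                   # silent type changes
        col for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "type_changed": changed}

# Schema observed on the latest run: a column was added and a type changed.
observed_schema = {
    "order_id": "bigint",
    "amount": "varchar",
    "created_at": "timestamp",
    "discount": "numeric",
}
drift = detect_schema_drift(expected_schema, observed_schema)
```

Running a check like this before downstream jobs consume the data turns an invisible break into an explicit, attributable event.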

The Modern Solution for Data Quality: Data Observability

  1. Automate tasks such as data cleansing and reconciling data in motion in order to prevent minor data quality problems before they start
  2. Slash the number of false-positive and other unnecessary alerts your data engineers receive, reducing alert fatigue and the amount of manual data quality engineering work needed
  3. Predict major potential data quality problems in advance, enabling data engineers to take preventative action
  4. Offer actionable advice to help data engineers solve data quality problems, reducing MTTR and data downtime
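Point 2, cutting false positives, typically means alerting on statistical deviation from learned history rather than on fixed thresholds. A sketch of that idea using a z-score over recent daily row counts (the metric, history window, and threshold are illustrative assumptions):

```python
import statistics

def should_alert(history, latest, z_threshold=3.0):
    """Alert only when `latest` deviates from the historical mean by more
    than `z_threshold` standard deviations, suppressing alerts for
    ordinary day-to-day variation."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

# Recent daily row counts for a hypothetical table.
history = [10_000, 10_200, 9_900, 10_100, 10_050]
```

A fixed threshold such as "alert if count < 10,000" would page an engineer three times for this perfectly normal history; the deviation-based check stays quiet until the count genuinely collapses.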

The Data Observer

Thoughts and trends on data observability