Making Data Work For You: Data Observability Starts with Enterprise Data Quality Management
We’ve learned that at the typical large enterprise, data is treated with insufficient discipline. Enterprise data is ineffectively used during most key decision-making processes because there is a lack of visibility and understanding of available data.
More than 70% of employees have access to data they should not be able to see. And 80% of analysts’ time is devoted to manual data analysis because initial data quality is so poor that the work can’t be automated. These problems will only grow as businesses increasingly rely on data-driven business models to improve their financial results.
Managing enterprise data is a massive topic, and the way to best understand it is through the lens of observability. Data observability provides a structured way of monitoring and managing data at scale across hybrid data lakes and data warehouses. In short, by implementing comprehensive data observability, enterprises can ensure that their data works to help achieve their most critical business objectives.
Data sources and data derivatives
We need to make a broad distinction between the source of data truth and derivatives of that data. For example, marketing, sales and customer service teams use data from CRM and marketing automation systems to track the status of customers and prospects and where they are in the buying journey. Derivations of that information can be created by integrating it with data from other sources. This enables a marketing team, for instance, to create specific messages and custom campaigns for these same prospects and customers, using data from an original source paired with data from other repositories.
At GE, the Finance Data Lake (FDL) team integrated 140 different sources to create a baseline to improve financial operations, including cash flow, accounts receivable/payable, and contracts. FDL provides the financial single source of truth for most GE businesses, and each business can use it for its own operational context.
The importance of enterprise data quality management
What does all of this mean with regard to implementing enterprise data quality management as a process? To answer that, let’s first define three key elements that are important to understanding who is benefiting from data quality and what they’re looking for:
Data producers: These are the applications and processes that produce the critical data that form the basis for an organization’s source of truth. In some cases, that data comes from third-party sources, but irrespective of source, it is the foundation for a type of data. For example, think of employee data as originating from an HR system like Workday that serves as the repository for HR-related employee information.
Data consumers: These are the applications and processes that receive and consume data from various sources and deliver it in a usable format to end users. Examples are LOB applications, management reporting consoles, machine learning and AI applications, and operational reporting in various parts of the organization.
Critical data elements: These are the data that represent customers, locations, stores, or collections. They typically reside long term in data lakes, warehouses, and databases. Here again, consider the key data elements in an HR system, which are the data points found in an employee profile. They might include salary, title, reporting structure, and other data that identify unique attributes of an individual.
In addition, it’s important to understand the different types of data outages and where outages can occur at various stages of the data lifecycle:
- Data errors are produced by faulty data entry from applications, missing streaming and feed information, or application outages.
- Bad data is consumed when there is faulty interpretation or a mismatch between expected data and actual data.
- Processing or transmission errors are most prevalent when data systems are built on top of interconnected data systems where data is in motion or streaming. Such processing errors exist in extract, transform, load (ETL) and change data capture (CDC) processes, as well as consumption from streams.
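The second kind of outage, a mismatch between expected and actual data, can be illustrated with a small check. This is only a minimal sketch: the `FieldExpectation` class, the `find_mismatches` function, and the field names used with them are hypothetical and are not taken from any particular data quality product.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldExpectation:
    """What a consumer expects of one field (hypothetical structure)."""
    name: str
    expected_type: type
    required: bool = True


def find_mismatches(record: dict[str, Any],
                    expectations: list[FieldExpectation]) -> list[str]:
    """Return a description of every gap between expected and actual data."""
    problems: list[str] = []
    for exp in expectations:
        value = record.get(exp.name)
        if value is None:
            # Absent or null: only a problem when the field is required.
            if exp.required:
                problems.append(f"{exp.name}: missing")
            continue
        if not isinstance(value, exp.expected_type):
            problems.append(
                f"{exp.name}: expected {exp.expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems
```

An empty list means the record matches what the consumer expects. A non-empty list is the kind of early signal that can be reported back to the producing team before the record is consumed.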
Testing to ensure quality
Data scientists need a foundation of early warning systems that test quality and conformance across every stage of the data lifecycle. They have to align these systems with testing schedules, and the results of that testing must identify where applications and data repositories are having issues. Applications that might otherwise process faulty data should know when data is no longer consumable, and the group that owns the producing application should act on the problem at once.
Not all data can be passed through rules, but quality checks of large production sample sets are mandatory. Data scientists use augmented data quality platforms to sift through vast quantities of data accumulated from numerous sources. A taxonomy of the data tables, along with an interpretation of the relationships between data sources, is crucial to spreading the use of data across the enterprise, all while retaining the sanctity and perimeter of control for the organization’s sources of truth.
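One simple form of a sample-based quality check is to measure how often a field is missing in a random sample of production rows. The sketch below assumes rows are available as dictionaries; the function name `sampled_null_rate` and its parameters are illustrative only, and a real platform would apply many more checks than this one.

```python
import random
from typing import Any, Mapping, Sequence


def sampled_null_rate(rows: Sequence[Mapping[str, Any]],
                      column: str,
                      sample_size: int,
                      seed: int = 0) -> float:
    """Estimate the share of rows where `column` is absent or None."""
    if not rows:
        raise ValueError("no rows to sample")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    # A fixed seed keeps the check repeatable between runs.
    rng = random.Random(seed)
    # Check everything when the data set is smaller than the sample size.
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    missing = sum(1 for row in sample if row.get(column) is None)
    return missing / len(sample)
```

The returned rate can then be compared with whatever threshold the owning team considers acceptable for that field.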
Data consumers can integrate with the data quality results and outcomes to programmatically run checks and put in circuit breakers. The outcome may be a report to an interested user group or a rerun of the data pipeline, or a rewrite of application logic if something has changed.
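A circuit breaker of this kind can be sketched in a few lines. The example assumes the data quality results arrive as a count of rows checked and rows failed. The names `QualityResult` and `consume_if_healthy`, and the 1% default threshold, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class QualityResult:
    """Outcome of a data quality run for one data set (hypothetical shape)."""
    dataset: str
    rows_checked: int
    rows_failed: int

    @property
    def failure_rate(self) -> float:
        # Assumption: a run that checked nothing is treated as fully failed,
        # since an empty feed is itself a sign of trouble.
        if self.rows_checked == 0:
            return 1.0
        return self.rows_failed / self.rows_checked


def consume_if_healthy(result: QualityResult,
                       consume: Callable[[], None],
                       on_trip: Callable[[QualityResult], None],
                       max_failure_rate: float = 0.01) -> bool:
    """Run `consume` only when quality is acceptable; otherwise trip."""
    if result.failure_rate > max_failure_rate:
        # Circuit open: skip consumption and hand the result to a handler,
        # which might send a report or trigger a pipeline rerun.
        on_trip(result)
        return False
    consume()
    return True
```

Keeping the tripped action as a callback leaves the choice of outcome, whether a report, a rerun, or an escalation, to the consuming team.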
Data observability solves data quality issues
Data observability is an emerging field that allows enterprises to gain a semantic understanding of the underlying data and provides a taxonomy that classifies the data into producers, consumers, and critical data elements. Once the primary sources of truth are identified, the production of that data can be subject to strong validation checks, and advance notice of any failure can be sent to the team responsible for that data.
Creating an enterprise data quality management process requires effective and reliable data observability because it enables data teams to work with large datasets confidently without restricting their use. Enterprise data teams will need to protect their sources of truth while allowing data to proliferate under strict standards, so that network effects benefit the organization.
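The taxonomy of producers, consumers, and critical data elements can be pictured as a small registry that knows which team to alert when validation fails. The following is a rough sketch under assumed names: `CriticalDataElement`, `DataCatalog`, and `failure_notice` are hypothetical, as are the team and consumer names in the usage example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CriticalDataElement:
    """One critical data element and where it sits in the taxonomy."""
    name: str
    producer: str      # system acting as the source of truth
    owner_team: str    # team responsible for the produced data
    consumers: list[str] = field(default_factory=list)


class DataCatalog:
    """Minimal registry of critical data elements (illustrative only)."""

    def __init__(self) -> None:
        self._elements: dict[str, CriticalDataElement] = {}

    def register(self, element: CriticalDataElement) -> None:
        self._elements[element.name] = element

    def failure_notice(self, element_name: str, reason: str) -> dict[str, object]:
        """Describe who to notify and which consumers are affected."""
        element = self._elements[element_name]
        return {
            "notify": element.owner_team,
            "producer": element.producer,
            "affected_consumers": list(element.consumers),
            "reason": reason,
        }
```

For example, registering an `employee.salary` element produced by an HR system, with a management reporting console as a consumer, lets a failed validation check be routed to the owning team together with the list of downstream consumers that should stop relying on the data.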