How are you managing your disparate cloud data sources?

How to Manage and Optimize Cloud Data Sources

The Data Observer
8 min read · Jul 11, 2023


To ensure data quality, accessibility, and reliability, data teams must prioritize effective management of the data sources in their environment. By taking a disciplined approach to implementing, managing, and continuously optimizing cloud data sources, teams can establish standardized processes for data ingestion, integration, and transformation, which fosters consistency and accuracy in the data.

Modern data environments adhere to stringent service level agreements (SLAs) and policies to simplify data operations. However, meeting these SLAs becomes challenging, particularly as new data sources are constantly added and need to be scaled effectively.

Data observability serves as a crucial framework for managing data sources by providing real-time visibility into the health, performance, and quality of data pipelines and sources. It enables data teams to proactively monitor and detect anomalies, errors, and bottlenecks in data ingestion and processing.

Through data observability, teams can ensure the reliability, accuracy, and timeliness of data. They can promptly identify and resolve issues, maintaining the overall integrity of the data environment.

Reducing Complexity When Managing Diverse Data Sources

The evolving data landscape presents enterprises with increasingly diverse and complex data sources. Today, data arrives in various formats and structures from sources such as social media, IoT devices, and external APIs. While these new data sources hold immense potential for generating valuable insights, organizations must incorporate them into their workflows while maintaining performance and reliability benchmarks.

Handling this diversity of sources and formats is complex, even for well-equipped data engineering teams. They need data integration mechanisms that can efficiently handle different types of data. Ensuring data quality and reliability is a further challenge, because accuracy, completeness, and consistency vary from source to source. Managing privacy, security, and compliance also becomes crucial when dealing with data from external sources.

Data teams require data observability when integrating new data sources into their environment to ensure data quality. By closely monitoring the data pipelines associated with these new sources, data observability enables data validation checks and the identification of anomalies or inconsistencies. It empowers teams to rapidly detect and address any issues related to the newly added data sources, maintaining high overall data quality.
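As a rough illustration of what such validation and anomaly checks can look like, the Python sketch below computes a completeness ratio for a batch of records and flags an unusual daily row count using a z-score. The function names, the field names, and the threshold of three standard deviations are assumptions made for this example; they are not ADOC APIs.

```python
from statistics import mean, stdev


def completeness(rows, required_fields):
    """Fraction of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) not in (None, "") for field in required_fields)
        for row in rows
    )
    return complete / len(rows)


def is_volume_anomaly(history, today, z_threshold=3.0):
    """Flag today's row count if it lies more than z_threshold standard
    deviations away from the mean of the historical counts."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold


# Hypothetical batch from a newly added source
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": None},
    {"id": 4, "email": "d@example.com"},
]
print(completeness(rows, ["id", "email"]))
print(is_volume_anomaly([1000, 1020, 980, 1010, 990], today=400))
```

A z-score is only one of many possible anomaly tests; it is used here because it needs nothing beyond the standard library.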

Furthermore, data observability facilitates performance monitoring. As new data sources are integrated, it becomes essential to monitor their data pipeline performance and reliability. Observability enables data engineers to track key metrics such as data latency, throughput, and processing times. By monitoring these metrics, the team can identify bottlenecks or performance issues that could disrupt the smooth flow of data. This, in turn, allows them to optimize the pipelines, ensuring improved efficiency and responsiveness.
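To make these metrics concrete, here is a minimal sketch of how latency, processing time, and throughput could be derived from the timestamps of a single pipeline run. The `PipelineRun` record and its fields are hypothetical and chosen only for illustration; an observability platform would collect equivalent values for you.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PipelineRun:
    event_time: datetime   # when the newest record in the batch was produced at the source
    started_at: datetime   # when the pipeline began processing the batch
    finished_at: datetime  # when the batch became available downstream
    rows: int              # number of rows processed


def run_metrics(run):
    """Derive latency, processing time, and throughput for one run."""
    processing_s = (run.finished_at - run.started_at).total_seconds()
    latency_s = (run.finished_at - run.event_time).total_seconds()
    rows_per_s = run.rows / processing_s if processing_s > 0 else 0.0
    return {
        "latency_s": latency_s,
        "processing_s": processing_s,
        "rows_per_s": rows_per_s,
    }


t0 = datetime(2023, 7, 11, 12, 0, 0)
run = PipelineRun(
    event_time=t0,
    started_at=t0 + timedelta(minutes=5),
    finished_at=t0 + timedelta(minutes=7),
    rows=60_000,
)
print(run_metrics(run))
```

Tracking these values per run over time is what makes a bottleneck visible: a rising latency with stable processing time, for example, suggests the delay is upstream of the pipeline itself.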

How Data Observability Aligns Data Sources with SLAs

SLAs and policies govern the treatment and handling of data from various sources. Data observability plays a crucial role in aligning SLAs with new data sources by providing real-time monitoring and analysis capabilities. Continuous monitoring allows organizations to track the performance, quality, and reliability of data from these sources. Real-time analysis helps identify deviations or anomalies, enabling proactive measures to ensure SLA compliance.

Addressing gaps in SLAs for new data sources is another critical aspect of data observability. By closely monitoring data from these sources, organizations can compare actual performance against defined SLAs. Any discrepancies can be quickly identified and addressed through optimization of data pipelines, improvement of data quality processes, or enhancements to infrastructure.
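The comparison of actual performance against defined SLAs can be sketched in a few lines of Python. The SLA format used here (a metric name mapped to a `"max"` or `"min"` bound) and the metric names are assumptions for this example, not a format defined by any particular tool.

```python
def sla_gaps(sla, observed):
    """Return a description of every SLA target that is breached or unmeasured.

    sla:      {metric: ("max" | "min", limit)}
    observed: {metric: measured value}
    """
    gaps = {}
    for metric, (kind, limit) in sla.items():
        value = observed.get(metric)
        if value is None:
            gaps[metric] = "no measurement"
        elif kind == "max" and value > limit:
            gaps[metric] = f"{value} exceeds maximum {limit}"
        elif kind == "min" and value < limit:
            gaps[metric] = f"{value} is below minimum {limit}"
    return gaps


# Hypothetical SLA for a newly added source
sla = {
    "latency_s": ("max", 900),
    "completeness": ("min", 0.99),
    "freshness_h": ("max", 24),
}
observed = {"latency_s": 420, "completeness": 0.97}

for metric, problem in sla_gaps(sla, observed).items():
    print(f"{metric}: {problem}")
```

Reporting a missing measurement as a gap is a deliberate choice in this sketch: a new source with an SLA but no monitoring is itself a discrepancy worth surfacing.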

One tool that provides comprehensive data observability is the Acceldata Data Observability Cloud (ADOC). It helps organizations identify trends and patterns in data from new sources so they can make informed adjustments to SLAs. By analyzing the behavior and characteristics of these sources, organizations can refine SLAs to accommodate variations in data volume, velocity, or quality.

This level of comprehensive data observability empowers organizations to establish feedback loops between data teams and stakeholders. Continuous communication and collaboration ensure ongoing alignment of SLAs with evolving data sources. Regular evaluations and feedback on the performance of new data sources help make necessary adjustments to SLAs, ensuring consistency.

However, maintaining consistency across all data sources is a challenge. Additionally, data environments are not confined to simple, binary approaches. Therefore, data teams need the flexibility to make changes to policies when appropriate and at scale. ADOC encompasses the necessary capabilities for data teams to continuously monitor the data environment, detect issues where data activity deviates from SLAs and policies, and rapidly modify SLAs when needed.

Establishing and Managing Data Sources in Acceldata

ADOC ensures the uninterrupted operation of your cloud environment through continuous monitoring, supported by two key features: Data Reliability and Data Compute. These features empower data teams to assess data quality, track infrastructure costs, and monitor other essential requirements. By setting thresholds for critical components, you can receive timely alerts when these thresholds are met or exceeded. ADOC offers comprehensive monitoring and management capabilities to optimize the performance and stability of your cloud environment.
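The threshold idea can be pictured with a small, generic Python sketch. The metric names and limits below are hypothetical, and the function is not part of ADOC; it only shows the "met or exceeded" comparison that drives an alert.

```python
def breached(metrics, thresholds):
    """Return the names of metrics whose current value meets or exceeds
    its configured threshold. Metrics without a reading are skipped."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] >= limit
    )


# Hypothetical thresholds for a cloud data source
thresholds = {"daily_cost_usd": 500, "failed_jobs": 3, "queue_wait_s": 60}
metrics = {"daily_cost_usd": 500, "failed_jobs": 1, "queue_wait_s": 95}

for name in breached(metrics, thresholds):
    print(f"ALERT: {name} = {metrics[name]} (threshold {thresholds[name]})")
```

Note the `>=` comparison: a value that exactly equals its threshold counts as a breach, which matches the "met or exceeded" wording.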

The Data Reliability capability in Acceldata focuses on ensuring high-quality data within your systems. It allows you to establish policies aligned with your organization’s requirements, enabling you to maintain data integrity, certify assets, and perform other relevant functions.

Acceldata’s compute and infrastructure functionality provides operational intelligence by offering a visual representation of data from your Data Source. It presents a graphical view of crucial metrics related to your environment. You can configure alerts based on these metrics, triggering incidents when they surpass defined thresholds.

It’s important to note that the availability of Data Reliability and Compute capabilities may vary across different Data Sources. While some Data Sources support both capabilities, others may support only one. ADOC currently supports multiple Data Sources, each with its respective set of capabilities.

The table below provides an overview of the currently supported data sources in ADOC and the associated capabilities for each source:

Cloud data sources in Acceldata

The table shows which of the Data Reliability and Compute capabilities each Data Source supports, so you can see the specific functionality available for each one.

Please note that the table is a reference and covers only a subset of the data sources compatible with ADOC’s capabilities.

How to Add Data Sources in Acceldata

Adding and configuring data sources in ADOC is a straightforward process. Once a data source is configured, ADOC monitors it continuously and offers comprehensive insights into its usage and other pertinent information.

With ADOC, you have the flexibility to create multiple instances of the same Data Source. For example, if you have three Snowflake accounts, you can create three separate instances of the Snowflake Data Source, with each instance representing a specific account. This approach allows for organized and efficient management of multiple data sources. The image below provides a visual representation of this concept, highlighting the clear organization and differentiation of the various data source instances.

By enabling multiple instances of data sources, ADOC empowers you to seamlessly manage and monitor your data environment, ensuring optimal performance and data reliability.
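One way to picture the multiple-instance model is as a list of named records that share a source type. The sketch below is a hypothetical data structure written for this article, not ADOC's internal representation; the instance names and account identifiers are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSourceInstance:
    name: str         # unique label for this instance
    source_type: str  # e.g. "snowflake" or "databricks"
    account: str      # hypothetical account identifier


def duplicate_names(instances):
    """Names used by more than one instance; each instance should be unique."""
    counts = Counter(inst.name for inst in instances)
    return sorted(name for name, count in counts.items() if count > 1)


def filter_by_type(instances, source_type):
    """All instances of one source type, e.g. every Snowflake account."""
    return [inst for inst in instances if inst.source_type == source_type]


instances = [
    DataSourceInstance("snowflake-finance", "snowflake", "acct_finance"),
    DataSourceInstance("snowflake-marketing", "snowflake", "acct_marketing"),
    DataSourceInstance("snowflake-ops", "snowflake", "acct_ops"),
    DataSourceInstance("databricks-ml", "databricks", "workspace_ml"),
]

for inst in filter_by_type(instances, "snowflake"):
    print(inst.name, "->", inst.account)
```

Keeping one record per account means each instance can be monitored, filtered, and alerted on separately, even though all three share the same source type.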

Setup requires just a few simple steps:

  • Click on the Data Sources tab.
  • Click Add Data Source. The “Select the Data Source type” wizard is displayed.
  • Select a Data Source.

Adding cloud data sources in Acceldata

Please note: links are provided at the end of this article for specific data source documentation.

In ADOC, you have the convenience of easily filtering data sources to obtain a view that is as comprehensive or specific as required. The process of applying filters is straightforward and swift, as demonstrated below:

Acceldata offers comprehensive functionality to manage all aspects of your data sources that have been added to ADOC. This includes the following capabilities, with detailed information available in our documentation:

Editing configurations: You can easily modify configurations for your data sources, such as adjusting settings for specific workspaces within your Databricks environment.

Editing cloud data source configurations

Crawling cloud data sources: To initiate the crawling process on your data source and gain insights into its schema, you can utilize the “Start Crawler” option available in ADOC, as shown in the image below:

By starting the crawler, you enable ADOC to analyze the schema of the asset. This information serves as the foundation for implementing schema checks, which play a vital role in establishing profiling and data quality policies. The crawling process also facilitates cataloging of data sources, enabling their effective utilization within the platform.
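To show why a crawled schema is useful as a foundation for schema checks, the sketch below compares an expected schema with a crawled one and reports missing columns, new columns, and type changes. The dictionary representation of a schema and the column names are assumptions for this example; they do not reflect how ADOC stores crawl results.

```python
def schema_diff(expected, crawled):
    """Compare two {column_name: type_name} mappings."""
    missing = sorted(set(expected) - set(crawled))
    added = sorted(set(crawled) - set(expected))
    type_changed = {
        col: (expected[col], crawled[col])
        for col in expected.keys() & crawled.keys()
        if expected[col] != crawled[col]
    }
    return {"missing": missing, "added": added, "type_changed": type_changed}


# Hypothetical schemas for one table
expected = {"id": "NUMBER", "email": "VARCHAR", "created_at": "TIMESTAMP"}
crawled = {
    "id": "NUMBER",
    "email": "VARCHAR",
    "created_at": "VARCHAR",  # type drifted
    "source": "VARCHAR",      # new column
}

print(schema_diff(expected, crawled))
```

A non-empty result is the kind of signal a schema check would raise before downstream profiling or data quality policies run against the asset.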

To review the status of the crawl activity, simply click on the Data Source card within ADOC. This action provides visibility into the date of the last crawl. In cases where you haven’t initiated a crawl for a specific data source since adding it to ADOC, a message will indicate that it has never been crawled.

This visibility lets you track crawl activity across your data sources in ADOC and stay informed about their latest status.

Crawling cloud data sources
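The crawl status logic described above, a last-crawl date or a "never crawled" message, can be sketched as follows. The function name and message wording are assumptions for illustration and do not reproduce the exact text ADOC displays.

```python
from datetime import datetime


def crawl_status(last_crawled_at):
    """Human-readable crawl status for a data source card.

    last_crawled_at is a datetime, or None if no crawl has ever run.
    """
    if last_crawled_at is None:
        return "Never crawled"
    return f"Last crawled on {last_crawled_at:%Y-%m-%d}"


print(crawl_status(None))
print(crawl_status(datetime(2023, 7, 1, 9, 30)))
```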

Service monitoring for data sources: Starting with version 2.7.0, ADOC includes a service monitoring feature that gives you access to real-time logs generated by the analysis service running on the dataplane side. For data sources, this means convenient access to the ongoing logs, specifically those produced by the crawlers.

With this feature, you can see the operational status and activity of both the analysis service and the data source crawlers, giving you greater visibility into and control over how your data sources behave.

Real-time log access also supports efficient troubleshooting and optimization. You can identify issues or anomalies proactively, resolve them more quickly, and act in time to keep your data sources running smoothly.

Notes

To learn how to add a specific Data Source, refer to the following links.

Photo by note thanun on Unsplash
