Simplifying Databricks Cluster Management and Enhancing Performance

The Data Observer
Jul 26, 2023

Enterprises today place major importance on data reliability and quality as they prioritize development of data products and services. To achieve these goals at scale, data observability has emerged as a crucial strategy. In this pursuit, leveraging the Acceldata Data Observability platform proves to be an optimal choice, offering a robust framework for ensuring data reliability, optimizing performance, and managing costs. Particularly for enterprises utilizing Databricks, the integration with Acceldata sets the stage for unparalleled operational observability into their Apache Spark deployments.

Databricks is the leading cloud platform for Spark, giving users the ability to manage clusters and deploy Spark applications in cloud environments. Apache Spark itself is an open-source unified analytics engine for large-scale data processing; it serves users through expressive APIs in Scala and Python as well as more approachable interfaces like Spark SQL, enabling seamless handling of petabyte-scale data. What truly sets Databricks apart, and entices data teams, is its ability to launch and manage Spark clusters across all three major cloud platforms.

Given Spark's significance in modern data stacks, cluster and job configuration is one of the most impactful levers for managing and optimizing data job performance. However, for those not well-versed in Spark, configuring clusters and jobs for maximum efficiency can be a challenging task, and that gap in expertise can lead to poor performance, operational inefficiencies, and higher cloud costs. Commonly observed problems include:

  • Absence of cost transparency and governance, resulting in unexpected expenses.
  • Underutilized, inefficient, or unhealthy clusters and workflows.
  • Prolonged time spent debugging workflows.
  • Unadministered, fragile account configurations, coupled with limited knowledge of newer technologies.
  • Lack of data layout/partitioning and usage analytics.

Addressing these challenges, Acceldata steps in to support enterprise data teams in observing their clusters and job performance, while also facilitating the implementation of data reliability techniques for Delta Lake on Databricks.

The latest release, version 2.7.0 of Acceldata Data Observability Cloud, introduces several new capabilities tailored for Databricks, including Databricks guardrails, a Databricks Query Studio akin to Snowflake Query Studio, optimized Databricks navigation, and a host of other features.

In this blog, we will demonstrate how to seamlessly deploy Acceldata to your Databricks cluster, highlighting essential benefits, such as:

  • 360-degree observability for all facets and functionalities of the Databricks data platform.
  • A comprehensive, unified view of incurred costs, encompassing both Databricks-specific and cloud vendor expenses.
  • Reduced mean time to resolution (MTTR) for broken jobs, enhancing operational efficiency.
  • Access to actionable insights, anomaly detection, alerting, reporting, and guardrail recommendations, enabling robust administration of Databricks accounts.

With Acceldata’s Data Observability for Databricks Performance Optimization, data teams can unlock the full potential of their clusters, ensuring seamless performance and reliable data processing for their enterprise needs.

Connect Databricks to Acceldata

With the user’s convenience and data security in mind, the Acceldata Data Observability platform offers a straightforward, secure way to register a Databricks data source. During registration, users are prompted to enter Databricks connection details and the necessary credentials, and can choose to enable compute observability, data reliability, or both.
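
If you want to sanity-check those connection details before registering them, a quick call to the Databricks REST API does the job. This is a minimal sketch: the host and token values are placeholders, and the Clusters API endpoint is used only because it is a cheap, read-only call.

```python
import requests

# Connection details a Databricks data source registration typically needs.
# The values below are illustrative placeholders, not real credentials.
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
DATABRICKS_TOKEN = "dapi..."  # a personal access token

# Sanity-check the credentials: the Clusters API answers 200
# if the host and token are valid.
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(f"Token OK, {len(resp.json().get('clusters', []))} clusters visible")
```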

Once registered, Acceldata Data Observability unleashes its visualization capabilities, providing valuable insights into the following categories:

  • Spend Tracking: Gain a clear and transparent view of your data-related expenses, helping you manage costs effectively.
  • Cluster or Compute Usage: Monitor and analyze the utilization of clusters or compute resources, ensuring efficient resource allocation.
  • Workflow Debugging: Identify and resolve issues in your data workflows promptly, reducing downtime and increasing productivity.
  • Data Insights: Unlock valuable insights from your data, enabling data-driven decision-making and optimizing overall data performance.

With Acceldata Data Observability, you can confidently manage your Databricks data source, leverage actionable visualizations, and ensure seamless and reliable data operations throughout your enterprise.

Spend and Cost Tracking for Databricks

Within the discipline of spend tracking, the Acceldata platform offers comprehensive visibility across various crucial dimensions of Databricks. This includes insights into cluster types, workspaces, node types, and a breakdown of costs between Databricks and your cloud vendor for the data plane. Such a holistic view enables you to pinpoint the key drivers behind the costs associated with specific Databricks resource types.

By providing these top-level views of your Databricks environment, Acceldata empowers you to identify areas that require deeper optimization efforts. These insights serve as a starting point, guiding you towards understanding where to focus your efforts to achieve greater cost efficiency and resource optimization within your Databricks setup.

Spend tracking for Databricks

Databricks costs can be allocated to various units within an organization for efficient management. Leveraging Databricks tags enables automatic resource assignment to specific organization units. Once these units are configured, historical cost analysis categorized by organization units becomes readily available, with cost areas prominently highlighted. This enables swift optimization analysis, allowing you to prioritize organization units for optimization efforts based on their cost implications. With this approach, you can streamline cost allocation and resource utilization, ensuring that each unit’s performance aligns with the organization’s overarching goals.
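
As a sketch of how this works on the Databricks side, custom tags can be attached to a cluster at creation time through the Clusters API; Databricks propagates them to billing data and the underlying cloud resources, so costs can later be grouped by organization unit. The host, token, and tag names below are illustrative assumptions, not a prescribed convention.

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
DATABRICKS_TOKEN = "dapi..."  # illustrative placeholder

# Cluster spec with custom tags; the org_unit and cost_center keys are
# example tag names for attributing spend to organization units.
cluster_spec = {
    "cluster_name": "marketing-etl",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "custom_tags": {
        "org_unit": "marketing-analytics",
        "cost_center": "cc-1042",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=cluster_spec,
    timeout=30,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```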

Databricks cost analysis by organizational unit

Cluster and Compute Usage and Management for Databricks

In the cluster tracking feature, you gain access to comprehensive dashboards that offer valuable insights into the state statistics and potential failures of your clusters. Given the intricate control plane/data plane architecture and the distributed nature of the system, failures can occur, making it crucial to have a clear overview.

Acceldata’s intelligent agents, deployed as part of the solution, collect and correlate system-level metrics. This empowers you to identify inefficient or wasteful resource utilization within your workspace and pinpoint the users responsible. Often, clusters may encounter failures due to reasons like improper sizing, cloud provider resourcing, or permission issues. Acceldata promptly notifies you of such failures, allowing you to remediate the issues swiftly, ensuring seamless cluster performance and optimal resource utilization. With these capabilities at your disposal, you can proactively maintain cluster health and improve the overall efficiency of your Databricks environment.
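
For a sense of the raw signal behind these notifications, the Databricks Clusters API reports each cluster's state along with a termination reason code. A minimal sketch, with placeholder host and token:

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
DATABRICKS_TOKEN = "dapi..."  # illustrative placeholder

resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

# Surface clusters that did not terminate cleanly, along with the reason
# code Databricks records (e.g. cloud capacity or permission problems).
for cluster in resp.json().get("clusters", []):
    reason = cluster.get("termination_reason", {})
    if cluster.get("state") == "TERMINATED" and reason.get("code") not in (
        None, "USER_REQUEST", "INACTIVITY",
    ):
        print(cluster["cluster_name"], "->", reason.get("code"))
```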

Databricks cluster management in Acceldata

Optimizing cluster rightsizing is paramount for achieving both peak performance and cost-efficiency. By referring to the cluster usage charts, you can readily identify wasted cores and memory resources within your clusters. These visualizations offer a clear picture of the resources that are underutilized, allowing you to take timely action for better resource allocation.

Moreover, Acceldata provides a comprehensive list of observed clusters, allowing you to apply various filters to analyze and dissect the data. This feature aids in building a profound understanding of cluster configurations and usage patterns. Armed with this valuable information, you can make informed decisions to fine-tune your clusters and ensure they are performing at their best, both in terms of performance and cost-effectiveness.
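
As a toy illustration of the rightsizing logic such charts support, consider flagging clusters whose average utilization leaves most of their capacity idle. The sample data and the 40% threshold below are hypothetical; Acceldata derives the real numbers from agent-collected metrics.

```python
# Hypothetical utilization samples; in practice these come from
# system-level metrics collected per cluster.
samples = [
    {"cluster": "marketing-etl", "cores": 32, "avg_cpu_pct": 22,
     "mem_gb": 256, "avg_mem_pct": 31},
    {"cluster": "nightly-batch", "cores": 64, "avg_cpu_pct": 78,
     "mem_gb": 512, "avg_mem_pct": 85},
]

for s in samples:
    wasted_cores = s["cores"] * (1 - s["avg_cpu_pct"] / 100)
    wasted_mem = s["mem_gb"] * (1 - s["avg_mem_pct"] / 100)
    if s["avg_cpu_pct"] < 40:  # illustrative downsizing threshold
        print(f"{s['cluster']}: ~{wasted_cores:.0f} idle cores, "
              f"~{wasted_mem:.0f} GB idle memory; candidate for downsizing")
```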

Rightsizing Databricks clusters

Databricks Insights

Acceldata delivers essential insights regarding the Databricks File System (DBFS), offering a comprehensive view of critical information such as the cost of cloud object storage, the total number and size of tables, and updates occurring within these data assets. Understanding the data lake costs associated with storage and API calls made to cloud vendors is of utmost importance for effective cost management and resource optimization.

With Acceldata’s in-depth analysis of DBFS, you can gain valuable visibility into your data assets, enabling you to make informed decisions to optimize storage, minimize unnecessary API calls, and ensure cost-effectiveness within your data lake environment. By harnessing this information, you can efficiently manage your cloud object storage expenses and better utilize resources while maintaining the performance and reliability of your data assets.
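
On the Spark side, the raw ingredients for these storage insights are easy to see: DESCRIBE DETAIL on a Delta table reports its file count and storage footprint directly. The table name below is an illustrative placeholder.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook, a session already exists; getOrCreate reuses it.
spark = SparkSession.builder.getOrCreate()

# DESCRIBE DETAIL returns one row of table metadata, including the number
# of data files and their total size in bytes.
detail = spark.sql("DESCRIBE DETAIL sales.orders").collect()[0]
print(f"{detail['name']}: {detail['numFiles']} files, "
      f"{detail['sizeInBytes'] / 1e9:.2f} GB")
```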

Acceldata insights for Databricks environments

Databricks Workflow Debugging

Acceldata presents valuable workflow-level insights and recommendations, consolidating essential metrics (database, CPU, memory, and disk) in a scrubbed, correlated view within a single location. The platform lets you delve into the historical trend of job executions, facilitating side-by-side comparisons of different runs. By automatically generating lineage charts for intricate Databricks workflows, Acceldata simplifies job understanding and debugging, significantly reducing mean time to resolution (MTTR) for observed issues.

As Databricks jobs grow in complexity with contributions from larger data engineering teams over time, grasping the job’s actual functionality can become challenging. Acceldata addresses this by providing a comprehensive graph view, describing the job’s designed tasks and all the steps it takes to execute them effectively.

Monitoring resource metrics at the stage level is crucial for understanding how a job processes data. A Databricks job is divided into Spark stages, and each stage is a set of tasks that perform the same computation in parallel across worker nodes. Each task processes a partition of the data, and Spark initiates shuffle operations whenever data must move between workers. Identifying stages that perform poorly relative to the others is essential for performance optimization.
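
A small PySpark example makes the stage boundary concrete: the filter below runs within the existing input partitions, while the groupBy forces a shuffle and starts a new stage. The table names are illustrative placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.table("sales.orders")
daily = (
    orders
    .filter("status = 'COMPLETE'")  # narrow: runs per input partition
    .groupBy("order_date")          # shuffle: repartitions by order_date
    .count()                        # new stage aggregates shuffled data
)
daily.write.mode("overwrite").saveAsTable("sales.daily_order_counts")
```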

Acceldata simplifies this process by offering a stage-by-stage breakdown, sorted by time taken, and detailed metrics related to each stage. This comprehensive view empowers you to optimize your job’s performance and resource utilization, leading to enhanced efficiency and better insights into the overall workflow execution.
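
If you want to poke at the same stage-level data yourself, Spark's monitoring REST API exposes per-stage metrics. A minimal sketch, assuming direct access to the driver's UI port (on Databricks the Spark UI is proxied through the workspace, so the base URL would differ):

```python
import requests

BASE = "http://localhost:4040/api/v1"  # driver UI port; adjust for your setup

app_id = requests.get(f"{BASE}/applications", timeout=30).json()[0]["id"]
stages = requests.get(f"{BASE}/applications/{app_id}/stages", timeout=30).json()

# Rank stages by executor run time to find the slowest ones first.
slowest = sorted(stages, key=lambda s: s.get("executorRunTime", 0), reverse=True)
for stage in slowest[:5]:
    print(f"stage {stage['stageId']}: {stage['executorRunTime']} ms, "
          f"shuffle read {stage.get('shuffleReadBytes', 0)} bytes")
```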

Data Reliability for Databricks

Now, let’s delve into the critical aspect of data reliability. As the saying goes, “garbage in, garbage out.” Data is the lifeblood of data-driven enterprises, and organizations that rely heavily on data for decision-making, whether through dashboards or machine learning models, cannot afford erroneous or unreliable data. Data failures can manifest in various ways, and data engineers must take a proactive approach to identify and resolve these issues swiftly.

Ensuring data reliability is paramount to maintaining the integrity of analytical insights and the accuracy of machine learning predictions. By actively addressing data issues, data engineers safeguard the foundation upon which critical business decisions and processes are built. Swiftly fixing data-related problems is crucial in upholding the credibility of data-driven operations and promoting seamless functionality throughout the enterprise.

Data cadence in Databricks

With thousands of data assets at our disposal, determining which ones to focus on can be a daunting task. To simplify this, we categorize and prioritize data assets based on their usage and reliability. Although the ideal scenario is for all data to be of high quality, understanding usage patterns and data volume allows us to segment and prioritize data for running data quality rules.

Data cadence dashboards offer views of data asset timeliness: the freshness of updates on asset tables over time, significant delays identified against historical patterns, and more.
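
For Delta tables, the underlying freshness signal is readily available: DESCRIBE HISTORY records a timestamp for every commit. A minimal sketch of a staleness check, where the table name and 24-hour threshold are illustrative choices:

```python
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# DESCRIBE HISTORY lists every commit to a Delta table with a timestamp;
# the most recent commit tells us how fresh the asset is.
history = spark.sql("DESCRIBE HISTORY sales.orders")
last_update = history.agg(F.max("timestamp")).first()[0]

# Naive comparison: assumes the session timezone matches the wall clock.
if last_update < datetime.now() - timedelta(hours=24):
    print(f"sales.orders looks stale: last commit at {last_update}")
```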

Monitoring the volume of incoming data is another crucial aspect. Volume can be observed at the row level or by file size, depending on the user and asset type. Acceldata simplifies this process by automatically detecting and profiling the most frequently used tables in Delta Lake; profiling can also be configured manually to suit specific preferences. Data profiling lets you grasp the data distribution across columns, aiding in understanding data characteristics.
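
A quick way to see the kind of distribution data a profile captures is PySpark's built-in summary(); Acceldata automates and scales this across the most-used tables. The table name below is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Per-column count, mean, stddev, min/max, and quartiles: the raw
# distribution statistics a data profile is built from.
df = spark.table("sales.orders")
df.summary("count", "mean", "stddev", "min", "25%", "50%", "75%", "max").show()
```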

Built to scale with your data, Acceldata ensures efficient profiling even for substantial datasets. For instance, a data profile for three million rows can be generated in just 90 seconds. With the data profile in place, Acceldata seamlessly detects anomalies within your dataset, empowering you to apply data quality policies, data reconciliation policies, detect schema drift and data drift, inspect data lineages, and much more. By leveraging Acceldata’s capabilities, you can gain comprehensive insights into your data’s integrity and take proactive steps to maintain high-quality, reliable data assets.
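
As a minimal sketch of the kind of rule a profile enables, here is a null-rate check against a profiled baseline; the table, column, and 2% baseline are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.table("sales.orders")  # illustrative table name
total = df.count()
nulls = df.filter(F.col("customer_id").isNull()).count()
null_rate = nulls / total if total else 0.0

# Hypothetical baseline from a previous profile; flag a 3x drift.
BASELINE_NULL_RATE = 0.02
if null_rate > 3 * BASELINE_NULL_RATE:
    print(f"Anomaly: customer_id null rate {null_rate:.1%} "
          f"vs profiled baseline {BASELINE_NULL_RATE:.0%}")
```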

Data pipeline insights

Data pipelines are the backbone of continuous data management undertaken by data teams. The primary objective is to ensure the pipelines remain healthy, operational, and cost-efficient. In this regard, the visualization presented here offers a comprehensive view of the data pipeline, illustrating the various assets it interacts with and the compute operations performed on them.

By examining this view, you gain insights into the health of data assets and the time taken for the logic to execute, contributing to the overall runtime of the pipeline. This visibility allows you to monitor the performance and efficiency of your data pipelines effectively.

Additionally, Acceldata empowers you to set up alerts based on pipeline health, runtime, or other relevant metrics. These alerts enable proactive monitoring and timely actions to address any potential issues, ensuring smooth and uninterrupted data flow through your pipelines. By focusing on pipeline health and efficiency, data teams can maintain a robust data infrastructure that meets business needs while optimizing operational costs.
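
To make the idea concrete, here is a hypothetical shape for such alert rules; Acceldata configures alerts through its own interface, so this layout is purely illustrative:

```python
# Hypothetical alert rules for pipeline health; not Acceldata's actual
# configuration format.
alert_rules = [
    {"pipeline": "daily_sales_load", "metric": "runtime_minutes",
     "threshold": 45, "notify": "data-eng@example.com"},
    {"pipeline": "daily_sales_load", "metric": "failed_tasks",
     "threshold": 0, "notify": "data-eng@example.com"},
]

def breached(rule: dict, observed: float) -> bool:
    """True when an observed metric value exceeds the rule's threshold."""
    return observed > rule["threshold"]

print(breached(alert_rules[0], observed=52))  # True: 52 min > 45 min threshold
```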
