How to Use Query History Metadata to Optimize Data Queries in Databricks and Snowflake

The Data Observer
8 min read · Jun 15, 2023


Cloud data platforms such as Snowflake or Databricks are experiencing a surge in popularity among organizations seeking to handle large-scale data workloads. These platforms offer robust and flexible solutions capable of managing millions of queries and data jobs, enabling the processing, analysis, and management of extensive datasets. As data teams strive to find more efficient and effective approaches to query management and optimization, these platforms have become invaluable.

These data workloads span a wide range of types, depending on an organization's specific use cases. Common examples include customer segmentation and personalization, fraud detection, predictive maintenance, supply chain optimization, financial planning and analysis, human resources analytics, risk management, and marketing analysis. To better understand them, these workloads can be grouped into the following categories:

  • Batch ETL (Extract, Transform, Load): This workload entails extracting vast volumes of data from diverse sources, transforming it into a suitable format for analysis, and loading it into a data warehouse or another storage system. It is typically scheduled to run regularly, often on a daily or weekly basis.
  • Exploratory workloads: This type of workload focuses on using data to uncover new insights and patterns. It is commonly performed by analysts or data scientists working in languages like SQL or R.
  • Interactive workloads: These workloads use data to answer specific questions or support real-time decisions. Business users typically engage with them through tools such as dashboards or reports.

Good Queries, Bad Queries, and Too Much of a Good Thing

Poorly written queries can have a substantial impact on costs within cloud data platforms for several reasons. First, inadequately optimized queries may require larger clusters or warehouses to execute, driving up expenses. They can also increase the amount of data scanned or processed, resulting in longer query durations and, in turn, higher costs.

The cost impact of bad queries is also shaped by how many queries are sent to the platform. An ill-designed application that generates an excessive number of unnecessary queries, for instance, can escalate costs through sheer query volume.

To mitigate these issues, data teams have to prioritize query optimization and minimize the number of unnecessary queries sent to the platform. This can be achieved through fairly standard techniques like indexing, caching, and other approaches aimed at improving query performance. Many data engineers also employ monitoring tools that help identify and troubleshoot poorly performing queries, ultimately reducing costs. Not a bad strategy.
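
Snowflake, for example, has no conventional indexes on standard tables; clustering keys and result caching play the analogous role. A minimal sketch, assuming a hypothetical sales.orders table:

    -- Define a clustering key so Snowflake can prune micro-partitions
    -- on the columns most queries filter by (table name is hypothetical)
    ALTER TABLE sales.orders CLUSTER BY (order_date, region);

    -- Result caching is on by default; re-enable it for a session if it was
    -- switched off, so repeated identical queries return cached results
    ALTER SESSION SET USE_CACHED_RESULT = TRUE;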

If you’re looking at the long-term impact of how you manage queries (and you should be), consider that bad queries can significantly amplify costs on cloud data platforms, a phenomenon often referred to as the “too much of a good thing” or “TMGT” effect. By consuming additional resources, increasing data transfer, and augmenting the query load on the platform, these queries contribute to higher expenses.

To mitigate these expenses, it is crucial to optimize queries, reduce unnecessary queries, and leverage monitoring tools to address poorly performing queries effectively.

Query Optimization Starts With Guardrails

To safeguard against the inefficiencies of the ad-hoc queries that are common in exploratory workloads, data teams should continuously monitor their compute resources using features such as guardrails. Cloud data platforms like Snowflake and Databricks offer basic capabilities in this regard, but they are optional and customers often overlook them. The Acceldata Data Observability Platform raises alerts and notifications when these guardrails are absent, ensuring better visibility and proactive management.

The guardrails incorporated into the data observability platform empower data teams to accomplish the following:

  • Ensure adherence to administration best practices: Acceldata facilitates the identification and notification of violations through our comprehensive dashboards, alerts, and recommendations.
  • Enforce best practices by taking action against limit violations: The Acceldata platform offers a range of mechanisms, such as resource monitors, cluster policies, and statement timeouts, to proactively address and mitigate violations.

In the absence of such features, data teams run the risk of allowing cloud data platforms to autonomously scale up resources, potentially leading to unanticipated spikes in cloud costs.
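
What those guardrails look like in practice varies by platform. As a minimal sketch on the Snowflake side (the warehouse and monitor names are hypothetical), a resource monitor plus a statement timeout already puts a ceiling on runaway spend:

    -- Cap monthly credit consumption on an analytics warehouse and suspend it
    -- once the quota is exhausted (names are illustrative)
    CREATE RESOURCE MONITOR analytics_rm
      WITH CREDIT_QUOTA = 100
      TRIGGERS ON 80 PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND;

    ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = analytics_rm;

    -- Stop any single runaway query after one hour
    ALTER WAREHOUSE analytics_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;

On Databricks, the comparable controls are cluster policies and auto-termination settings, defined as JSON rather than SQL.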

How to Get Query Insights for Data Warehouses

The majority of workloads executed on data platforms consist of repetitive jobs or queries originating from scheduled ETL pipelines, interactive business intelligence (BI) tools, and similar applications. These workloads are characterized by their recurring nature, driven by scheduled executions or user interactions with BI tools.

Given the sheer volume of executed jobs or queries within a data platform, it becomes exceedingly challenging to observe and manage them at the individual query level. This is where techniques such as query/job fingerprinting come into play, enabling the grouping of similar queries or jobs. By comparing SQL statements and their parameters, query fingerprinting identifies queries with similar characteristics.
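
As a deliberately rough illustration of the idea, literals can be stripped from the statement text in Snowflake's query history so that otherwise identical statements collapse into one group; production fingerprinting is more sophisticated than this sketch:

    -- Rough fingerprint: lowercase the statement, replace string and numeric
    -- literals with '?', and hash the result
    SELECT
      MD5(
        REGEXP_REPLACE(
          REGEXP_REPLACE(LOWER(query_text), $$'[^']*'$$, '?'),  -- string literals
          '[0-9]+', '?')                                        -- numeric literals
      ) AS query_fingerprint,
      COUNT(*) AS executions
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    GROUP BY query_fingerprint
    ORDER BY executions DESC;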

Grouping similar queries allows for more streamlined analytics and insights into the overall performance of the data platform. Instead of analyzing each query individually, analyzing the group as a whole reveals patterns, trends, and areas that require optimization.

The ability to group similar queries or jobs also facilitates prioritization of optimization efforts based on factors like dominance or costs. For instance, if a specific group of queries accounts for a significant portion of resource usage, optimizing those queries becomes a higher priority.

Presenting aggregate metrics for query groups enhances the consumability and actionability of the data. Stakeholders can easily understand the information and make informed decisions based on the insights provided.
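
For instance, newer Snowflake accounts expose a built-in query_parameterized_hash column in the query history view; assuming it is available, group-level metrics roll up with a single aggregation (the 30-day window and the ranking by total elapsed time are arbitrary choices):

    -- Top query groups by total execution time over the last 30 days
    SELECT
      query_parameterized_hash              AS fingerprint,
      ANY_VALUE(query_text)                 AS sample_query,
      COUNT(*)                              AS executions,
      AVG(total_elapsed_time) / 1000        AS avg_elapsed_s,
      SUM(bytes_scanned) / POWER(1024, 3)   AS gb_scanned
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY fingerprint
    ORDER BY SUM(total_elapsed_time) DESC
    LIMIT 20;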

Data teams need to be able to understand, debug, and optimize queries and warehouses, using tactics such as a measure-understand-optimize (MUO) cycle to improve query performance, efficiency, and cost. There are typically three main steps in the process, which I explain below.

Acceldata offers the Query Studio for Snowflake, which gives data teams numerous advantages in understanding, debugging, and optimizing queries and warehouses. The feature encompasses a diverse array of functionality to streamline the query process.

Aligned with Acceldata’s data observability principle, the Query Studio follows the measure-understand-optimize (MUO) cycle to enhance query performance, efficiency, and cost-effectiveness. The cycle comprises three key steps:

Measure: The initial step involves assessing the performance and utilization of both the data platform and the executed queries within it.

Access to comprehensive records of all queries performed in your Snowflake environment proves highly beneficial when troubleshooting issues. Analyzing these queries can unveil usage patterns across various warehouses and users, shedding light on potential areas for enhancing the efficiency and cost-effectiveness of your Snowflake environment.

Because a large number of queries can execute within any given timeframe, the ability to filter them by status, warehouse name, query type, database, and other categories is invaluable for narrowing down the set of queries that requires examination. Displaying the query count alongside each filter name also lets users understand the query set from different perspectives, making it a valuable tool for identifying potential issues.
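
A minimal sketch of the same idea directly against Snowflake's query history metadata, counting queries per warehouse, status, and type over the last day:

    -- Query counts sliced by the filter dimensions described above
    SELECT
      warehouse_name,
      execution_status,
      query_type,
      COUNT(*) AS query_count
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name, execution_status, query_type
    ORDER BY query_count DESC;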

During the optimization process, it is vital for administrators and centers of excellence (CoE) to verify the adherence to best practices and efficient query writing. An essential aspect of this process involves identifying and resolving anomalous queries. The availability of these filters simplifies the detection of such anomalies, enabling prompt corrective measures. Note that each filter may offer distinct metrics that provide valuable insights for additional optimization efforts.
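
One hedged way to surface such anomalies from the raw metadata is to flag queries that ran unusually long, spilled to remote storage, or errored out (the thresholds below are arbitrary):

    -- Candidate anomalies from the last 24 hours
    SELECT
      query_id,
      user_name,
      warehouse_name,
      total_elapsed_time / 1000 AS elapsed_s,
      bytes_spilled_to_remote_storage,
      error_code
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
      AND (total_elapsed_time > 600000             -- longer than 10 minutes
           OR bytes_spilled_to_remote_storage > 0  -- spilled past local disk
           OR error_code IS NOT NULL)              -- failed queries
    ORDER BY total_elapsed_time DESC;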

Understand: The next step entails comprehending the collected data. This involves conducting an in-depth analysis to recognize patterns, trends, and obtain insights regarding query performance and usage. This critical step enables the identification of areas requiring optimization and aids in understanding the root cause of performance issues.

Through the utilization of the fingerprinting technique to group similar queries, users can obtain valuable insights into various aspects of the identified workloads within a specified timeframe. These insights include execution frequency, cost, average latency, and scheduling/frequency. Additionally, observing the trend of grouped queries across different metrics such as execution time, number of rows produced, and costs provides a comprehensive overview.
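
Assuming the same fingerprint column as in the earlier sketch, the trend for a single query group can be pulled from the metadata alone (the placeholder hash is hypothetical):

    -- Day-by-day behavior of one query group over the last 30 days
    SELECT
      DATE_TRUNC('day', start_time)    AS run_day,
      COUNT(*)                         AS executions,
      AVG(total_elapsed_time) / 1000   AS avg_elapsed_s,
      AVG(rows_produced)               AS avg_rows_produced
    FROM snowflake.account_usage.query_history
    WHERE query_parameterized_hash = '<fingerprint>'  -- placeholder value
      AND start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY run_day
    ORDER BY run_day;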

Query grouping greatly simplifies the debugging process for performance issues. It allows users to effortlessly compare selected runs of grouped queries, along with all associated metrics, facilitating a deeper understanding of the underlying factors contributing to performance challenges.

Optimize: The final step entails optimizing the data platform based on the insights obtained from the previous steps. This involves implementing actions to enhance query performance, reduce resource usage, and minimize redundant data scanning costs, among other optimizations. ADOC offers a range of recommendations to optimize queries and improve the performance of your cloud data platform.

By analyzing query history metadata at the grouped-query level, the software can recommend, for example:

  • Right-sizing warehouses based on actual workload demand.
  • Identifying and addressing unused large tables and high-churn tables, and tuning clustering keys on frequently updated tables.
  • Reducing external file listing time.
  • Rewriting SQL to select only the necessary columns, limit query results, apply filters before joining, and avoid cartesian joins.
  • Improving join performance through the use of unique keys, distinct keys, and clustering keys.
  • Speeding up high-frequency small DML operations and ORDER BY / GROUP BY operations.
  • Writing filters that use appropriate equality comparisons for numbers, dates, and short strings, match data types, and target clustered columns.

These recommendations aim to improve overall performance and efficiency across your data platform.
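
As a concrete illustration of the SQL-level suggestions, here is a before/after sketch over hypothetical orders and customers tables: project only the columns you need, and push the filter below the join so less data is scanned and shuffled.

    -- Before: scans every column and filters only after the join
    SELECT *
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.order_date >= '2023-01-01';

    -- After: select only the necessary columns and apply the filter first
    SELECT o.order_id, o.order_date, c.customer_name
    FROM (
      SELECT order_id, order_date, customer_id
      FROM orders
      WHERE order_date >= '2023-01-01'
    ) o
    JOIN customers c ON o.customer_id = c.customer_id;

In practice the optimizer will often push such filters down on its own, but trimming the column list still reduces the bytes scanned on wide tables.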

Establish Continuous Query Optimization

Gaining a thorough understanding of query behavior and identifying performance bottlenecks are essential for maintaining the efficiency and stability of your cloud data platform.

By leveraging query fingerprinting and other observability features, you can delve deeper into your workloads and pinpoint the most resource-intensive queries. This empowers you to optimize those queries and enhance overall performance, leading to a more reliable and cost-effective cloud data platform.

The ability to compare runs of grouped queries alongside their associated metrics simplifies the troubleshooting and debugging process for performance issues. This streamlined approach facilitates efficient problem-solving and ensures timely resolution.

Utilizing the query history metadata available on your cloud data platform enables proactive identification and resolution of performance issues. This measure-understand-optimize (MUO) approach becomes a cyclical process, allowing for continuous monitoring and fine-tuning of the data platform to maintain optimal performance levels.
