Photo by Mollie Sivaram on Unsplash

Data Engineering Best Practices: How Netflix Keeps Its Data Infrastructure Cost-Effective

The Data Observer


Netflix is unquestionably the largest video provider in the world, delivering the most streams to the most customers from the largest video library, which by some estimates is almost four times bigger than that of its closest competitor.

No one knows for sure, but that probably translates into Netflix having the largest data infrastructure in the world. As detailed in a July 2020 post on the Netflix Technology Blog, that includes dozens of data platforms, hundreds of data producers and consumers, 100,000+ server instances on its chosen platform, AWS, and many hundreds of petabytes in its data warehouse alone. At the very least, that rivals the massive data mesh operated by the leading American bank JPMorgan Chase.

Controlling costs is a priority for Netflix. However, Netflix didn’t want to employ the usual strategy of setting budgets and other strong guardrails. Netflix’s infrastructure is entirely in the cloud, and cloud costs are extremely dynamic and complex (which is why they have spawned a whole new financial discipline, cloud data FinOps). Moreover, such heavy-handed practices run counter to Netflix’s culture of “freedom and responsibility,” according to the blog, and are inefficient given Netflix’s globally scattered data infrastructure.

As a result, Netflix went the other way. Instead of setting hard limits on user costs, it decided to empower its data decision makers with as much cost transparency as possible. The centerpiece of this strategy is a custom-built “data efficiency” dashboard that provides a comprehensive source of truth for cost and performance for all Netflix data users and teams.

Source: Netflix Technology Blog

Data Platform Landscape

Netflix’s dozens of data platforms are divided into two categories: data at rest and data in motion. For data at rest, repositories and systems include the S3 data warehouse, Cassandra, Elasticsearch, and others, not all of which appear in the diagram above. For these platforms, costs come from storing the data.

Keystone, Mantis, Spark, Presto, and Flink are among the data-in-motion technologies used by Netflix. Costs here come from computing on and processing the data. These are tracked via job platform logs, API extracts, and a custom monitoring system called Atlas that captures operational metrics for every data object, including CPU usage, memory use, and network throughput.

Each data platform contains thousands of separate data objects owned by various teams and users. A full list of these objects and their sizes is stored on S3.

Source: Netflix Technology Blog

Providing Cost Visibility

Netflix had several goals. First, it wanted a unified view of the data cost for each team. The audience would primarily be engineering and data science teams, and secondarily engineering leaders. To provide that view, it needed to aggregate costs from all of its platforms, which are stored in the Netflix Data Catalog shown above. At the same time, it needed to be able to break costs down into granular units such as database tables, indexes, and data jobs.

Netflix uses AWS billing data to break costs down by service and even by platform. However, this level of detail is insufficient for providing infrastructure costs by data resource or team, so Netflix had to go further. For every data object running on EC2, the likely bottleneck resource was identified: a Kafka data stream may run into network limits, whereas Spark jobs may be throttled by lack of CPU or memory. Each data object’s consumption of that bottleneck resource is then measured using Atlas, platform logs, and REST APIs, and this consumption drives the usage and cost calculations. With S3 data objects, the calculations are much easier: it is simply a matter of multiplying the amount of storage by the cost per byte.
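To make the two kinds of calculation concrete, here is a minimal Python sketch. It is not Netflix's code. The price constant, the function names, and the sample keys are all hypothetical, and the compute part assumes a platform's cost is shared out in proportion to each object's use of its bottleneck resource, a formula this article does not spell out.

```python
from collections import defaultdict

# Hypothetical price in dollars per decimal GB per month; real S3 pricing
# varies by region, volume tier, and storage class.
PRICE_PER_GB_MONTH = 0.023


def s3_cost_by_table(inventory, prefix_to_table, price=PRICE_PER_GB_MONTH):
    """Sum object sizes per table, then convert bytes to a monthly cost."""
    bytes_by_table = defaultdict(int)
    for key, size_bytes in inventory:
        # Attribute each object to the first table whose prefix matches its key.
        for prefix, table in prefix_to_table.items():
            if key.startswith(prefix):
                bytes_by_table[table] += size_bytes
                break
    return {table: b / 1e9 * price for table, b in bytes_by_table.items()}


def split_compute_cost(platform_cost, bottleneck_usage):
    """Share a platform's cost across objects by bottleneck-resource usage.

    Assumption: cost is proportional to usage of the bottleneck resource
    (for example CPU-hours for Spark jobs).
    """
    total = sum(bottleneck_usage.values())
    if total == 0:
        return {name: 0.0 for name in bottleneck_usage}
    return {
        name: platform_cost * used / total
        for name, used in bottleneck_usage.items()
    }


# Made-up sample data for illustration only.
inventory = [
    ("warehouse/plays/dt=2020-07-01/part-0", 4_000_000_000),
    ("warehouse/plays/dt=2020-07-02/part-0", 6_000_000_000),
    ("warehouse/signups/dt=2020-07-01/part-0", 1_000_000_000),
]
prefixes = {"warehouse/plays/": "plays", "warehouse/signups/": "signups"}

storage_costs = s3_cost_by_table(inventory, prefixes)
spark_costs = split_compute_cost(1000.0, {"job_a": 300.0, "job_b": 100.0})
```

The storage half is plain multiplication once objects are mapped to tables. The compute half is where the choice of bottleneck metric matters, since a job that is heavy on CPU but light on network would be charged very differently under each metric.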

Dashboard showing week-over-week cost (annualized) for a specific team by data platform. (Source: Netflix Technology Blog)

The Dashboard

Netflix built its custom dashboard using the cloud-native Apache Druid database. Though not a full SQL database, Druid can ingest high volumes of streaming data and provide simple but instant analytical results on the freshest data. Netflix’s dashboard leverages this to offer different real-time views, depending on whether the audience is engineering and data science teams (above), engineering leaders (below), or others. Costs can be grouped by data resource or by organization, and both snapshot and time-series views are available.

Data costs split by organizational hierarchy. (Source: Netflix Technology Blog)
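The two groupings the dashboard offers, by data resource and by organization, amount to rolling the same cost records up along different dimensions. The Python sketch below illustrates the idea only. Netflix runs these aggregations in Druid, and the record fields, team names, and dollar figures here are made up.

```python
from collections import defaultdict


def roll_up(records, *keys):
    """Total annualized cost grouped by the given record fields."""
    totals = defaultdict(float)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += rec["annualized_cost"]
    return dict(totals)


# Hypothetical cost records, one per (team, platform) pair.
records = [
    {"org": "studio", "team": "encoding", "platform": "S3", "annualized_cost": 1200.0},
    {"org": "studio", "team": "encoding", "platform": "Spark", "annualized_cost": 800.0},
    {"org": "studio", "team": "analytics", "platform": "Presto", "annualized_cost": 500.0},
    {"org": "product", "team": "growth", "platform": "S3", "annualized_cost": 300.0},
]

by_org = roll_up(records, "org")                    # view for engineering leaders
by_team_platform = roll_up(records, "team", "platform")  # view for a single team
```

Keeping one flat set of records and choosing the grouping at query time is what lets a single source of truth serve both audiences.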

Not Just Data, but Cost-Saving Recommendations

For scenarios where the potential savings justify the investment, Netflix has started to go beyond dashboard-based insights. For instance, all data has a shelf life. Data that was once heavily used may be accessed less as it ages, justifying its move from fast-but-pricey ‘hot’ storage to slower, less-expensive ‘cold’ storage, or even its deletion. However, Netflix found that its data owners were bad at estimating when data could be deleted or moved to colder storage. They needed help determining the optimal Time To Live (TTL) for stored data, so Netflix began calculating TTL recommendations and sharing them with data owners.

The first beneficiaries of automated TTL recommendations were the owners of data tables stored in Netflix’s S3 big data warehouse, which is hundreds of petabytes in size. Netflix analyzes access logs and prefix-to-table-partition mapping to determine which tables are being accessed. It then suggests a date when it would be safe to delete each table, calculates the cost savings, and then presents those recommendations to the data owners.
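A rough Python sketch of how such a recommendation could be derived is shown below. It is an illustration under stated assumptions, not Netflix's algorithm: it assumes the access logs have already been mapped to (partition date, access date) pairs, it takes the oldest partition age ever read plus a safety margin as the TTL, and the margin, price, and function names are hypothetical.

```python
from datetime import date


def recommend_ttl_days(accesses, margin_days=7):
    """Suggest a TTL from (partition_date, access_date) pairs.

    Assumption: data older than the oldest age ever read, plus a margin,
    is safe to expire.
    """
    ages = [(accessed - partition).days for partition, accessed in accesses]
    if not ages:
        return None  # No observed reads; leave the decision to the owner.
    return max(ages) + margin_days


def monthly_savings(partitions, ttl_days, today, price_per_gb_month=0.023):
    """Monthly storage cost of partitions that the TTL would expire."""
    expired_bytes = sum(
        size for pdate, size in partitions if (today - pdate).days > ttl_days
    )
    return expired_bytes / 1e9 * price_per_gb_month


# Made-up sample data for one table.
today = date(2020, 7, 31)
accesses = [
    (date(2020, 7, 1), date(2020, 7, 3)),    # read at 2 days old
    (date(2020, 6, 20), date(2020, 7, 10)),  # read at 20 days old
    (date(2020, 7, 28), date(2020, 7, 29)),  # read at 1 day old
]
partitions = [
    (date(2020, 1, 1), 50_000_000_000),
    (date(2020, 6, 1), 30_000_000_000),
    (date(2020, 7, 20), 20_000_000_000),
]

ttl = recommend_ttl_days(accesses)
savings = monthly_savings(partitions, ttl, today)
```

In this example the oldest read was of 20-day-old data, so the suggested TTL is 27 days, and the two partitions older than that account for the estimated savings. The savings figure is what would be shown to the table owner alongside the suggested TTL.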

Source: Netflix Technology Blog

To avoid email overload, Netflix sends monthly TTL recommendations only to data warehouse table owners who could realize significant data storage savings.

Positive Outcomes All Around

Overall, the data efficiency dashboard has been a success for Netflix, providing “great leverage in tackling efficiency,” according to the blog post. In particular, the dashboards and the automated table TTL recommendations have delivered “high ROI” at its massive data warehouse, helping Netflix reduce its data warehouse storage footprint by ten percent, which translates to multiple tens of petabytes of data.

According to Netflix, there has been only one challenge with the dashboard: delivering cost trend views over time. The reasons include inconsistent data, data ingestion latency, and changes in data resource owners.

As for next steps, Netflix identified two opportunities: 1) “using different storage classes for (data) resources based on usage patterns,” and 2) “identifying and aggressively deleting upstream and downstream dependencies of unused data resources.”

While Netflix has the massive environment and the corresponding budget to justify building its own data cost and data efficiency tools, other companies may decide that deploying a multidimensional data observability platform is a much easier, more economical, and more fruitful route to value engineering and cloud cost optimization.

Acceldata provides such a platform, one that can help companies save big on their data costs. Pulse, for instance, helps companies save millions of dollars annually by helping them offload unnecessary, over-provisioned software and optimize capacity planning.

Torch, our data quality and reliability solution, provides automated, ongoing data discovery and cataloging. This ensures that all datasets, wherever they are stored, are visible to all users in the system through a centralized inventory. This prevents the growth of expensive data silos and eliminates redundant data. It also helps users easily find the best datasets for their application. This creates a culture of data cost efficiency and reuse that reduces the proliferation of new datasets and data pipelines, keeping your costs under control.

Learn more about how Acceldata can help you emulate Netflix’s cost control strategy — get a demo here.
