Data Engineering Best Practices: How Big Tech & FAANG Firms Manage and Optimize Apache Kafka
The popular open-source messaging/streaming system, Apache Kafka, is a key enabler for some of the most data-driven and disruptive companies today.
Uber, which processes trillions of messages and multiple petabytes of data each day with Kafka, calls it the “cornerstone of our technology stack.”
LinkedIn, which created and open-sourced Kafka and still drives much of its development, processes 7 trillion messages per day using Kafka (a 2019 statistic that is no doubt much higher today).
Meanwhile, Chinese social media company Tencent (maker of the popular WeChat and QQ instant messaging apps) processes more than 10 trillion messages per day using Kafka.
Since Kafka was open-sourced in 2011, a plethora of alternative event streaming, messaging and pub/sub (publish-subscribe) systems have risen to challenge Kafka: Flink, RabbitMQ, AWS Kinesis, Google Pub/Sub, Azure Event Hub, and others. All claim some combination of easier manageability, lower cost, and/or similar near-real-time performance as Kafka.
While some Big Tech companies like Spotify have responded by moving off Kafka, many others like Twitter continue to deploy Kafka or expand their use.
Overall, Kafka remains dominant due to its vaunted reliability, massive scalability, wide compatibility with other data and analytics tools, and flexibility, as it can be run on-premises, hosted in any number of public cloud providers, or as a fully-managed cloud-native service such as Confluent.
Nevertheless, Kafka’s reputation as being complicated for companies to set up and manage and challenging to optimize is not undeserved.
In this blog, I’ll detail how Big Tech companies manage and optimize their cutting-edge Kafka deployments. And I’ll also explain how companies that aren’t operating with the scale, budgets, and engineering manpower of a Walmart, LinkedIn or Uber can still efficiently manage and optimize their Kafka systems.
(Read other blogs in our series on Data Engineering Best Practices, including how:
- Facebook’s unified Techtonic file system creates efficiency from exascale storage
- Spotify upgraded its event streaming and data orchestration platforms
- LinkedIn scaled its analytical data platform to beyond one exabyte
- Netflix keeps its massive data infrastructure cost-effective
- JP Morgan Chase optimizes its data operations using a data mesh)
Walmart: Vaunted Real-Time Supply Chain Powered by Kafka
Key Stats (May 2022): Tens of billions of messages from 100 million SKUs in three hours for real-time replenishment system. Peak throughput of 85 GB per minute.
The largest brick-and-mortar retailer in the world, Walmart is also one of the most innovative IT users, investing $12 billion annually on IT, just behind Amazon and Google’s parent company, Alphabet.
Walmart is well-known for the size, speed, and efficiency of its global supply chain, which draws from 100,000 vendors and ships to 12,000 stores and delivery locations.
Crucial to orchestrating this is Kafka. Walmart houses its real-time inventory data using Kafka Streams and a Kafka connector to ingest the data into Apache Cassandra and other databases. Its real-time replenishment system, which resupplies Walmart warehouses when goods are low, also relies on Kafka.
Building these systems to support Walmart’s scale and complexity was not trivial. For its inventory system, Walmart had to solve three challenges:
- Diversity of inventory. There were more than 10 event sources, all using different schemas. Rather than trying to force everyone on to a single standard, Walmart’s central IT team created a “smart transformation engine” that converted all ingested data into a common standard for storage (see below).
- Scale. Partitions allow Kafka to scale, but doing so efficiently and cost-effectively required a lot of under-the-hood testing and optimization by Walmart’s data engineers.
- Maximizing the underlying database. To reduce latency and unreliable data, Walmart designated a specific partition for every item store.
Get more details from this Kafka Summit presentation.
As for real-time replenishment of its massive distribution centers, Walmart built a messaging system that minimized cycle times and complexity, maintained high accuracy and speed, and was both elastic and resilient. Using Kafka along with Apache Spark in a micro-batch architecture, data is streamed via 18 Kafka Broker servers each managing 20+ Topics (feeds) with 500+ partitions. Walmart can process tens of billions of messages from 100 million different product SKUs in less than three hours, with peaks of up to 85 GB per minute.
This is analyzed in a planning engine that factors in existing inventory, forecasts, lead times, shipping times, etc. The replenishment plans are then published through different Kafka Topics to Kafka Consumers in near real-time.
Walmart’s replenishment system runs in a multi-tenant architecture to support the 24 countries where it has stores. Active-passive data replication lets Walmart switch to a secondary data center if needed. To ensure no messages are lost, Walmart turned on Kafka message retries and developed an in-house audit and replay system. This keeps inventory accurate and supplies on schedule. Real-time alerts ensure the data engineers know of any problems immediately.
The outcome of all this data engineering work has been overwhelmingly positive, “ensuring timely and optimal availability of needed stock using Kafka and its ecosystem,” and enabling Walmart “to continue to lead retail industry innovation for many years to come.”
Uber: Enabling Real-Time Ad Hoc Queries through Presto and Kafka
Key Stats: Trillions of messages and multiple PB processed daily (April 2022)
Uber was not exaggerating about Kafka’s central role in its Big data stack. Kafka ingests all app data activity from its 93 million riders and 3.5 million drivers worldwide, as well as from APIs, database changelogs, and more.
Kafka feeds this data into these workflows:
- Flink for internal streaming analytics, alerts and dashboards
- Hadoop data lake that serves several hundred thousand queries per day as well as 2,000 unique analytic dashboards through Hive, Presto and Spark
- ElasticSearch for debugging
- Pinot and Presto for low-latency, TB-scale queries to enable real-time visibility and management of the overall Uber Data Platform and the Uber Eats delivery service
To head off scalability and performance issues with Kafka, Uber custom-built a Consumer Proxy that sits in between the regular Kafka cluster and the consumer service. Similar to Tencent’s proxy layer profiled later, this allows Uber to add partitions or clusters and make other operational changes without forcing downstream users and developers into potentially-tricky reconfigurations. Consumer Proxy has become the “primary solution for async queuing use cases” with hundreds of services using it. Learn more here.
One key use case enabled by Kafka was real-time, ad hoc queries. Data scientists, engineers and ops team members were clamoring for it in order to manage performance, debug customer service issues, etc. However, such exploratory and drill-down queries were taking up to ten minutes — far too long for effective root cause analysis. Uber rejected using Apache Pinot due to the data engineering required. Instead, it turned to Presto. The open-source distributed SQL query engine was already in wide use inside Uber due to its many data source connectors.
To ensure good performance for Presto-to-Kafka queries and reliability of its Kafka data, Uber made several key decisions for its data architecture (above). It required all Presto queries going to Kafka to filter by _timestamp or _partition_offset to avoid reading too much data and causing slowdowns or query timeouts. Queries without these filters are automatically rejected.
Uber also set a quota in its Kafka Brokers to limit the number of concurrent Kafka messages that Presto could consume. This would help avoid degrading the Kafka clusters. To enable users to self-service create new Topics, it enabled Presto to proactively discover cluster and metadata updates from Kafka (see below).
With these small-but-mighty configuration changes, Uber was able to accelerate ad hoc queries from tens of minutes to a few seconds using Presto-Kafka. Besides the speed increase, users say “data analysis is much easier.”
Twitter: Supporting Personalized User Content with Kafka
Key Stats: 400 billion events and more than a petabyte of data per day. (October 2021).
For much of the 2010s, Twitter used its own custom-built system called EventBus. In 2018, it moved to Kafka. This was not expected. Twitter had actually used Kafka in 2012 right after it was open-sourced, but found it “unsuitable” for being slow and unreliable. It decided to take a second look at Kafka for two reasons:
- New features in Kafka such as data replication
- Hardware advances, such as cheaper fast SSDs and faster server network interfaces
This was confirmed when it evaluated Kafka versus EventBus and found Kafka “had significantly lower latency” for many different workloads. Moreover, Kafka’s price-performance made it 75 percent cheaper to run than EventBus.
Upgrading to Kafka was part of Twitter’s overall move from a batch-based Lambda architecture to a streaming-first Kappa architecture (above) that combines Kafka servers in Twitter’s data centers and databases/data warehouses in Google Cloud. To guarantee no messages are lost between Twitter and Google Cloud, messages are re-sent until Kafka receives an at-least-once delivery notification from Google.
Twitter’s Kappa architecture is simpler, faster, and less-expensive. Real-time compute throughput is now 10–20x faster (up to 1 GB per second). Data latency shrinks from between 10 minutes and 24 hours to just ten seconds maximum. And using a deduplication process inside Dataflow, Twitter also improved its data accuracy and quality.
Using this new Kappa architecture as its foundation, Twitter has made other improvements. It rebuilt a machine learning logging pipeline using Kafka. Twitter’s Home timeline prediction system analyzes billions of tweets daily to show the most relevant tweets to Twitter users.
Using Kafka and Kafka Streams, Twitter cut the analysis turnaround from seven days to just one day, while also improving the quality of its ML model This ensures the presented tweets are always as relevant and engaging as possible.
Kafka is also used with Apache Druid to support ad hoc and exploratory analytics. Twitter has even moved an event storage system from Hadoop to Kafka. Learn more here.
LinkedIn: How Kafka’s Creator Pushes the Innovation Envelope
Key Stats (October 2019): 100 Kafka clusters, 4,000 Brokers, 100,000 Topics and 7 million partitions. 7 trillion messages processed daily.
As its creator, Kafka remains a “core part” of the infrastructure at LinkedIn, powering user activity tracking, message exchanges, metric gathering, and more. Witness the key statistics above, which barely capture the importance of Kafka to LinkedIn (architecture diagram below).
To support its scale needs, LinkedIn continues to do a lot of cutting-edge development on Kafka. It runs an advanced version of Kafka in production that contains new features and hotfixes not yet available in the open source code. These patches are eventually released to the community via GitHub with an -li suffix to differentiate them from the base Apache version.
LinkedIn’s Kafka engineers have created many patches improving the software. Some fixed slow controllers. Others reduced startup/shutdown time for Kafka Brokers. Another created a maintenance mode for Kafka Brokers to enable replicas (data) to be moved out of failing brokers smoothly. There are other improvements around usage for billing, and reducing data loss. Learn more.
Tencent: Federated Proxy Design Creates Massive Scale and Fault Tolerance
Key Stats (August 2020): 10 trillion+ messages per day, 1,000 physical Kafka nodes.
Tencent’s social media services are among the biggest in the world. WeChat has more than 1.26 billion monthly active users, while its youth-oriented QQ app has nearly 600 million active users.
Tencent’s Platform and Content Group provides the infrastructure for WeChat, QQ, videos, and other online services and content. It needed a multi-tenant, gigantic pub/sub system for ingesting logs across different regions, machine learning, and communication between microservices. One that could be provisioned instantly without disrupting existing workloads as well as tolerate single-node and cluster failure.
Tencent had three other major technical challenges:
- The scale and burstiness of its workload. Product promotions and large-scale experiments can cause Tencent’s messaging volumes to burst up to 20x from one day to the next. At its peak, a single one of Tencent’s many services may need to transfer 4 million messages per second (64 GB/second).
- Strict SLAs. As Tencent drives towards real-time analytics, it needed data to be ready for analytics within a few seconds, and with “an end-to-end loss rate as low as 0.01%.”
- Flexibility. Tencent needed to customize its messaging system quickly to adjust for different users, their queries, and the content they needed.
Tencent chose to build Kafka in the cloud with Confluent using a proxy layer that federated multiple Kafka clusters (1,000 nodes). This proxy layer presents topics to Kafka clients and internally maps them to the actual physical topics in each Kafka cluster (see below):
Building this proxy layer required upfront data engineering work. For instance, Tencent built a lightweight name service to map the relationships between Kafka clients and proxy servers. It also had to build a separate Kafka controller service to manage all of the federated cluster’s metadata, such as topic locations and the health of the nodes.
However, this extra layer of abstraction gave Tencent the massive scalability and fault tolerance it needed. For instance, Tencent can expand its data pipeline capacity with “little (re)synchronization overhead” by adding clusters without needing to move existing data. Tencent can also migrate physical clusters without affecting existing applications or logical topics, minimizing operational hassle for downstream users.
Tencent also developed its own monitoring and automated management tools to manage its several hundred Kafka clusters. This helps enable Tencent to initialize a federated cluster in an average of ten minutes, add a single physical cluster in two minutes, refresh its metadata in a second, and more.
By solving these challenges, Tencent can support sending more than 10 trillion messages a day, the largest confirmed figure by any company to date. Tencent says it is a “proud” Kafka user that has “successfully pushed some of Kafka’s boundaries” and plans to “further extend its capabilities.”
How to Optimize Kafka without the Manpower of Big Tech
As shown above, setting up Kafka and engineering it for maximum performance, reliability and cost-effectiveness takes a lot of work.
A slight misconfiguration to one of many Kafka cluster settings can lead to lost data. Adding new Kafka Brokers incorrectly can hurt Kafka performance. Data is difficult to observe. And ad hoc data transformations can be overly complex unless properly engineered.
A modern multi-dimensional cloud data observability platform can make the task of managing and optimizing Kafka much easier. Built on a Spark engine, Acceldata provides continuous visibility and insights into your real-time Kafka data pipelines, as well as cleanse and validate your Kafka data streams in real time.
One of our customers, PubMatic, is one of the largest adtech companies in the United States, serving up 200 billion ad impressions and processing 2 petabytes of data per day. To handle this, PubMatic had 4,000 Hadoop server nodes fed by 50 Kafka clusters, each with between 10–15 nodes. Due to the scale, PubMatic was suffering from frequent outages, performance bottlenecks, and slow MTTR (Mean Time To Resolutions). Then it deployed Acceldata Pulse, which gave its data engineers the visibility as well as the correlated insights to help them improve data reliability of the Kafka data pipelines, optimize price-performance, and consolidate its Kafka and Hadoop servers and save millions of dollars in software licenses.
Another customer, Asian telecom True Digital, was suffering similar performance and scalability issues that left half of the data streamed from Kafka into Hadoop unprocessed. After deploying Pulse, True Digital improved its data performance and enabled the smooth, reliable ingestion of 69,000 messages per second from Kafka to Hadoop.
Still another customer, Oracle, improved its Kafka performance and costs by 40 percent after deploying Pulse, enabling its Customer Data Platform to beat all SLAs while boosting the productivity of its data engineers three-fold.
In our constant quest to help data teams optimize data processing reliability, scale, and cost, we recently deployed a new Kafka utility called Kapxy inside Pulse. Kapxy provides previously-unavailable Producer-Topic-Consumer relationship metrics. With the below visualization in Pulse’s Kafka dashboard, engineering teams can quickly get to the root causes of Kafka performance issues and bottlenecks, and also optimize price-performance.
See the blog by our CEO Rohit Choudhary on how to diagnose the lag in your Kafka cluster and book a demo to see how our platform can help you powerfully and efficiently manage Kafka and the rest of your modern data stack.