Data Engineering Best Practices: How a Data Mesh Helps JP Morgan Chase Optimize Data Operations
JP Morgan Chase is one of the largest banks in the world, with nearly $130 billion in revenue in 2020. With 50,000 IT employees and an annual IT budget of $12 billion, the company invests heavily to ensure its technology gives it a competitive advantage. One example is JPMorgan Chase’s data infrastructure, which spans more than 450 petabytes of data serving more than 6,500 applications, one of which processes 3 billion messages a day, according to a presentation at AWS re:Invent 2021.
The bank recognizes the importance of data and shares it widely internally. Yet in a highly regulated industry such as banking, making data too accessible can also lead to disaster.
“To unlock the value of our data, we must solve this paradox,” wrote JPMorgan officials in a 2021 blog on Amazon’s AWS site. “We must make data easy to share across the organization, while maintaining appropriate control over it.”
Like any large enterprise, JPMorgan had a lot of data stored in relational databases. As an early big data proponent, JPMorgan had also adopted Hadoop widely, using it to build a monolithic on-premises data lake managed by a central data engineering team. While Hadoop continues to play a key role in analytics at JPMorgan, the bank also recognized that embracing the public cloud could decentralize data ownership and encourage data democratization and business innovation.
JPMorgan Chase first created a comprehensive data architecture based on the concept of “data products”: collections of related data that may or may not map to existing business lines or even IT systems. For instance, one JPMorgan Chase data product includes all the data around wholesale credit risk, such as credit exposure, credit rating, and credit facility, harvested from many different data stores and applications. Another data product focuses on trading and position data, including cash, derivatives, securities, and collateral. Using the term “data product” instead of dataset, repository, or even data asset is meant to create a shift in mindset by highlighting the goal: enabling data to produce business results rather than accumulate dust in some forgotten database, according to James Reid, JPMorgan CIO for Employee Experience and Corporate Technology, in a July 2021 presentation.
Source: JP Morgan
Each data product is curated and owned by a team that includes a business owner, a technical owner, and multiple data engineers. Each team owns and deeply understands its specific data product, its uses, its limitations, and its management requirements. At the same time, giving each data engineering team end-to-end ownership of a domain encourages and empowers it to consolidate any “data puddles” and “data ponds” under its management that feed a JP Morgan Chase data lake, said Reid.
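The data-product idea above can be sketched as a small metadata record. This is a minimal illustration, assuming hypothetical field and owner names; it is not JPMorgan Chase's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataProduct:
    """Illustrative metadata for one data product and its owning team."""
    name: str
    business_owner: str            # accountable business lead (hypothetical)
    technical_owner: str           # accountable engineering lead (hypothetical)
    source_systems: List[str] = field(default_factory=list)  # feeding data stores

# The wholesale credit risk product described in the article, with
# hypothetical owner and system names:
wholesale_credit_risk = DataProduct(
    name="wholesale-credit-risk",
    business_owner="credit-risk-business-lead",
    technical_owner="credit-risk-tech-lead",
    source_systems=["credit-exposure", "credit-rating", "credit-facility"],
)
```

Modeling the product this way makes the ownership contract explicit: every product has named business and technical owners and a declared list of the upstream systems it consolidates.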
Each data product is stored in its own physically isolated data lake. While most are stored on Amazon S3, some remain in on-premises repositories due to regulatory requirements, said Reid.
All of these data lakes are cataloged by AWS Glue, Amazon’s serverless data integration tool. In addition, there are consuming applications used by employees that are physically separated from each other as well as from the data lakes. These separate, but interconnected, domains create JPMorgan’s data mesh.
Amazon AWS cloud services interconnect the distributed domains. AWS Glue Data Catalog enables applications and users to find and query the data they need. This enterprise-wide data catalog is automatically updated as new data is ingested into the data lakes, checked for data quality, and curated by data engineers with domain expertise.
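Catalog-driven discovery like this typically goes through the AWS Glue `SearchTables` API. The sketch below only builds the request parameters, so it runs without an AWS account; the search text and result handling are illustrative, not JPMorgan's actual usage.

```python
def build_search_request(text: str, max_results: int = 10) -> dict:
    """Parameters for the AWS Glue SearchTables API call (boto3: search_tables)."""
    return {"SearchText": text, "MaxResults": max_results}

# With AWS credentials configured, the real call would look like:
#   import boto3
#   glue = boto3.client("glue")
#   resp = glue.search_tables(**build_search_request("credit exposure"))
#   for table in resp["TableList"]:
#       print(table["DatabaseName"], table["Name"])
```

Because the catalog is updated automatically on ingest, a search like this reflects the current state of every domain's lake without the consumer needing to know where the data physically lives.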
Source: JP Morgan
The catalog also tracks and audits every data request that flows from data to applications. This gives JPMorgan Chase data engineers a single point of visibility into how their data is being used, which is key to remaining compliant with the many regulations the bank faces. This metadata also helps users find relevant, trustworthy data they are entitled to use.
Meanwhile, AWS Lake Formation enables data to be securely shared to approved applications and users. Neither applications nor users are ever allowed to copy or store data. This reduces storage costs and prevents the creation of “dark” data silos that lose freshness and accuracy over time, creating data quality and security problems. And without extra copies of data floating around, it’s easier to manage data and enforce policies and access controls.
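The read-in-place access model above maps to a Lake Formation grant: a principal gets `SELECT` on a cataloged table and nothing more, with no grant option, so data is queried where it lives rather than copied. The helper below builds the parameters for the `GrantPermissions` API; the role, database, and table names are hypothetical.

```python
def build_grant_request(principal_arn: str, database: str, table: str) -> dict:
    """Parameters for AWS Lake Formation GrantPermissions (boto3: grant_permissions)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],         # read-only: consumers query in place
        "PermissionsWithGrantOption": [],  # consumers cannot re-share the data
    }

# Hypothetical example: a trading-analytics role gets read access to a
# derivatives table in the trading-and-position domain.
req = build_grant_request(
    "arn:aws:iam::111122223333:role/trading-analytics",
    "trading_and_position",
    "derivatives",
)
# The real call, with credentials configured, would be:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**req)
```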
Source: JP Morgan
Finally, JPMorgan Chase uses a trio of cloud-based engines to query the data: Amazon Athena and Amazon Redshift Spectrum for SQL queries, and Amazon EMR for non-SQL data processing. Machine learning is done via Amazon SageMaker.
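Querying a data-product lake with Athena follows the same pattern: the SQL runs against a Glue-cataloged database and results land in a designated S3 location. The sketch below only assembles the `StartQueryExecution` parameters; the database, table, and bucket names are hypothetical.

```python
def build_athena_request(sql: str, database: str, output_bucket: str) -> dict:
    """Parameters for Amazon Athena StartQueryExecution (boto3: start_query_execution)."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},  # Glue catalog database
        "ResultConfiguration": {
            "OutputLocation": f"s3://{output_bucket}/results/"
        },
    }

# Hypothetical query against the wholesale credit risk data product:
req = build_athena_request(
    "SELECT counterparty, SUM(exposure) FROM credit_exposure GROUP BY counterparty",
    "wholesale_credit_risk",
    "example-athena-results",
)
# With credentials configured, the real call would be:
#   import boto3
#   boto3.client("athena").start_query_execution(**req)
```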
For JPMorgan Chase, its Amazon cloud-based data mesh satisfies three key technical priorities: high security, high availability, and easy discoverability. And that is supporting the outcomes JPMorgan hopes to achieve with its data: cost savings, business value, and data reuse.
With a framework for instantiating data lakes that uses a data mesh architecture, JPMorgan Chase was able to share data across the enterprise while giving data owners the control and visibility they need to manage their data effectively.