Most organizations have a complex and sometimes chaotic collection of data storage and processing platforms. Between acquisitions, new line-of-business needs and organic growth, a typical enterprise could have several databases and data warehouses, analytics platforms with different user communities, and data transformation routines dictated by short-term needs instead of long-term strategy.
Data Fabric

A data fabric is an architecture that unifies all of these disparate data sources and applications in a secure, automated fashion without changing where or how that data is stored. It offers data access without data migration. This connected architecture makes it easier, faster and more secure for organizations to deploy data-driven applications and automation, and exposes more data-driven insights to business users in a self-service fashion.
Instead of changing where or how data is stored, the data fabric is an overlay, connecting data with analytics and users wherever needed. While the data itself remains distributed across several on-premises and cloud resources, the fabric allows it to appear unified to end users.
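To make the overlay idea concrete, here is a minimal sketch, assuming a PySpark environment with a PostgreSQL JDBC driver and a cloud storage connector available, of a federated query that reads an on-premises database and a cloud object store in place and joins them in the compute layer. All connection details, paths and column names are hypothetical; a full data fabric would layer a metadata catalog, automation and unified security on top of this kind of access.

```python
# Illustrative sketch only: a federated query that reads two sources in
# place rather than migrating them into a central store first. Assumes a
# PySpark environment with a PostgreSQL JDBC driver and an S3 connector
# on the classpath; all connection details, paths and columns are
# hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fabric-style-federation").getOrCreate()

# Customer records stay in the on-premises relational database.
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://onprem-db.internal:5432/crm")  # hypothetical host
    .option("dbtable", "public.customers")
    .option("user", "analyst")
    .option("password", "<supplied-by-secret-manager>")
    .load()
)

# Clickstream events stay in cloud object storage as Parquet files.
events = spark.read.parquet("s3a://analytics-bucket/clickstream/")  # hypothetical path

# The join happens in the compute layer; neither dataset is copied into
# a central warehouse beforehand.
engagement_by_segment = (
    customers.join(events, on="customer_id", how="inner")
    .groupBy("segment")
    .count()
)
engagement_by_segment.show()
```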
Data Lakehouse
As the name implies, a data lakehouse is the newest iteration of data storage, blending the concepts of a data lake and a data warehouse, both of which were created to address the limitations of standalone databases.
Data lakes store both relational and non-relational data and neither capture nor impose structure. They’re a cost-efficient and thorough dump site for all of an organization’s data, but some data lake projects failed over scalability and access-control problems, earning the pejorative label “data swamps.” Data warehouses provide more powerful access to structured data in relational databases but don’t incorporate an organization’s wealth of unstructured data. “Those didn’t scale. They didn’t allow organizations to be as agile and flexible as they needed to be,” Stoop said.
A data lakehouse is where all data comes together to solve critical business challenges. Data lakehouses make it easier for organizations to address structured and unstructured data (such as free-response text) through a single interface, while enabling them to add a variety of lightweight dashboards and rigorous analytical tools over time. This combines the curation, precision, completeness and tight governance of the data warehouse with the freedom, flexibility and granularity of the data lake, and can improve time to insight.
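As a rough illustration, the following sketch assumes a Spark environment with the open-source Delta Lake table format configured (other open table formats such as Apache Iceberg follow a similar pattern). It lands both curated, structured orders and semi-structured survey text in the same object store and queries them through one SQL interface; all paths, tables and columns are hypothetical.

```python
# Minimal lakehouse-style sketch using an open table format (Delta Lake
# shown; Apache Iceberg works similarly). Assumes a Spark session already
# configured with the delta-spark package; all paths, tables and columns
# are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Structured data: curated orders with an enforced schema, written as a
# governed table on inexpensive object storage.
orders = spark.read.parquet("s3a://lake/raw/orders/")
orders.write.format("delta").mode("overwrite").save("s3a://lake/curated/orders")

# Less-structured data: free-response survey text lands in the same lake
# and becomes queryable through the same interface.
surveys = spark.read.json("s3a://lake/raw/survey_responses/")
surveys.write.format("delta").mode("append").save("s3a://lake/curated/surveys")

# One SQL interface over both kinds of data.
spark.sql("CREATE TABLE IF NOT EXISTS orders USING DELTA LOCATION 's3a://lake/curated/orders'")
spark.sql("CREATE TABLE IF NOT EXISTS surveys USING DELTA LOCATION 's3a://lake/curated/surveys'")
spark.sql("""
    SELECT o.region,
           COUNT(*)               AS order_count,
           COUNT(s.response_text) AS comment_count
    FROM orders o
    LEFT JOIN surveys s ON o.customer_id = s.customer_id
    GROUP BY o.region
""").show()
```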
Data lakehouses emphasize access based not only on user roles but also on data classification attributes, easily examined and modified protocols around governance and data retention, and the ability to distribute both storage and computational analysis resources across a hybrid of on-premises and cloud systems.
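The classification-based access idea can be sketched conceptually. The snippet below is not the API of any particular governance product, just an illustration of deciding access from data classification attributes rather than user roles alone; every name and label in it is hypothetical.

```python
# Conceptual sketch, not the API of any specific governance product:
# access is decided from data classification attributes, not user roles
# alone. All names and labels are illustrative.
from dataclasses import dataclass

@dataclass
class ColumnMetadata:
    name: str
    classification: str  # e.g. "public", "internal", "pii"

@dataclass
class User:
    name: str
    clearances: set[str]  # classifications this user is allowed to read

def readable_columns(user: User, columns: list[ColumnMetadata]) -> list[str]:
    """Return only the columns whose classification the user is cleared for."""
    return [c.name for c in columns if c.classification in user.clearances]

catalog_entry = [
    ColumnMetadata("order_id", "internal"),
    ColumnMetadata("email", "pii"),
    ColumnMetadata("region", "public"),
]
analyst = User("ana", clearances={"public", "internal"})
print(readable_columns(analyst, catalog_entry))  # ['order_id', 'region']
```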
Data Mesh

Many data management complexities stem from a decades-old tradition of treating data and related architecture as projects. Even if a particular project—such as targeted marketing messages based on data-driven profiles—could become evergreen, most likely the tools and techniques used to implement the particular solution were established by a small team for a narrow purpose. Over time, this narrow focus complicates design, obscures ownership and creates cumbersome rules throughout the organization for access and influence over data.
A data mesh seeks to address these problems structurally, rather than technologically. It establishes data as a fundamental product (rather than a project). A team of internal experts takes charge of one or more data domains, and establishes rules for data workflow and delivery to end users. For example, the marketing department is responsible for bundling all marketing-related data, and the financial department bundles financial figures. In contrast to the centralization provided by data fabric, these domain experts act in a decentralized mode—but in accordance with uniform standards of interoperability and governance.
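Because a mesh is organizational rather than technological, any code can only be illustrative. The sketch below shows, in plain Python with hypothetical names, the kind of contract a domain team might publish for a data product so that it meets mesh-wide standards for ownership, schema and governance.

```python
# Purely illustrative: a plain-Python description of the contract a domain
# team might publish for one of its data products, checked against
# mesh-wide standards. Field names and labels are hypothetical.
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class DataProduct:
    domain: str               # owning domain, e.g. "marketing" or "finance"
    name: str                 # product name exposed to consumers
    owner: str                # accountable team
    schema: dict              # column name -> type: the published contract
    freshness_sla: timedelta  # how stale the data may become
    classification: str       # governance label applied uniformly across the mesh

def meets_mesh_standards(product: DataProduct) -> bool:
    """Interoperability and governance checks applied to every domain's products."""
    return (
        bool(product.owner)
        and bool(product.schema)
        and product.classification in {"public", "internal", "pii"}
    )

campaign_touches = DataProduct(
    domain="marketing",
    name="campaign_touches",
    owner="marketing-data-team",
    schema={"customer_id": "string", "campaign": "string", "touched_at": "timestamp"},
    freshness_sla=timedelta(hours=24),
    classification="internal",
)
assert meets_mesh_standards(campaign_touches)
```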
That decentralization drives greater flexibility, agility and value: “Domains also make their own choices from an implementation perspective using the self-service data infrastructure, also part of the mesh,” said Wim Stoop, senior director of product marketing at the hybrid platform provider Cloudera. The resulting data products are controlled by domain experts but made available to data scientists and business analysts across the organization, who can combine and remix them to suit their needs.
“Unlike a data fabric or a data lakehouse, a data mesh is not a piece of technology or something you can buy,” said Stoop. “It’s a mindset around changing the people and processes connected to data; technology plays but a supporting role.”
Next Steps For Data Leaders
Embracing next-gen architectures is about evolution—not throwing out your entire data regime and starting over. But which direction should you choose? These questions will help you determine the right course for your organization.
Is there a single roadmap to adopting these approaches?
Not necessarily. Business needs and technical legacies dictate the best first move. “It very much depends on your maturity as an organization, from a data and analytics perspective, which of these next-generation modern architectures will speak to you most,” Stoop said. For example, organizations ingesting large volumes of unstructured data but struggling to extract value may want to pursue a data lakehouse first. A data mesh, meanwhile, requires independent cross-functional teams that have embedded data engineers, data product owners and data scientists. “Organizations that have centralized those skills may experience a shortage of each as they decentralize,” Stoop said.
Where do we begin?
A data fabric can help many organizations address the most common limitation of an existing data regime: the lack of a firm grasp on their own data. “Start by knowing and understanding your data. That is something a data fabric is really good at exposing,” Stoop said. “Data fabric excels at highlighting your data sources, ingesting them and investigating them to understand the business relevance of that data.”
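As a hypothetical illustration of that first “know your data” step, the sketch below walks a list of registered sources and builds a simple inventory of their tables and columns using SQLAlchemy. The connection strings are placeholders; a data fabric product would perform this discovery continuously and enrich it with lineage and business context.

```python
# Hypothetical sketch of the "know your data" step: walk a list of
# registered sources and build a simple inventory of tables and columns.
# Connection strings are placeholders; a data fabric does this
# continuously and enriches the result with lineage and business context.
from sqlalchemy import create_engine, inspect

SOURCES = {
    "crm": "postgresql://analyst:PASSWORD@onprem-db.internal:5432/crm",      # hypothetical
    "billing": "mysql+pymysql://analyst:PASSWORD@billing.internal/billing",  # hypothetical
}

inventory = {}
for source_name, url in SOURCES.items():
    inspector = inspect(create_engine(url))
    inventory[source_name] = {
        table: [column["name"] for column in inspector.get_columns(table)]
        for table in inspector.get_table_names()
    }

for source_name, tables in inventory.items():
    print(f"{source_name}: {len(tables)} tables discovered")
```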
If you have solid governance, the next frontier is to further streamline the conversion of data into value. Organizations that have accumulated vast stores of data across on-premises and cloud-based storage but have struggled to realize business benefits may need more structure, which a data lakehouse can provide. “If you want to move to the hybrid cloud, a data lakehouse is a great way of driving efficiencies. If you want more agility and flexibility, think about a data mesh to radically decentralize your data organization,” Stoop said.
Do data fabrics, data meshes and data lakehouses work together?
Not all organizations need to pursue all three concepts, and certainly not all at the same time. But the three can be complementary and add value to each other. “The consistent security and governance that you build up as part of deploying a data fabric is one of the key principles of a data mesh. And it is something that a data lakehouse benefits from,” Stoop said. “So you can get started with one concept while moving toward the others. There is direct synergy to be had, especially when deployed from a single platform that can be used as all three of these architectures.”
How can these approaches address our business challenges?
By consistently and rapidly turning data into action. When applied, these approaches can improve workflows that are prone to slowdowns and delays, or that produce untrustworthy results because of stale insights. “The changing circumstances, both in economic terms and in the social landscape, require organizations to get ever faster at turning data into value and insight,” Stoop said. “Look for processes which are too centralized or rigid, that don’t support agile practices or require you to reinvent the wheel every time, and seek out the solutions that help turn data into insight faster.”