Data Mesh

Imagine a world without all the complicated data pipelines, without transformations, and without moving data elsewhere, yet still getting value from data. This is what Data Mesh is about, though there is, of course, more to it.

Data Mesh is not something you can buy. It’s a set of principles that we follow.

Data Mesh represents a decentralized approach to accessing data at scale and draws heavily from Domain-Driven Design.

It’s about deriving value directly from the source, rather than pushing data through a chain of processes before anyone can get value from it (with the added benefit of linking the models of different domains).

This domain orientation is somewhat like microservices in system architecture, but applied to data architecture.

Many current data architectures are built around ETL processes, with fragile pipelines running inside the data lake.

Domains currently generate data for their own use, each with its own business processes, and don’t necessarily consider how analysts will extract value from that data later.

Data Lake teams are responsible for every part of the process: importing, cleaning, and transforming data into other representations, running the pipelines in the data lake, and shaping how the data appears in analytics and reporting.

These teams often become bottlenecks in the functioning of the data lake. We can’t just expand them indefinitely because it simply doesn’t work.

Do we even know what we’re discussing in daily meetings with so many topics and developers involved? OK, let’s split the team. But then… we have an even more complex structure that needs to communicate with itself.

Data Mesh involves both organizational transition and technology.

There is no specific technology that must be used. Some technology is still necessary, though: whatever fulfills our needs.

You could say you're implementing Data Mesh (or, more accurately, applying the principles of Data Mesh) provided you have the appropriate technology and use it correctly.

We don't solve data management issues merely by purchasing a technical solution. It's more about how we use it, how people and processes interact with the data, as this is how we can address data quality and ownership problems.

Following Data Mesh principles removes the bottleneck of a single highly specialized team, because responsibility for the data moves to the domains that produce it.

Domain owners must take real ownership of the data and ensure it is usable for others who need it.

Is data from the entire domain accessible to everyone?

Or are only certain parts of it accessible (to a limited number of users)?

The technology should allow for unified (connectable) models with other domain owners. A model may connect only some of the data products from within the domains.

Data Mesh is not for everyone.

It addresses problems faced by larger organizations and the bottlenecks associated with centralized data.

Data Mesh is generally not intended for organizations with few domains or little data; it is designed for complex environments.

If you have a well-functioning data lake with just a few products and a relatively small amount of data, you might be perfectly fine without adopting Data Mesh principles.

As complexity grows, with more data sources and larger scale, the potential of Data Mesh increases.

Data Mesh is not implemented all at once.

Start small and improve. Build iteratively. Be use-case driven.

Create Data Products. They should be relatable and linkable to each other. Later, you can hand them over to domain owners. Even if they are not currently discoverable and accessible by other domains, they might be in the future once the platform and other domains are ready for it.

Avoid large, sudden changes. Prevent user mistrust ("I will continue using the current solution because it works, and the new one is too complicated").

There is no magic button for "applying Data Mesh" — click and it’s done. This is due to technological challenges as well as the challenge of changing company culture.

The principles of Data Mesh complement each other.

The first and most important principle is Domain Ownership.

The other principles are essentially implications of this first one. Domains alone do not facilitate the registration of data products, access to them, or defining access policies. This is why we have the following three principles:

Data as a Product

When creating domains, we must somehow make their data accessible to the outside world, treating it as a product that others can use.

Products can be registered on the platform, discovered later, and used, potentially combined with other products to extract meaningful data.

… and these products must be registered and maintained on:

Self-Serve Platform

Which is essentially the technical execution of Data Mesh.
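
To make the register, discover, and combine steps more concrete, here is a minimal Python sketch of the catalogue part of such a platform. Everything in it is an assumption made for illustration: the DataProduct and ProductRegistry names, the tag-based discovery, and the sample "sales" and "crm" domains are hypothetical and not part of any particular Data Mesh product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for a data product owned by a domain."""
    name: str
    domain: str
    owner: str
    columns: tuple[str, ...]
    tags: frozenset[str] = frozenset()


class ProductRegistry:
    """In-memory stand-in for the catalogue of a self-serve platform."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        # Products are addressed as "<domain>.<name>" in this sketch.
        key = f"{product.domain}.{product.name}"
        if key in self._products:
            raise ValueError(f"{key} is already registered")
        self._products[key] = product

    def discover(self, tag: str) -> list[str]:
        """Return the qualified names of products carrying the given tag."""
        return sorted(k for k, p in self._products.items() if tag in p.tags)

    def linkable_columns(self, left: str, right: str) -> set[str]:
        """Columns shared by two products, i.e. candidate join keys."""
        return set(self._products[left].columns) & set(self._products[right].columns)


registry = ProductRegistry()
registry.register(DataProduct("orders", "sales", "sales-team",
                              ("order_id", "customer_id", "amount"),
                              frozenset({"customer"})))
registry.register(DataProduct("profiles", "crm", "crm-team",
                              ("customer_id", "segment"),
                              frozenset({"customer"})))

customer_products = registry.discover("customer")
join_keys = registry.linkable_columns("sales.orders", "crm.profiles")
```

A real platform would persist this metadata and add schemas, documentation, and quality indicators, but the shape stays the same: each domain registers its own products, and consumers find and relate them without going through a central team.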

Federated Computational Governance

Federated means that we have a federated group of experts from each domain who discuss and decide together on: how data is accessible, what exactly is accessible, and for whom, for which Data Mesh users/groups.
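Computational means the policies are expressed as code and enforced automatically by the platform, so they apply consistently whenever results are computed across different domain models.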
Governance means how the policies are defined: for example, whether certain columns must be masked because some data should not leave the domain scope, or whether data must stay within a certain region.
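
The kinds of decisions listed above (who may access a product, which columns are masked, which regions the data may reach) can be written down as code so the platform applies them the same way every time. The following Python sketch shows one possible shape. The AccessPolicy class, the apply_policy function, and the sample groups and regions are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical policy agreed on by the federated governance group."""
    allowed_groups: frozenset[str]
    masked_columns: frozenset[str]
    allowed_regions: frozenset[str]


def apply_policy(rows, policy, user_group, user_region):
    """Return the rows a user may see, with masked columns hidden."""
    if user_group not in policy.allowed_groups:
        raise PermissionError(f"group {user_group!r} may not read this product")
    if user_region not in policy.allowed_regions:
        raise PermissionError(f"data may not be served to region {user_region!r}")
    # Replace the values of masked columns; leave everything else untouched.
    return [
        {col: ("***" if col in policy.masked_columns else value)
         for col, value in row.items()}
        for row in rows
    ]


policy = AccessPolicy(
    allowed_groups=frozenset({"analysts"}),
    masked_columns=frozenset({"email"}),
    allowed_regions=frozenset({"eu"}),
)
rows = [{"customer_id": 1, "email": "a@example.com", "segment": "gold"}]
masked = apply_policy(rows, policy, "analysts", "eu")
```

In practice such rules would usually live in the query engine or an access-control layer rather than in application code; the sketch only illustrates that a policy is data plus an automatic check, not a manual review step.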

The biggest technical challenge is to create a truly self-service platform.

A self-service platform should not only allow for registering and finding products but also enable linking models (products) across domains. Such model connections can be realized through distributed query engines (like Trino or Spark SQL).
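
As a rough illustration of linking products across domains with a single query, the Python sketch below uses SQLite as a local stand-in for a distributed query engine. This is an assumption made so the example runs anywhere: it is not Trino or Spark SQL, and the "sales" and "crm" domains and their tables are hypothetical. Each attached database plays the role of one domain's storage, and the join addresses both by a qualified name, much as Trino addresses tables as catalog.schema.table.

```python
import sqlite3

# One connection plays the role of the distributed query engine.
engine = sqlite3.connect(":memory:")

# Each attached database stands in for a separate domain's storage.
engine.execute("ATTACH DATABASE ':memory:' AS sales")
engine.execute("ATTACH DATABASE ':memory:' AS crm")

# Each domain publishes its own product (table) in its own storage.
engine.execute(
    "CREATE TABLE sales.orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
)
engine.execute("CREATE TABLE crm.profiles (customer_id INTEGER, segment TEXT)")
engine.executemany(
    "INSERT INTO sales.orders VALUES (?, ?, ?)",
    [(1, 10, 50.0), (2, 10, 25.0), (3, 11, 40.0)],
)
engine.executemany(
    "INSERT INTO crm.profiles VALUES (?, ?)",
    [(10, "gold"), (11, "silver")],
)

# A consumer joins the two products in place; no data is copied elsewhere.
revenue_by_segment = engine.execute(
    """
    SELECT p.segment, SUM(o.amount) AS revenue
    FROM sales.orders AS o
    JOIN crm.profiles AS p ON p.customer_id = o.customer_id
    GROUP BY p.segment
    ORDER BY p.segment
    """
).fetchall()
```

The point of the sketch is the query, not the engine: the consumer joins on a shared key (customer_id) while the data stays where each domain keeps it.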

When implementing a platform, several questions and considerations arise:

How can we justify the increased cost of creating these domains?
Do we need new databases?

Perhaps not; it is still the same object storage we have now (we simply connect directly to the source).

Do we need to introduce new roles in the organization? What about the cost of that?

Not necessarily new roles; the right person needs to be empowered to use the self-service platform (it's about taking responsibility for managing the domain).

Does it require significant investment in development?

This may be true, but we are mainly using the same technologies as before, just in a different manner. We should also consider the cost of missed opportunities (if we do not adopt a Data Mesh architecture).

Production data directly accessible to others? That won’t work…

It doesn’t always have to be that way. Domains can expose the most relevant copy of the data instead, but the copying should still happen within the domain itself; similarly, if pipelines are needed, they remain within the domain.

Conclusion

For me, the key takeaway is embracing the cultural transformation that needs to occur. Domain owners need to be responsible for their data and ensure it is of good quality, discoverable, and accessible.

You can have a great tool, but without transformation in the company and a shift in mindsets, it will be practically useless.

Also check out this webinar:
https://www.thoughtworks.com/about-us/events/webinars/core-principles-of-data-mesh/data-mesh-and-domain-ownership

Innovation at Scale

Saturday, August 24, 2024

Data Mesh - key features