In the admittedly small world of people who obsess over data technologies, one of the hottest topics of the last year has been the “data mesh”.
Created by Zhamak Dehghani of Thoughtworks, the concept struck a chord and made the rounds in countless conversations on Twitter and elsewhere.
As I highlighted in the 2021 MAD Landscape, the data mesh concept is both a technological and organizational idea. A standard approach to building data infrastructure and teams so far has been centralization: one big platform, managed by one data team, that serves the needs of business users. This has advantages, but also can create a number of issues (bottlenecks, etc). The general concept of the data mesh is decentralization – create independent data teams that are responsible for their own domain and provide data “as a product” to others within the organization. Conceptually, this is not entirely different from the concept of micro-services that has become familiar in software engineering, but applied to the data domain.
It was a real treat to get to chat with Zhamak at our most recent Data Driven NYC.
Below is the video and below that, the transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)
VIDEO:
FULL TRANSCRIPT
[Matt Turck] Welcome Zhamak, we are very happy to have you. Have been looking forward to this conversation. I’ll start with a quick intro to get it out of the way. You work with Thoughtworks as the Director of Emerging Technologies in North America with a focus on distributed systems and data architecture. You have a deep passion for decentralized technology solutions, and you have founded the concept of data mesh, which is what we’re going to talk about today. You are a member of the Thoughtworks Technology Advisory Board and contributed to the creation of Thoughtworks Technology Radar. You have worked as a software engineer and an architect for over 20 years, and have contributed to multiple patents in distributed computing communications, as well as embedded device technologies. Welcome.
[Zhamak Dehghani] Thank you for having me.
[1:01] You caused quite a stir with this data mesh concept within the admittedly small circle of data geeks. Including people on Twitter that have been tweeting back and forth, about what does that all mean? What are the consequences? What does it mean for my company? What does it mean for my customers? So I am excited to dive into all of this and maybe let’s start at the highest possible level. What is the problem you are trying to solve with this?
[1:34] That’s a good question. I think the angle to take in terms of the problem is this: data mesh tries to enable organizations to get value from their data, with their analytics and ML solutions, under a completely new world order, which is where we are right now. What do I mean by a new world order? I mean an organization where complexity is the norm: complexity in terms of where the data can get generated and how the data can get used. Organizational complexity where growth is constant, change is constant, major acquisitions happen, and you’re constantly dealing with new types of data, new sources of data.
[2:21] The new world order in terms of new aspirations is how we want to use the data. We moved away from ‘I’m going to set up a warehouse and get a few reports and get insight into the operation of my organization’, to ‘Actually, I want to include ML, a data-driven way of solving problems, into every feature of my application.’ Just look at Zoom or Spotify: any application that we use daily has ML embedded into it. So under this new world order, where complexity is the norm, where constant change is the norm, where growth is the norm, data mesh tries to solve the problem of getting value while embracing this complexity.
[3:06] There’s no lack of tools and platforms and solutions to try and tackle these complexities. So what is the industry currently doing wrong, that needs to be improved?
[3:19] Yeah, you’re right. When I used one of your publications where you put the landscape of all this data, I feel kind of dizzy when I look at it. It’s almost unreadable. So I do agree that, what we have done right up to now is a lot of innovation in the bottom layer of the stack. If you think about our technology stacks, where the data driven organizations and consumers sit at the top and the machines sit at the bottom. We’ve been building a lot of technologies, kind of try to solve really hard, low level problems – the problems of data processing at scale or data storage at scale and distributed computing and storage at the bottom layer. And that’s great. What we’ve got wrong is a set of technologies that scale out nicely with the growth of organizations. So what we haven’t got right so far, what we’ve built I suppose, has led to… I don’t know if actually what comes first, whether the organization comes first or technology comes first, but what we’ve built is suitable for organizations that are functionally divided.
[4:32] You run your business on one side and then you deal with the data on the other side. So you put a wall between what’s data driven and what’s not data driven. So I think that functional separation, what we’ve built has led or has embraced centralization both from put the data on my platform so you get value from it or put the data in the warehouse and have some sort of mechanical model around it to get value from it, or put it on the lake. So there’s some sort of a centralization, both from the locality as well as kind of the organizational structure that deals with the data and maintains it. So all of those points of synchronization, centralization of the data, centralization of the organization, functional division between the data and non-data. Those are the things that have led to a kind of a system that is fragile to scale and change at the macro level, not at the bits and bytes level. Right? We’ve got a system that’s fairly fragile to change in scale and that’s what we’ve got to change.
[5:35] So what is it, what is the data mesh? I’m super excited to hear it from the horse’s mouth so to speak, because again on Twitter, it’s like, What does it mean? And like somebody says something, somebody else says something else. I’ve loved the conversation, but I’m very excited to hear it from you.
[5:55] I think it’s like an onion we’ve got to kind of peel the layers. But when I started talking about it, I wanted to be very cognizant of the… I guess, maturity of the idea, right? So I started with a set of principles and I think I’ve talked about those principles in many forums and we can go over them very quickly. So I guess at a high level, it is a kind of a socio-technical approach in how we share and manage data for analytical use cases. And socio-technical because, we cannot just talk about technology without talking about how people who will use the technology are subjected to the technology. So it’s both an organizational structure, operational model within the organization, as well as the technology in order that we can get value from data at scale for analytical use cases. So that’s kind of a tagline.
[6:44] And then if we unlayer the onion a little bit. The most fundamental underpinning principle around it is this idea of, you can get to data, get value from it, connect it with other kinds of data sources, no matter where the data lives and no matter who owns it. So it’s counter to bring the data to one place on the one team on the one model, to get value from it. So underpinning that is this idea of decentralization of ownership and control around the axis of business domains. That’s the axis within which organization can grow. And that’s an axis around which we have actually aligned that technology, operational services and business. So let’s extend that like with microservices we did that, right? 10 years ago. Let’s extend that idea with the fact that every business unit that’s aligned technology kind of organization and technology architecture, we can also align the data and data sharing. At the core it’s that, basically. That’s the whole idea.
[7:46] But when you actually peel it a little bit further, you go, holy crap. That can cause a ton of problems. If I just did that, if I end up with this silo of databases all over the place, that’s where we are today, right? That’s why we try to kind of put everything in one nice, beautifully designed place. So then it follows with a few other principles in terms of how to address that concern of disconnectivity, lack of interoperability. And it introduces this concept of data as a product, which is a foundational unit of data sharing. And it’s very different from what we imagine what data is. And we can talk about that in a bit. It comes with the idea of a new kind of platform, and the next generation type of platform that really enables cross-functional teams, to manage and share and consume data, rather than data specialists we’ll move really towards the generalist satisfaction and a new way of thinking about governance, so we’re not compromising privacy or security. All of those higher objectives that the governance has, but we are realistic in how do we implement it in a very decentralized fashion.
[8:51] If you peel those principles further, I think that’s where I lose people like up to this point, people are kind of like, I get it, I like it, I want to make friends with you and, we can be old friends and talk about the same thing, but then once you actually go… In fact, how can you have decentralization? Have your data copied around and transformed if you need to, and yet be able to run this distributed analytical workloads… What does it actually take to do that? Then we get to the discussion of those pieces of technology that needs to enable. And that’s how… How do you say… You lose friends and make enemies. That’s where I guess a little bit of confusion or lack of agreement will happen because we have to make some compromises, I suppose, to get all of these other wonderful things that we have. And that’s a discussion that I don’t think, we have really had so much. I’m reflecting on it in the book that’s coming out later this year, but I don’t think I’ve really talked about it.
[6:56] That’s super interesting and makes a lot of sense. Indeed, I’d love to get into the next level of detail, at the risk of possibly losing friends. To understand how that manifests exactly, right? Because decentralized architecture and teams, like everybody sort of understands and that indeed has been the trend. You have a data science team or data analyst team, and they have a data warehouse, and they have tools to put the data into the data warehouse. And then on the other side, they have tools to do analytics, and a chief data officer. People understand that as sort of the common thing, but what you’re suggesting is radical decentralization of all of this. And I’m very curious about how that manifests, like data as a product. Is that like a unit of data, people and product and technologies for that product, and then it’s going to be like the next one and how do they communicate? You alluded to like a lot of this, but I’d love to get into the details.
[11:00] Sure. And it is a long conversation, so I don’t think there is one right place to start it from. But if we start the conversation around what the affordances or capabilities are that we really want to provide at the end of the day, no matter if it’s centralized or decentralized, let’s look at those. What are those affordances? Then let’s go deeper and say, how do we provide those affordances with this idea of decentralization? Let’s imagine the experience from a data scientist’s perspective, or a data consumer’s perspective. The first thing that they do is form a hypothesis. They have a hypothesis: ‘Can I make, I don’t know, recommendations around, kind of, playlists and music, based on music profiling? And can I do music profiling based on the mention of the music in various blog posts and so on, if I’m in a streaming business?’
[11:52] Starting from that hypothesis, what I need to be able to do is, I need to be able to discover the data that I’m looking for, right? No matter where that data actually physically lives. So, we need to have ways of… ability to discover the data, and hence, it needs to have been registered with some way. Address it, discover it, connect to it, get to the approach of that data, or the actual dataset itself, no matter who owns it and where it is.
[12:22] Whether I’m a data scientist or data analyst, I need to be able to use my native tools, what’s native to me, to process that data. I think this, kind of, funny war between warehouse and lake and what model we should access the data, that seems to me a little bit irrelevant, because both of those access models are very acceptable.
[12:43] We need a new construct that provides data, in multiple modes, for a particular, let’s say, social profiling of your music. You can access it with native tools or native access model for data scientists as columnar files, or you can access the same data, writing some sort of a generic query like SQL or different types of queries. So then… So, that’s the next step.
[13:08] Just for these steps. So, data discovery and access. Is there still a concept of somewhere there’s a catalog? Or is that too much centralization if you say you have a catalog that knows where all the data is distributed?
[13:25] Yeah. So, really good question. So there is a concept of a discovery portal of some sort, right? I need to be able to see, browse, search, right? Look for the data that I need. But, when it comes to the next step of implementation of these affordances, if we think about data as this, kind of, piece of information without agency, without any computational ability, which is how we have thought about data so far, the design we come up with is that there is a central catalog that will go and look for data in different places, and add some metadata to it based on who accessed it and who used it, and it will create a catalog and it will constantly sync by searching this mess of a landscape that is. While we need a discovery portal of some sort to search and browse, what data mesh introduces is this concept of a data quantum: actually rethinking data as a unit that not only constitutes the data, but also constitutes all these computational affordances that give that data agency and intention.
[14:39] What do I mean by that? So let’s follow with that discovery example. As a data product developer, I’m creating this new logical construct, I am, in fact, intentionally providing discoverability abilities in that unit. So, I am providing a set of APIs that any search or browse utility can hook up to and get discoverability information about me. And as that data product, I intentionally, with agency, provide that information. I’m not this dumb bits and bytes sitting on this, I’m lively, I have a computation going on. So, then I will have APIs that provide, okay, ‘What sort of guarantees am I intentionally providing? What sort of model of the data am I exposing? What are other guarantees, in terms of the timeliness, quality, completeness, all of those things?’ And I’m constantly computing these as I’m generating new data.
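To make the idea of a data product that describes itself a little more concrete, here is a minimal Python sketch. It is only an illustration of the discoverability idea discussed above, not an implementation from Thoughtworks or from the talk. The class name `MusicMentionsProduct`, the method `discovery_info`, the field names and the metrics chosen are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class MusicMentionsProduct:
    """Hypothetical data product that answers discovery questions about itself."""

    records: list = field(default_factory=list)
    last_updated: Optional[datetime] = None

    def append(self, record: dict, now: datetime) -> None:
        # New data arrives; the product remembers when it last changed.
        self.records.append(record)
        self.last_updated = now

    def discovery_info(self, now: datetime) -> dict:
        # Metrics are computed from the data itself on every call, so a
        # search or browse tool never has to crawl the underlying storage.
        total = len(self.records)
        complete = sum(
            1 for r in self.records if all(v is not None for v in r.values())
        )
        freshness = (
            (now - self.last_updated).total_seconds() if self.last_updated else None
        )
        return {
            "name": "music-mentions",  # assumed product name
            "schema": {"track_id": "string", "mention_count": "int"},
            "record_count": total,
            "completeness": complete / total if total else None,
            "freshness_seconds": freshness,
        }


t0 = datetime(2021, 10, 1, 12, 0, tzinfo=timezone.utc)
product = MusicMentionsProduct()
product.append({"track_id": "t1", "mention_count": 3}, now=t0)
product.append({"track_id": "t2", "mention_count": None}, now=t0)
info = product.discovery_info(now=t0 + timedelta(seconds=30))
print(info["completeness"], info["freshness_seconds"])
```

A catalog or discovery portal could call something like `discovery_info` on every product it knows about, instead of scanning storage and inferring metadata after the fact.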
[15:40] So it’s sort of part of my contract, right? I have my own data product, but the trade-off is, I need to make it discoverable and anybody can access it. Is there a concept of almost like SLA of, ‘Hey, I’m in charge of my data product. It needs to be discoverable or it needs to be up, needs to be clean, it needs to be usable.’?
[16:04] Yeah, absolutely. So, this data product… I use this idea of a data quantum as a logical counterpart for it, this thing that you actually build, because ‘data product’ can be applied to many different… But on the mesh, this logical, kind of, unit, absolutely, it will have interfaces, APIs, to provide your SLAs. Also, what that logical unit does is consume data from somewhere else and perform some operation on it, whether that operation is NLP, discovering what the information was actually saying about this track, and doing music profiling. So it does some sort of operation or transformation and then it provides output. And for each of those outputs, it provides ancillary information to make that output actually understandable and then accessible in multiple modes.
[16:54] But, in addition to that, this, kind of, data quantum, for it to be complete, structurally complete, because we’re thinking about this decentralized model where we have structurally complete units that can provide value on their own, there is another piece to it: not just the data that it gets and the data it provides and the SLOs and discoverability APIs, but also the policies that govern it.
[17:19] Now, we’re bringing together policy that governs this data, the data itself, the transformation and code that keeps this data actually alive. I mean, I can’t really think of a data that is dormant. Any piece of data almost is continuously changing, and the code that keeps it alive and the policies that govern it. So, that thing, if we put it up, we usually put a hexagon around it. It’s the idea that is really foundational to data mesh and it’s, kind of, nonexistent.
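As a rough illustration of what "structurally complete" could mean, the following Python sketch bundles an input, the transformation code and a governing policy into one unit. This is an assumption-laden toy, not the design used by Zhamak's team: `DataQuantum`, `Policy`, the role names and the mention-counting transform are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Policy:
    """Hypothetical access policy that travels with the data."""

    allowed_roles: frozenset

    def permits(self, role: str) -> bool:
        return role in self.allowed_roles


@dataclass
class DataQuantum:
    """One unit holding the input port, the transformation code and the policy."""

    input_port: Callable[[], Iterable[dict]]
    transform: Callable[[Iterable[dict]], list]
    policy: Policy

    def read(self, role: str) -> list:
        # The policy is enforced by the unit itself, not by a central platform.
        if not self.policy.permits(role):
            raise PermissionError(f"role {role!r} may not read this data product")
        return self.transform(self.input_port())


def blog_posts() -> list:
    # Stand-in for an upstream source such as an event stream or an API.
    return [{"track": "a"}, {"track": "b"}, {"track": "a"}]


def profile(posts: Iterable[dict]) -> list:
    # Toy "music profiling": count how often each track is mentioned.
    counts: dict = {}
    for post in posts:
        counts[post["track"]] = counts.get(post["track"], 0) + 1
    return [{"track": t, "mentions": n} for t, n in sorted(counts.items())]


quantum = DataQuantum(blog_posts, profile, Policy(frozenset({"analyst"})))
print(quantum.read("analyst"))
```

Reading with a role outside `allowed_roles` raises `PermissionError`, which is the point of keeping the policy inside the unit: the data cannot be reached without passing through it.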
[17:49] Great. So, to help continue bringing this home, what does an implementation of it look like, sort of, practically?
[18:01] I have to be completely honest with you, at this point in time, it looks like a Frankenstein creation because we have to stitch together a lot of technologies that exist, and they weren’t designed for this model of being reconfigured in this way.
[18:19] So, the foundational technologies that I have used so far are more or less the same. For example, for your input, what I call the input port is where the data that gets transformed is actually coming from. You still have your ingestion mechanisms that exist today, whether you are hooking up to some upstream event stream, or you are hooking up to some API to get the data, or you’re doing CDC against some sort of a legacy system. So, those ingestion mechanisms remain the same.
[18:50] Your transformation code that is encapsulated by this data product and does the transformation… Again, those are flow-based programming models that you have. So, a lot of people still use Spark or Beam or whatever. If you have a very simplistic transformation, you’re running just a federated query. And the output of that query is simply your transformation. So, you have your usual suspects around transformation and code work orchestration.
[19:18] And then, on the output side, you are providing, kind of, slightly higher-level APIs, whether it’s REST or GraphQL, that really redirect you underneath to a polyglot… I guess, a storage of a kind. Either it redirects you to a lake storage, an object storage, or redirects you to a table on the warehouse, or whatever storage is meaningful for that data product.
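The output side described here can be pictured as a thin lookup that sends each consumer to the storage that suits their access mode. The sketch below is a guess at the shape of such a port, written in Python for illustration only. `OutputPort`, the mode names `files` and `sql`, the bucket path and the table name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputPort:
    """Hypothetical output port that redirects a consumer to polyglot storage."""

    product: str
    locations: dict  # access mode -> storage location

    def resolve(self, mode: str) -> str:
        # Return where the same data can be read in the requested mode.
        try:
            return self.locations[mode]
        except KeyError:
            raise ValueError(
                f"{self.product} does not serve mode {mode!r}; "
                f"available: {sorted(self.locations)}"
            ) from None


port = OutputPort(
    product="music-profiles",
    locations={
        # Columnar files for data scientists (made-up bucket path).
        "files": "s3://example-lake/music-profiles/",
        # A warehouse table for SQL users (made-up table name).
        "sql": "warehouse.analytics.music_profiles",
    },
)
print(port.resolve("files"))
print(port.resolve("sql"))
```

In a real deployment this lookup would sit behind the REST or GraphQL API mentioned above. The dictionary here only stands in for that redirection.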
[19:45] All the discoverability stuff is additional code that you write. I mean, these are libraries that you develop as part of your project to standardize. ‘Okay, what metrics do I want to expose?’ And you will have code, just handcrafted code, for exposing those extra metrics. And if you want to, on top of it, at the mesh level, you can, kind of, plug in discoverability and a catalog. With some effort, maybe you’ll be able to use existing ones, but maybe you end up writing a simple catalog on top of it.
[20:18] So, when you put all of this together, it actually looks pretty ugly because we are stitching technologies together in a new way that weren’t designed to be stitched together this way, and you run into limitations very quickly. Because nobody assumed that you will be sharding your storage accounts, you will be sharding your computation based on these little data products and having hundreds of these, I don’t know, lake storages. They assume maybe you have 1, 2, 10, 200, 128, something along those lines. So, you run into limitations. So, I’m hoping that we can move the technology needle forward as well.
[21:04] So to this point, can you build a data mesh with the existing tools? You mentioned some of the existing things like Apache Beam, Flow, all those things…. To be successful at drawing out a data mesh, does it mean that new tools need to be created? New standards? Or can you make do with what we currently have?
[21:26] Yeah. I think we have no choice but to start with… If you were starting today, we started three years ago, we used a ton of stuff that already exists, but we also built a ton of stuff. I mean, I don’t like to go to every plant and say, look, this is great. You can use the technology that we already have, but you have to commit to this two, three year, kind of, program of building out capabilities in your platform that you just simply can’t buy.
[21:56] We still have to get the technology to fill the gaps and where those gaps are to me, a lot of it is around interoperability. We have that wonderful diagram of tools that you have in your landscape. But if you zoom in, there are very few tools that actually interoperate and play nicely with the rest of the tools, and the standards that it gets to be created around, kind of, the expression of storage agnostic modeling of the data.
[22:30] So, I talk about, in this data quantum, how the time axis has to be an ever-present parameter in the data sharing, because the only way we can have our cake of distribution and copying data as you wish…. This is going really bad, the analogy of cake. But what I’m trying to say is that the only way we can distribute data and yet have global consistency of the data, no matter how the mesh is transforming the data, is to build in some really basic constructs like temporality and immutability. These constructs just don’t exist yet.
[23:12] So, have some sort of a temporal representation of data agnostic to serialization modeling. So, there’s the standards around data sharing, standards around policy configuration, standards around access control. So, access control, right now, is very proprietary in data world to the platform you’re stuck in, compared to the API world where you have some sort of a standard. So, some of those standard pieces need to come.
[23:41] Maybe a last question from me because I want to then turn… We have actually a lot of questions in the chat. So, just, if you fast forward 10 years and everything works out as planned, what does that look like, in terms of the life of developers and data scientists and how does that get changed?
[24:01] Well I’ll have a smile on my face. We’ll have a lot of fun to do the data works. That’s the first one. I’m looking at a bit of a crystal ball and what I have seen has happened in the past. So, hopefully it makes sense to the audience, but I think one of the big changes would be, we’ll move from this specialized and specialization to generalization. So, some of the things that we consider specialization today, like data engineering. A large portion of what we call data science becomes basic engineering. So, I think that has to happen, otherwise we will never scale to meet the aspirations that we have. So, move from specialization to generalization is one.
[24:44] I think we will rethink… if data mesh happened, when we talk about data, we imagine something very different. We imagine this, kind of, lively, ever-changing thing that has an agency to govern its policies and to keep the data alive and provide APIs. We don’t think about it as a byproduct we dump somewhere and we build technology on top of it to get access to it. So, I think… I hope that our imagination around what constitutes data changes.
[25:17] And we really, truly become data driven, in a way that we can get access to data safely and securely, no matter where that data is and who owns that data. As long as, of course, we have the permission to access that data. So, no constant moving of data from one platform to another platform needs to happen. Platforms have opened up.
[25:41] I think, finally there is a conversation that we haven’t had, and I haven’t talked much about it, is the sovereignty, right? The sovereignty of the data needs to be something that, as a policy, we need to build into the data to really give the control back to the real owners of the data, right? You and me and everybody else on this call. So, that gets baked into that data quantum as a policy.
[26:07] Great. All right. That’s fascinating. I have a bunch of questions. Can we try to do a few of those as rapid fire? I don’t know how doable that is because some of those questions are very good and maybe hard to answer quickly, but let’s try. All right. So a question from Makiel ‘Is there a recommended migration path from an enterprise data lake to a data mesh?’
[26:41] Yeah. I think you start going backward from your consumers of the data lake. So, look at, ‘Who’s accessing it? Why are they accessing it? What data do they need?’ Work backward and go back to the source. So, go back to the source of the data, to those domains where they are generating the data, and incentivize them, so that they are actually incentivized to share that data. Use the lake technology still as an underpinning storage technology, but remove it as an intermediary place to dump the data, and get the consumers to directly consume the data from… and, when I say the source, I don’t mean the application database. I mean the data quantums, the data products that you provide, in addition or adjacent to… as close to the source as possible. And if there are some intermediary, kind of, aggregate data products that you have to build to do that.
[27:31] So, work backwards from the consumer and discover the data products you need to build and incrementally build those data products, owned by people that are most suitable to have the long-term ownership of that. And then they’re part of the domains that actually that data comes from.
[27:48] Question from Jacob around immutable data. So, ‘Could you elaborate on how immutable data would be used for joining, if say a consumer wants to join with a current state of entities?’ I’m reading the question. ‘Are there patterns or resources that already exist that explain best practices dealing with immutable data?’
[28:10] Yeah, I think this is a big conversation to have. The data can only be immutable if we build two timestamps into every single, kind of, representation of the data. Those two timestamps are when something actually happened and when we processed that information, our understanding of that data.
[28:28] Those three pieces, the actual data (the event or state), the time that it happened and the time that we processed it, that piece is immutable. And you can see any of those parameters can change. For example, as our understanding of the… Let’s say Matt and I talk to each other, we have this conversation and this… And you view it tomorrow. So, your… Let’s say your system is processing this video tomorrow. And let’s say there was a mistake in that processing, and you have to reprocess that video, maybe the transcript… you were processing the transcript, and you have to reprocess the transcripts because there was a mistake.
[29:04] So, then the new piece of data would be this video, the new transcript that was happening… happened at the same time today, but it was processed tomorrow, right? The next time. So, that’s what immutability really means in data mesh. And you can always arrive at a state at a point in time or look at the differences between two points of time. If you want to join, you can always join in across a point in time, across all of those data products. But, as long as those timestamps are baked in, you can always… do join as you wish. You can always say, just give me the join of my latest, but it has to be built into every data product. And I think this is the one I have to go to a little bit of war, not war, but a bit of a, I guess, conversation around how to really build it because there are not many technologies that think about data that way and that has… in my mind, has led into a ton of accidental complexity that we deal with constantly updating the data once we’ve processed it.
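The two-timestamp idea in this answer resembles what is often called bitemporal modeling. The Python sketch below illustrates it under simplifying assumptions: records are never updated, corrections are appended with a later processing time, and a join is done "as of" a chosen processing time. The names `Record`, `as_of` and `join_as_of`, along with the sample dates and values, are hypothetical and are not taken from any data mesh tooling.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    """Immutable fact: the data plus the two timestamps."""

    key: str
    value: str
    event_time: datetime       # when it actually happened
    processing_time: datetime  # when we processed it


def as_of(log: list, processed_by: datetime) -> dict:
    """Latest known record per key, using only what was processed by then."""
    state: dict = {}
    for rec in sorted(log, key=lambda r: r.processing_time):
        if rec.processing_time <= processed_by:
            state[rec.key] = rec
    return state


def join_as_of(left: list, right: list, processed_by: datetime) -> dict:
    """Join two append-only logs on key at one shared point in time."""
    l, r = as_of(left, processed_by), as_of(right, processed_by)
    return {k: (l[k].value, r[k].value) for k in l.keys() & r.keys()}


day1, day2, day3 = (datetime(2021, 10, d) for d in (1, 2, 3))

transcripts = [
    Record("video-1", "draft transcript", day1, day2),
    # Correction: same event time, later processing time, nothing overwritten.
    Record("video-1", "corrected transcript", day1, day3),
]
videos = [Record("video-1", "Data Driven NYC talk", day1, day2)]

print(join_as_of(transcripts, videos, processed_by=day2))
print(join_as_of(transcripts, videos, processed_by=day3))
```

Asking for the join as of the latest processing time gives the "join of my latest" mentioned above, while an earlier time reproduces exactly what a consumer would have seen then.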
[30:11] Great. All right. There’s a bunch of very good and very thoughtful questions, but we’re sort of running out of time. Maybe I’ll just ask the last one because I guess it opens the door to ways for people to learn more. So, a question from David: has Thoughtworks or others set out a program of work that lays out what needs to be done, to have a reference implementation of a data platform, or an example data product? So, I guess, how do people take the next step in terms of learning and setting this up?
[30:48] Yeah. So, in terms of learning, I mean, we’re doing our best to generate as much content as possible. So, my book is a piece of that.
[30:58] When is the book coming out by the way?
[31:02] So, hopefully end of this year, the digital version, if we just work very hard ‘til December, but worst case, it will be early next year, but the print will be early next year.
[31:14] We’re also trying to, kind of, extract some of the best practices of what we’ve learned at this point in time. Again, this is a fast evolving space. So, the reference implementation that we’re hoping to put out, again, sometime in the next few months is going to be using the existing technology and tools to bring some of these paradigms to life. And I hope that it will be out of date the moment it’s out, because I’d really like to see the technology moving faster so that we can use new ways of doing that reference implementation. So, we will publish that open source. We’ll have an open source reference implementation. We are working on it internally.
[31:55] Great. All right, well, this was fantastic and I’m just realizing we should probably have done this as an hour long session because it feels like we’re, in some ways, just scratching the surface and we have a lot of good questions. We’ll save those questions. I don’t know if there’s a way to answer them offline, but we can figure that out.
[32:16] In any case, I just wanted to say a big thank you. This is super interesting and feels like a glimpse into the future for our world of data geeks and data infrastructure engineers, analysts, all those folks. So, thanks for pushing the thinking and very excited to see how that develops over the next months and years.
[32:42] Thank you so much for having me. Thank you for those wonderful questions so we can, hopefully, find another venue to answer them.