As the volume of data in the enterprise continues to explode, with ever larger amounts stored in data warehouses and data lakes, the problem of data discovery has become an increasingly painful one. How do data analysts, data scientists and business people find not just data, but the right data for the problem they need to solve? How do they know how it was produced, how recently it was updated and whether it’s the right dataset to use? In addition, from an organization’s perspective, there’s a question of data governance – how to manage access in a way that preserves data security and privacy, and ensures compliance with data protection regulations (GDPR, CCPA, etc.).
Data catalogs have been a powerful response to those problems, and that category has seen renewed activity in the last couple of years with a whole new group of startup entrants.
At our most recent Data Driven NYC, we got a chance to chat with Mark Grover, co-founder and CEO of Stemma and co-creator of Amundsen, the leading open source data discovery and metadata engine. Mark built Amundsen while he was a product manager at Lyft and started Stemma to offer a fully managed Amundsen.
It was a fun conversation about the space. Below is the video and below that, the transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)
[Matt Turck] I’d love to start at a high level and maybe define what a data catalog is and why there is a need for it in the first place?
[Mark Grover] Totally. I may actually take one step back and talk about what the problem is, and then come to why a data catalog is a solution and what the hell a data catalog is. The problem is that over the last decade, we, as organizations, have gathered and generated too much data. A lot of it. It’s not a surprise, if you look at the innovation that’s happened. There’s innovation that’s happened in the data warehousing space with BigQuery, Snowflake, and other newer data warehouses, making it really easy for you to store tons and tons of data. Then, there’s innovation that’s happened in the ingestion space, with things like Fivetran and Stitch, that let you bring in more data very quickly. Then, we have said, “Okay, through various different systems like Airflow and dbt, you can now transform more data.” And we’ve said, “Okay, we have these BI tools for making analytical decisions, like Tableau, Mode, Looker, things of that sort.”
[1:22] All of a sudden, we have said, “Organizations need to be more data driven.” Previously, it was only the analysts’ job to crunch numbers, but we have said like, “Hey, if you’re a PM, an operations manager, or a marketing manager, you’ve got to look at data to then decide how you are going to use it.” In one way, technology and innovation have made us bring in a lot of data. Our lakes and warehouses are just brimming with data. On the other side, we’ve handed data usage cards to all these people in the company. The problem that’s created is all these people are coming into the data systems and have no idea and no context about the data. What data is out there? How do I use it? Who else is using it? When was it last updated? Who can I ask questions of? What has already been built on it?
[2:17] That’s the problem that has become even more acute today. It existed in a certain number of companies in the past; it’s super acute today. The historical way of solving this problem has been what I call a data catalog – a system that lets you gather metadata: what data exists, who uses it, what’s built on it, where does it come from? Then, it powers use cases around discovery. I’m going to do analysis on ETA – where do I find this data? Or use cases around transformation or migration – I’m moving from HubSpot to Salesforce, and I want to repoint all my dashboards to the Salesforce data. That’s the intent of the data catalog system. It serves use cases around the productivity of data scientists and data engineers, as well as use cases around compliance with regulations in various different domains.
[3:12] Maybe double clicking on the latter part – governance. Why does governance matter?
[3:18] Yeah, totally. I mean, governance matters in those two ways. The first one is around productivity – so your analysts, and your product managers, and anybody who’s going to make use of data during their job is effective at using that data. They may have skills like being able to write a SQL query or interpret a dashboard, which are generic, but they don’t have the organizational context. On governance, on the other side, which is compliance with regulations, those are actually domain dependent. There are the regular regulations here, like GDPR and CCPA, that apply to almost all organizations. Then depending on the domain, if you’re a financial company, like there are many in New York, you may have certain other regulations that you have to fulfill. These could be around, “I’m reporting this data to a certain auditor, and I want to be able to prove that there’s no manipulation happening outside of what I already know, during the process.” Or, “This particular system is regulated and should not have any sensitive data outside of these bounds.” Understanding what data is in that system and what the bounds are, so you can alert on violations – those are other styles of compliance and governance requirements that are pretty top of mind for users.
[4:35] So that’s the problem. Like a lot of great knowledge entrepreneurs, you experienced it firsthand at Lyft. How did you guys go about solving it? Maybe just walk us through the birth of Amundsen as an open-source project?
[4:50] I’ll start by telling you how that problem had been solved until that point, and why that was not good enough. Up until that point – I’m talking 2017, 2018 – the problem had been solved by what I now call curated data catalogs. A curated data catalog is a catalog that generates essentially blank wiki pages for your data. You may have a wiki page for a data set in ClickHouse, or in Snowflake, or in BigQuery, something like that. Then you may have a blank page for your Tableau dashboard. Essentially, you get a bunch of people, users, to populate these blank pages. The information you populate is: this particular table or data set is the blessed data set for these kinds of use cases; this data set is often joined with this other data set; the canonical dashboard for viewing a business’s health metrics is here. Right?
[5:49] Literally, manual notes right?
[5:52] Correct. Yeah, and so the problem with this approach is, A, it takes a very long time to get to value. It takes you, depending on the size of the organization, anywhere from a year to three years to actually get all this metadata in, to then hand it to your users or your compliance folks and be like, “Let’s use it.” The second one is that the moment you write it, in any company that’s evolving, which is every company, this gets out of date. It’s just a matter of whether it gets out of date tonight, or three months from now. This has been the status quo of solving this problem. What has happened is, there is a newer breed of companies that are all cloud, fast growing, product-led – and it didn’t start yesterday, it didn’t start in 2017, it actually started before that.
[6:37] These companies are growing at such a pace, and there was growth in two dimensions. One is the amount of data they have, the second is the number of people that are going to use this data. In these companies, when you have one or both of these two criteria met, that system breaks. You cannot rely on curation as the source of discovery and of understanding context about a catalog. Therefore, there’s a need for what I now call automated data catalogs. An automated data catalog cannot guarantee you that this is the single source of truth, but it can tell you that out of these 200 data sets related to X – related to, I don’t know, pricing – these 180 are no good for you and your use case, because they haven’t been updated in a while, nobody else in your team uses them, there are no dashboards built on top of them, all that kind of stuff.
[7:27] And then, “These are the 20 that you should dig into.” It won’t tell you, “Hey, this is a guaranteed source of truth,” but it will really help you focus down on the ones that make sense. Then, you bring in the remaining 20% of curation to get the last mile benefit here. That’s the reason why previous tools didn’t work, and what led me to push for a new product being created. I’m happy to talk more about that, too.
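The filtering heuristic Mark describes – dropping data sets that are stale, unused, and have nothing built on them – can be sketched in a few lines of Python. This is a toy illustration; the field names and thresholds below are invented for the example, not Amundsen’s actual data model:

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical metadata record; field names are illustrative only.
@dataclass
class TableMeta:
    name: str
    last_updated: date
    weekly_queries: int
    dashboards_built_on: int

def looks_relevant(t: TableMeta, today: date) -> bool:
    """Keep a table only if it shows signs of life: recently updated,
    and either actively queried or powering at least one dashboard."""
    fresh = today - t.last_updated <= timedelta(days=30)
    used = t.weekly_queries > 0
    downstream = t.dashboards_built_on > 0
    return fresh and (used or downstream)

tables = [
    TableMeta("pricing_v1", date(2019, 1, 1), 0, 0),      # stale: filtered out
    TableMeta("pricing_core", date(2021, 6, 1), 140, 3),  # alive: kept
]
print([t.name for t in tables if looks_relevant(t, date(2021, 6, 15))])
# → ['pricing_core']
```

A real catalog would rank rather than hard-filter, but the signal sources are the same: freshness, usage, and downstream dependencies.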
[7:52] How does the automation work? So the metadata gets automatically captured from the various sources and is actually pre-populated? Is that the idea?
[8:03] So let’s dig in. The main systems that you need an automated data catalog to integrate with are your data warehouse, and this could be a data lake, data warehouse, and stuff of that sort. From there, you get information around what data sets are present or what tables are present. I use those two terms interchangeably, as well as how they’re being used. A data warehouse usually has access logs that you can take out and parse them to generate like “It’s often joined with this other thing and Jack often uses this data”, so on and so forth.
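The access-log parsing Mark describes – deriving “often joined with this other thing” and “Jack often uses this data” from the warehouse’s query history – can be sketched as follows. The log format and the regex-based SQL “parsing” here are deliberately simplified assumptions; real warehouse audit logs (e.g. Snowflake’s or BigQuery’s) are structured, and a production system would use a proper SQL parser:

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical access-log lines in "user=<name> query=<sql>" form.
LOG = [
    "user=jack query=SELECT * FROM rides JOIN drivers ON rides.driver_id = drivers.id",
    "user=jack query=SELECT count(*) FROM rides",
    "user=matt query=SELECT * FROM rides JOIN pricing ON rides.id = pricing.ride_id",
]

# Naive table extraction: any identifier following FROM or JOIN.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)

joined_with = Counter()  # (table_a, table_b) -> co-occurrence count
top_users = Counter()    # (user, table)     -> query count

for line in LOG:
    user = re.search(r"user=(\w+)", line).group(1)
    tables = sorted(set(TABLE_RE.findall(line)))
    for t in tables:
        top_users[(user, t)] += 1
    for pair in combinations(tables, 2):  # tables appearing in the same query
        joined_with[pair] += 1

print(top_users[("jack", "rides")])        # → 2
print(joined_with[("drivers", "rides")])   # → 1
```

Aggregated over millions of log lines, these counters become exactly the “often joined with” and “who uses this” hints surfaced next to a table in the catalog.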
[8:35] The second system that an automated data catalog needs to integrate with is the BI tool. It has all your information about what dashboards are viewed, what dashboards are being built from what data sets, and who the people are who view those dashboards.
[8:48] The third system is usually an HR or team hierarchy system, so you can figure out that these people are on the same team and therefore be able to suggest interesting data or metadata for them to use based on what their peers are doing. Then, there are transformation systems. If you know how often or when a data set is transformed into another, you can understand the lineage between one data set and another, and the lineage between a data set and a dashboard.
[9:14] Lastly, there are collaboration systems. Often – I don’t know how it is in your companies, but in the companies I work with – as an analyst, you imagine… Have you seen one of those memes, “What my parents think I do, what my co-workers think I do, and what I actually do?” This meme comes to my mind when I think about the work of an analyst or a data scientist in some of the companies that I worked at. The expectation is, “Oh, they spend all their time modeling,” and that modeling can be analytical or algorithmic. The reality is, I feel like they spend all their time on Slack going, “Hey, what’s the source of truth for this data? Where do I find this? Has someone else used it? What does this column mean?”, all that kind of stuff.
[9:59] It’s important that we integrate what kind of comes back to the last point, an automated data catalog also integrates with your collaboration tools, that’s your Slack, for example, and links conversations about data that have interesting context back to the data catalog. Then, it ends up becoming like a Google search for your data based on all these metadata systems and kind of creating a page rank algorithm, showing you various different intricate pieces of information from a few of these systems.
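The “Google search for your data” idea can be illustrated with a tiny power-iteration PageRank over a toy metadata graph, where users and dashboards point at the tables they use, so heavily-used tables rank higher in search results. The node names, edges, and graph representation below are invented for illustration, not how Amundsen actually stores metadata:

```python
# Toy metadata graph: edges point from users/dashboards to what they use.
edges = {
    "dash:weekly_health": ["table:rides", "table:pricing"],
    "user:jack": ["table:rides", "dash:weekly_health"],
    "user:matt": ["table:rides"],
    "table:rides": [],
    "table:pricing": [],
}

def pagerank(graph, damping=0.85, iters=50):
    """Standard power-iteration PageRank; dangling nodes spread rank evenly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}
    for _ in range(iters):
        new = {node: (1 - damping) / n for node in nodes}
        for src, outs in graph.items():
            if not outs:  # dangling node
                for node in nodes:
                    new[node] += damping * rank[src] / n
            else:
                for dst in outs:
                    new[dst] += damping * rank[src] / len(outs)
        rank = new
    return rank

ranks = pagerank(edges)
# table:rides outranks table:pricing – more users and dashboards point to it.
print(max(ranks, key=ranks.get))  # → table:rides
```

Search then becomes: match the query text against names and descriptions, and break ties with this usage-derived rank.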
[10:29] On the access side, you mentioned that you integrate with the HR systems so that you can infer who’s part of what team, but in terms of giving access to some people to certain types of data, can that be automated? Or is that like the 20%, where you’re like, “Okay, well Jack has access to this, but Matt doesn’t?”
[10:51] When people talk about access control, they talk about one or two different dimensions. One is access within the data catalog. One person may be able to search for something while the other may not be able to. The other one is access in the actual data systems itself based on some classification that may be present in the data catalog. Which direction were you thinking, if any of those at all, Matt?
[11:14] Sort of both. I mean, mostly, Matt should not access GDPR data for whatever reason, and therefore, the catalog should make sure that Matt doesn’t access those.
[11:32] That’s a great question. The way I have done this, and I continue to think this way, is that often, you would have data sets or assets in the company that are locked down to only a specific set of individuals. I have found, though, a lot of time being wasted in this catch-22 problem, where Matt doesn’t have access to this GDPR data, but Matt needs to do something that requires him to have access to this data. But due to the lack of knowledge of even the existence of this data, Matt can’t do the job. One default stance that we have is that within a certain deployment – and sometimes the deployment is for the entire company, sometimes for a certain line of business within the company –
[12:14] Within a certain deployment, discovery of the data – the fact that there is a GDPR data set – should be public information, but you can’t really find out anything of consequence or access this data without explicit permission being given to Matt to query this data. That’s one stance we have. Then, the catalog facilitates the permission management in the sense that you can press a request button and it makes an API call. But the management of that actually happens outside the catalog. That, at that point, is a partnership with either a product that provides access control, or the built-in access control in a data warehouse system like Snowflake, or BigQuery, or something like that.
[13:10] A big part of success then is integrating with a bunch of different sources. I mean, it sounds like you, just to recap what you said, part of the modern data stack effectively, you integrate with data warehouses. But is that part of the idea? How do you determine the whole universe of integration? Because it’s sort of limitless, right? Especially, we had Zhamak at the last event – we talked about the data mesh. This is only going to get more complex and decentralized if we believe that the world is going towards the data mesh concept. How do you ensure that you’re not constantly running around trying to connect to the next source?
[13:52] I think it’s a great point. I would say, actually, that’s one of the things – that’s the primary reason why you need a product like this. First of all, the modern data stack is not a bundled one, it’s an unbundled one. You have your data storage system, you have your ETL system, you have your transformation system, you have your BI tool, and none of these tools are packaged by like IBM and you’re buying one behemoth IBM platform. You’re getting these products that are unbundled.
[14:21] In my opinion, I feel strongly that’s the right thing to do and you get the best of breed products, but it creates problems around management and governance that are net new and need to be solved in a new way. That’s where something like a data catalog can really help. But as a business owner running this business and having to integrate with all these systems, it is also something that we pay the cost for. Thankfully, the cost hasn’t been huge. I do think it’s the right thing to do for us to evolve our integration when new systems emerge.
[14:59] Our guiding light is our customers. While we can choose to some degree what kinds of customers, and what verticals, industries, and markets we focus on, we find that most of the customers that want a modern data catalog also have the modern stack. And the modern stack is actually not as exhaustive a set of options as you might think. I keep coming back to these two data warehouses, but in reality, there are probably 50 different data warehouses, if not more. The thing is, the ones that are commonly associated with the modern stack happen to be the same ones over and over again.
[15:36] Coming back to the Amundsen story, you and others started the open-source project at Lyft. How long did that last? Then maybe walk us through the birth of Stemma as a separate company?
[15:51] Yeah, totally. Continuing with the story, this is a big problem at Lyft, too much data, too many people wanting to query the data, no one’s got any clue what’s going on. This idea of an automated data catalog stuck with me. We started building this product internally with the target persona being an analyst and a data scientist to just really make them effective. We did a hackathon, which was a very quick throwaway version just to see if we can de-risk the project, and then built the real project over the next few months. We launched an alpha of this project to 10 users and they were like, “This is the best thing we’ve seen.”
[16:33] We quickly got their feedback, iterated further, and launched it in beta more publicly to everybody at Lyft. At Lyft, from that day, which was probably mid 2018, until today, this product is the single highest CSAT-scoring internal product. It has 750 users every week. Lyft only has 250 or so data analysts and data scientists. There are all these other people who have started using this product because the barrier to entry for them to use data has been further lowered. You have these 750 users using it every week.
[17:15] Those are product managers, like, business folks?
[17:19] Yep, exactly. I wrote a blog post, I believe it was late 2018, called ‘Introducing Amundsen’. Amundsen was not open source at that time. That blog post kind of caught fire. It was like, “Oh, we need a system like this.” We said, “The concepts here are pretty universal, it’s just that the integrations are very Lyft specific.” What happens if you open source this so people can use the concepts and build their own integrations? That project became what Amundsen is today. Super highly used, over 35 companies using it as open source – ING, Instacart, Asana.
[17:54] Then over the years, I found that the kind of velocity we needed in order to move the roadmap – in direct partnership with companies that really needed to move fast – went beyond the resources a company will put toward an open source project, and was best served by a commercial company around it. That led to the birth of Stemma, because I felt like otherwise we could only solve this problem for a very tiny sliver of companies. We had to solve it for everybody. Stemma was founded in 2020, last year; we are funded by Sequoia, solving this problem of data discovery and data cataloging through automation, and have been serving our customers and growing that way. That’s the journey thus far.
[18:49] The data catalog space is reasonably competitive. I mean, obviously, it’s a big opportunity, big market. It seems to be startups, scale ups I guess, that have been at it for a little while, the Collibra’s of the world, Alation, both of which we had as speakers at this event over the years. Then there’s a new generation, certainly you guys, but others as well, like Metaphor and Castor, Atlan, a number of different players, which are all in our Data Landscape, for those that want to zoom in on those categories. How do you think about those? How do you position and how do you eventually win that market?
[19:47] So a few things here. One is that there are kinds of catalogs that are more command-and-control catalogs. This is a product, but it’s actually a manifestation of the culture of the company. If you are in a company where data is very tightly restricted, and you don’t want to democratize access to certain kinds of data for others to use and make decisions with, these catalogs work really well. You curate them, but you also end up creating these heavy workflows that require going to a university to understand how to manage and orchestrate them, plus a team of people who are on the other side of these workflows approving everything, from something as small as updating the description of a column to something as big as granting a request for access.
[20:37] Some of these you would have in any company, but some honestly only exist in companies that have a very command-and-control culture around data. Then there are companies that actually want to evangelize and democratize the right data for decision making to everybody in the company. The first big dividing line I see in the offerings is the command-and-control catalogs, which are often curated and workflow based, versus the ones that are more democratized and automated in both the metadata they get and the users they support. Stemma is very clearly in the latter category. We do not do well serving command-and-control style organizations.
[21:22] The second thing I would say is there’s a tendency in this space to build products that are platforms. An example is you have a data catalog that provides a search or browse experience. But in addition to it, you put in a BI-like tool so you can write queries. In addition to it, you create a wiki-page-style tool, which allows you to write more detailed pages about data. You have a conversation system in the tool so you can ask people questions and respond, kind of like those comments on a Confluence page, things of that sort. The thing is, if you’re a very small company, that’s great. You can buy one thing, you get all the tools, they’re all very well integrated.
[22:01] But like we were alluding to earlier, the modern data stack is unbundled. That means in any modern organization of non-trivial size – I would say 300 employees or more – you would find that you already have tools for your BI, you already have Slack for conversations, you already have a wiki-like tool such as Confluence or Notion. It’s very important – in my opinion, the right thing to do – to integrate with these best-of-breed tools instead of building a coherent home, or platform, or something like that.
[22:34] That’s the other dividing line I would draw: there are some catalogs that choose more of a home or platform approach, which works really well for really small organizations. But when you grow, you have to integrate with the ecosystem the organization is in. We, for example, have no feature for having a conversation in the data catalog, and that’s intentional, because we want to integrate with Slack. That’s why there’s a Slack bot rather than conversations in Stemma. We don’t have a BI tool in Stemma; we integrate with your Tableau or Looker, things of that sort. Those are the dividing lines I’d put on how we think about the world.
[23:09] Really interesting. All right, let’s maybe close this discussion, because there’s a couple of questions in the Q&A that I’d like to talk about. Let’s talk about go to market, open source versus the commercial product. How do you view both of those work in tandem from a product and go to market perspective?
[23:45] Some context here is before Lyft, I was at Cloudera. I spent about five years working at Cloudera. I was a lowly engineer, so I didn’t have a whole lot of context into strategic decisions around open source versus proprietary. But I had a lot of experience being on the receiving end of those decisions, and being a part of communities, and seeing how that was working, and straddling the line between open source and proprietary in the Cloudera world. Cloudera had a competitor, Hortonworks, which was all open source, and simply selling support and services. I’m going to take my data practitioner hat off and put my business owner hat on.
[24:23] I find that selling support and services on open source products is not a great business model, and that’s not a business I’d like to be in. That means you’re building a product around your open core. Then, the question becomes: what principles do you build around, which you can tell your community – what is open source and what is proprietary? Those principles also help you determine whether the model will work for you as a business or not. Then depending on that, it’s a numbers game, in the sense that you need to have communities of a certain size in order to convert enough people, in order to actually have the traction. Or in some cases – the example I’ll use is Segment and Analytics.js, I think it was called.
[25:18] There was a library they put out that was open source, and then they built the product around that library. At the end of it, there’s no relationship between that open-source library and the product. I think that’s a fine path to take. For us, it just depends on the nature of the product, the kind of buyer, and the space you’re in that helps make these decisions. But we’re definitely not a support-and-services-only company; I don’t think that’s sustainable.
[25:56] What are two or three industries that are utilizing Stemma currently? Either Stemma or Amundsen.
[26:06] For Stemma, the most common industry happens to be financial services – there are public finance companies that use Stemma. Then also, technology companies that are fast growing. They may be product-led companies, or they may be in other spaces, but they’re just really fast growing companies that have a lot of data that they want to democratize and enable the rest of the organization to use. So heavily regulated industries and fast-growing companies are the two most common places where we see adoption.
[26:46] A more tactical question, which you may understand better than I do, from Marlon – it actually goes back to earlier in the conversation, when we were talking about governance. Marlon asked whether governance should be managed at the domain level.
[27:03] There’s probably a lot to unpack in that question. I’m afraid my opinion isn’t a clean dichotomy here. Since you’ve talked to Zhamak, and she was on here, I’m all for decentralization. Governance teams, or data teams, sit in the middle between data owners and data consumers. What we want from the data teams is to not be a bottleneck – to enable these two groups in the company to talk. Having said that, I don’t think we wash our hands of the responsibility of enabling these two groups to talk.
[27:43] Enabling here means that you need to provide tools, artifacts, and sometimes processes to help these two groups talk. Let me give a concrete example. The owner of the data wants to deprecate a particular data set. They need to talk to the consumers who view the dashboards powered by the data set in question. But they have no tool to see, “Hey, who is using my data? How are they using it? How often?”
[28:17] It is the central team’s responsibility to provide that tool. It’s the owner’s responsibility to use that tool and then communicate directly with the consumers. So it’s hard for me to say flatly that governance should be managed at the domain level. I’m not sure if I’m answering your question, Marlon, at all. But I think there are actual activities that get democratized and pushed out into decentralized domains, and then there are activities, mostly around [inaudible 00:38:55], that we have to pick up as centralized teams to enable those groups to talk to each other.
[28:47] All right, very cool. Well, on that note, that’s a wrap for this conversation. Really appreciate it, Mark. I really appreciated that you made this educational and gave folks the opportunity to learn a lot. This was great. Thank you so much. People can follow you at Mark_Grover on Twitter. What’s the website for the company?
[29:15] Stemma.ai. We couldn’t get the dot com address. If any of you are starting a company and you have a name in mind, go grab the dot com address and get working on that soon.
[29:24] Well, .ai is pretty cool as well. You have a great blog on there that I encourage people to go read – very, very interesting stuff. Cool. Thank you, Mark. Really appreciate it.
[29:35] Thank you for having me. It was a lot of fun.