The world of data governance is not the most visible part of the data revolution, yet it is of critical importance. As more and more data floats into the enterprise, and its role is ever more mission critical, one needs to be in full control of it – understand where data resides, who can have access to it, which datasets can be trusted or not, etc.
Enter Collibra, a startup that has had a long march towards success, as it was founded in 2008. Collibra has now become an impressive industry leader and raised a $250 million Series G at a post money valuation of $5.25 billion last year.
We had the chance to host Stan Christiaens, the co-founder and CTO of Collibra, at Data Driven NYC in 2017 (video here), and this time we got a chance to chat with the company’s CEO, Felix Van de Maele.
We had a great conversation, starting with a round of definitions that should be interesting to anyone curious to better understand that side of the data world.
Below is the video and full transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen, Karissa Domondon and Diego Guttierez.)
TRANSCRIPT [edited for clarity and brevity]
[Matt Turck] To make this educational for everyone, let’s start with a round of definitions. What does data governance actually mean?
Governance is all about trust: how do we make sure we can trust the data? That we can trust how the data is being created, and how it’s used and managed? From a theory perspective, it’s very much a policy-setting exercise: how do we define the policies that govern the way data is being created, managed, used, consumed and so forth?
I think in practice, if you talk to organizations that implement data governance, what it typically looks like is that they start with a business glossary: how do we make sure we have one shared language so that we can actually understand each other? If I tell you, Matt, we have 5% churn, you probably want to know: what does that mean? How do we calculate that?
So agreeing on definition is absolutely important. It creates a shared language that we can actually understand and agree on what we talk about. Then second step typically is about stewardship. We assign roles, responsibilities, again, the organizational aspect. Who’s the data steward for our customer data, for example. Again, if I have a problem, who do I go to? Who’s responsible to solve it? Who gets to make a decision of how do we define what a customer is?
Third is around policy management: defining the policies, such as who can access the data. What do we do with sensitive data? What do we do with privacy data? How do we think about security policies, quality policies, and so forth? Then a data help desk is typically also a big aspect of it. If there’s a problem, how do we reach out? How do we solve that problem? When it’s an IT issue, typically you log a ticket in ServiceNow. It’s not that obvious when you have a data issue. I think a core part of all of this is the understanding that we actually need a workflow engine, a business workflow engine, to help people work together effectively. So in practice, when you hear that organizations implement data governance, that’s typically what it looks like.
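To make the churn example above concrete, here is a minimal sketch (not from the conversation) of why a shared definition matters. The `Customer` fields, the two function names and the sample numbers are all hypothetical; the point is only that two reasonable definitions of "churn" give very different figures for the same customers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: str
    active_at_start: bool   # subscribed on the first day of the period
    active_at_end: bool     # still subscribed on the last day of the period
    monthly_revenue: float  # revenue at the start of the period


def logo_churn(customers):
    """Share of starting customers that cancelled during the period."""
    start = [c for c in customers if c.active_at_start]
    lost = [c for c in start if not c.active_at_end]
    return len(lost) / len(start)


def revenue_churn(customers):
    """Share of starting revenue lost to cancellations during the period."""
    start = [c for c in customers if c.active_at_start]
    lost_revenue = sum(c.monthly_revenue for c in start if not c.active_at_end)
    return lost_revenue / sum(c.monthly_revenue for c in start)


# Hypothetical sample: one small customer out of four cancels.
customers = [
    Customer("a", True, True, 1000.0),
    Customer("b", True, True, 500.0),
    Customer("c", True, False, 100.0),
    Customer("d", True, True, 400.0),
]

print(f"logo churn:    {logo_churn(customers):.0%}")
print(f"revenue churn: {revenue_churn(customers):.0%}")
```

With this sample, one definition reports 25% churn and the other 5%, which is the kind of ambiguity a business glossary entry and a named data steward are meant to resolve.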
What is metadata management?
Metadata management is a very technical term. It’s not a new term. We’ve been talking about metadata and metadata repositories for 30 years, ever since we had databases. So typically metadata management is really about the technical metadata: the tables, schemas and columns that you manage. I think it’s worth going back to why all of this is important, and why suddenly there’s almost a renaissance in metadata management, although we don’t specifically like to talk about metadata management; we prefer data intelligence.
But I think we’ve seen so much innovation in this entire data landscape. You deal with it every day. We’ve been at it for 14 years now with Collibra, and the world looks very different. I think a big shift was this democratization around data, especially self-service analytics: Tableau, Qlik, Power BI, Looker. So many more people were consuming analytics and reports. Then we saw the shift to big data, which was all about volume. I think now we’ve gotten a little smarter, and it’s all about the shift towards cloud. We’ve seen massive innovation again on the infrastructure side: how we store data, how we process data, and, interestingly, everything that’s happening around data ingestion, ETL, ELT, the whole decoupling there.
So there’s been a ton of innovation on the infrastructure side, the tooling side. But I think what has happened is that the level of complexity, the level of fragmentation, the level of distribution has only increased. So it’s only become more difficult for people to actually find the right data, understand that they can use it, make sure they understand what it means. So it’s only become harder for people to actually consume and produce data. Going back to metadata management, we believe a new approach is required, with a metadata management foundation – you really build almost what we call a system of engagement, a system of record for data. I think ServiceNow is a great analogy. If you go back 15 years ago, every company was investing in IT, it became chaos.
The CIO came in and said, “Okay, we need control.” So IT governance happened, and the foundation of IT governance was your configuration management database, your CMDB. That then evolved into IT service management: how do we actually automate all these IT workflows? That’s how I think about metadata management today. The chief data officer or chief analytics officer comes in and says, “This is chaos. We need more control.” So the initial reaction is data governance: we need to understand what data we have, where it is, and who has access to it. So we need to build that metadata management foundation, which we call the metadata graph, and on top of that you provide these capabilities to ultimately accelerate data processes, if you will.
What’s a data catalog and then the next one will be, what is data quality? What does that actually mean?
A data catalog is really how you inventory your data. Again, how do you know what data you have? It’s how you catalog, literally catalog, all of your data sets and your metadata, whether in the cloud or on premise: what are your tables, your schemas, your databases, your columns, and what do they mean? I think of it almost like the Amazonification of data. On Amazon you shop for products: you can browse products, you can search for products, there’s context, there are reviews, there are previews, things like that. Then you can check out, so you have that shopping experience, and it actually gets delivered to your door the next day. The data catalog offers the same experience, but for data sets. How do you allow a user, a business analyst, to shop for data? It doesn’t really matter where the data resides: on prem, in the cloud, in traditional databases or the new kinds.
Then data quality. What does it matter? What is it?
We’ve seen a renaissance in data quality, with a lot of new data quality and data observability startups. Originally, it came from: how do we make sure that, in the marketing database we send mailings to, the addresses are correct? That’s where data quality came from 20 years ago. Today, it’s obviously very different. As part of this modern data stack, where you have all these data pipelines, you need to understand what’s happening in your data ecosystem, your data stack. So you need to start monitoring, observing, and ensuring you understand the quality of the data as it flows through all of your systems. I think of it almost like what Datadog is doing on the IT infrastructure front; we need to do the same thing on the data front. That’s why data quality and data observability have become so important. You have machine learning models in production. If something breaks, it’s a real-time issue that requires real-time resolution.
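As a rough illustration of the kind of monitoring described above, the sketch below checks one batch of records for missing values and flags fields that exceed a threshold. It is not Collibra’s or any vendor’s method; the function names, the 10% threshold and the sample batch are assumptions made for the example.

```python
def null_rates(rows, fields):
    """Fraction of rows in which each field is missing or empty."""
    total = len(rows)
    return {
        field: sum(1 for row in rows if row.get(field) in (None, "")) / total
        for field in fields
    }


def failed_checks(rows, fields, max_null_rate=0.1):
    """Return {field: null_rate} for every field above the threshold."""
    if not rows:
        # An empty batch is itself a signal worth alerting on.
        return {field: 1.0 for field in fields}
    rates = null_rates(rows, fields)
    return {field: rate for field, rate in rates.items() if rate > max_null_rate}


# Hypothetical batch from a mailing list: two of four addresses are unusable.
batch = [
    {"customer_id": 1, "address": "1 Main St"},
    {"customer_id": 2, "address": ""},
    {"customer_id": 3, "address": "9 Elm Rd"},
    {"customer_id": 4, "address": None},
]

alerts = failed_checks(batch, ["customer_id", "address"])
print(alerts)
```

Run on every batch that moves through a pipeline, a check like this turns the old "are the mailing addresses correct?" question into a continuous signal, which is closer to the observability framing in the answer.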
Let’s talk about that concept of data intelligence cloud. I’d love to do a little bit of a deep dive into the Collibra platform. What is it, what does it do? All the components that we just talked about are sort of merged into one platform. So if I’m a company and want to make sure that my data is under control, I work with Collibra and then what do I have access to?
In our evolution as a company, we started from governance and then started adding capabilities: data catalog, data lineage (again, how does data flow through your organization), data privacy and, most recently, data quality. The way we think of data intelligence, it’s really an organization’s ability to understand its entire data landscape, trust that the data is used in the right way, and then automate these workflows. So there are really three big components, almost three big categories, to data intelligence. One is around governance, lineage and catalog. It’s all about how we make sure we understand what data we have. The second is around quality and observability: understanding what’s happening with that data through the whole architecture. The third is around privacy and security. How do we make sure we are treating PII data and sensitive data in the right way?
So these are the three big categories, again built on top of that metadata graph, tying it back to the previous discussion about network effects. That’s really how we think about network effects. You really want to build that understanding of your entire data landscape, start connecting the dots and start building that context. That’s going to give you the trust and understanding to make sure that you’re using data in the right way. All the products I went through, combined on that one metadata graph, make up our data intelligence cloud.
And at its core, when a company rolls out Collibra, you have presumably a series of connectors into all the various repositories, whether on prem or cloud. You don’t move the data, right? You just collect the metadata? How does that work?
Exactly. We don’t move any data. We have connectors. We tie into all of your data systems, on prem and in the cloud. We capture all of that metadata: tables, schemas, columns, files, and so forth. And that’s how we build that metadata graph. But we are not in the business of moving or storing any of the data, just the metadata. Think of an old-fashioned analogy, like a library: you have the index cards, and that’s what we manage. The books, the data itself, wherever they are, are not something that we deal with.
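The "index cards, not books" idea can be sketched with Python’s built-in `sqlite3` module. This is only an illustration of harvesting metadata without reading rows, not how Collibra’s connectors work; `harvest_metadata` and the `customers` table are hypothetical.

```python
import sqlite3


def harvest_metadata(conn):
    """Return {table: [(column, declared_type), ...]} without reading any rows."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog


# Hypothetical source system with one table and one sensitive row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, country TEXT)"
)
conn.execute(
    "INSERT INTO customers (email, country) VALUES ('a@example.com', 'BE')"
)

catalog = harvest_metadata(conn)
print(catalog)
```

The resulting catalog holds table names, column names and declared types, while the email address in the row never leaves the source database.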
And in that analogy, who are the librarians?
Great question. So that’s typically the data stewards. Then you have different personas. The data stewards typically are the librarians that are responsible to steward the data. To make sure we have great definitions. We understand where it comes from. To make sure data is being treated correctly, but there’s a lot of different personas. I talked about that Amazon-ification of data. If I’m a business analyst, I need to create a Tableau report. Or if I’m a data engineer, a data scientist, I need to create an ML model. Typically my first step is always okay, where do I find the right data? I’m in marketing. I want to do a customer churn analysis, or I want to build a customer trade model, I need customer data. I’m sure we have lots of different copies, but where can I find the right customer data that includes all of our customers, not just European customers?
How do I make sure I’m using that data correctly because it’s obviously very sensitive data. How do I make sure that legal signs off on this? Do I have to manually do this? How do we capture that legal has signed? So this whole coordination effort is something that we then facilitate and automate. So of course the data stewards are a key persona, user, business analyst, data engineer, data privacy manager, data scientist. These are the key users of the platform.
Who’s an ideal customer for Collibra? Is that a large enterprise where there’s a lot of complexity? Is it a smaller, faster-growing startup? Who’s best?
I’d say the bigger the complexity, the bigger the chaos, the more value we can add. I think a small company has similar problems to a large company, just at a different scale. What we’ve done really well, given where we came from, is that after the financial crisis we started to work with all the large banks. We’ve been very successful in being able to cope with the complexity of the largest companies in the world. We also have a lot of high-growth companies that have a lot of complexity around data. You would be surprised: with some of those digitally native, very data-first companies, you would think they have all of their data in order. That’s definitely far from the truth. But it’s mostly large companies, I would say, because that’s where we can help the most.
You mentioned the modern data stack. I’d love to better understand how you see data governance in general, and Collibra in particular, fit into some of the key trends that we’ve covered at this event over the last few months and years. The rise of the modern data stack is really this idea of having a central data warehouse, whether it’s Snowflake, which is one of your investors, by the way, or Redshift, or what have you, and having a flow of data from original sources through the warehouse into BI and other functionalities. Where does that fit? Do you sit on top of the data warehouse? Is the data warehouse just one of the many sources? How does that fit?
So think about the modern data stack as almost this data supply chain. You start with the source data, your ERP, Salesforce, what have you, then data ingestion, ETL, ELT. Again, there’s lots of innovation happening in that space. Then you get to the storage: Snowflake, Databricks. Streaming is a big part of it. Then typically you go to the AI modeling, the DataRobot, Dataiku, and then the consumption. We are not part of this supply chain in the sense that we don’t move the data. We don’t store the data. We don’t change the data. We sit above it, and not just above the storage, the data warehouse for example, but really across that entire supply chain.
One of the value propositions that we think of for ourselves is that we handle every user and every use case across every source, all the way from the source to the reports, Tableau, Looker, Power BI, and everything in between. And these are the three categories of data intelligence that we think of. One is around governance, lineage and catalog: understanding what data you have across that entire modern data stack. Then quality and observability: what happens through these pipelines, and how do we make sure we can trust what’s happening there? And then privacy and security: making sure that everything you’re building is compliant with regulations and security constraints. And so to your point, I think we definitely sit on top, across that entire supply chain, if you will.
The other big trend that people talk about a lot is this concept of data mesh. We had Zhamak, the author of the concept, at this event a few months ago. It’s really this idea of decentralizing the stack, where different people own the data that they produce, which is maybe contrary to the modern data stack and goes towards more tools, more systems, more pipelines. Where does governance fit on top of this, and how do you build for that world of decentralization?
We’re big fans of data mesh. If you think about a data mesh, it’s really all about governance. It’s really, how do you do all of that. It’s almost like governance for architects, if you can call it that. Because it’s very much an organizational construct around decentralization. I think that’s absolutely the right approach. We’ve seen it clearly work well within engineering. The only way to scale is to decentralize.
If you look at all of these data repositories, data warehouses, they all argue that, just move all of your data in one place, and it’s going to solve all of your problems. We’ve been hearing that promise for the last 25 years, and it’s never solved all of our problems and it never will. We have to embrace the fact that data will be diverse, different, and decentralized.
So governance only becomes more important. If you think about some of the key principles in data mesh, like domain orientation, where you organize across domains, it’s absolutely the right way to do governance. We talk about federated governance instead of centralized governance. If you think about data as a product, I think that’s, again, tying it back to metadata: it’s almost about the usability of data.
Over the last 10 years, as I said, we’ve been way too focused on just storing more data. When I last talked with Zhamak, she had a great quote: we need less collecting of data and more connecting of data. We have a lot of data. That’s typically not the problem. It’s not by storing more, or having a faster database, that we’re going to make our organization more data driven or better. It’s really about understanding the context. Again, that’s where metadata comes in. How do we understand the documentation around data? Where is it coming from? What is the quality? How is it being used? How are we allowed to use it? So again, that usability of data as a product is, I think, a really important component. So we are big fans, and I think it’s absolutely the right way for data, and the way we organize ourselves around data, to evolve.
One question from the group here live, which perfectly anticipates where I was going to go next, is around competition. How do you compete with hyperscaler native solutions on that front? And more broadly, because what you do is so incredibly mission critical to any company that wants to deploy data, BI, machine learning and AI at scale, it’s a particularly vibrant part of the market. So there are indeed the hyperscalers: as far as I know, Amazon, Microsoft, and Google all have some overlap. Then there’s a whole host of data quality startups and data observability startups. Then there are some of the older players like Qlik, which bought Podium Data, and then there’s Tableau. So it is a whole vibrant ecosystem. How do you position, how do you differentiate and how do you win?
So we think of that ecosystem as three big categories. One is what I call the point products, which do one particular thing really well: to your point, a data quality tool, a data catalog tool, a data privacy tool. The second is what I call the incumbents, which I think have missed the boat to the cloud and are going to struggle to provide that experience. Then finally, to the question, there are the hyperscalers, of course. It goes back to our value proposition. To your point, how do we win? I think there are three components. The first is being able to address every user: not just the technical user, not just the business user, but both. The second, and I think that’s really important, is every use case: not just catalog, not just privacy, not just quality, but the entire data intelligence spectrum.
It goes back to the network effect with metadata, metadata management, metadata graph as the union of bringing it all together. Then finally across every source – that’s really where the difference is with the hyperscalers. And again, we are great partners with Snowflake, which is an investor, Google, which is an investor. Amazon similarly. They’re great at managing within their ecosystem. They view purely technical metadata within BigQuery or Snowflake or Databricks and so on. Of course you’re going to have that in those hyperscalers.
But then how do you tie it to the business? That’s not something that they do. How do you tie to your organizational model, your policies, your quality, observability, that’s not something they do. And most importantly, how do you bridge across again, that entire data supply chain, talking about the modern data stack from source to ingestion, to storage, to consumption. It’s not all going to be in one place, but you want to provide that broad experience. We can call it that system of engagement, across your entire data function. And again, that’s a differentiation. I think it’s actually a really good fit.
I’d love to switch tacks a little bit and maybe talk about the journey and what you learned along the way. Because you’ve built a remarkable company, which again, is like a $5 billion plus valuation, 1,000 plus employees, which is a terrific success. Maybe walking down memory lane, you started the company a while ago now, I believe in 2008. Walk us through the beginning and in particular, how you nailed the initial product market fit, which is this sort of elusive starting point that so many entrepreneurs look for.
We started in 2008 as a spin-off from the University of Brussels. So an academic background; this is the first job I’ve ever had. So textbook, I should probably change the founding story to a garage somewhere. But that’s how we got started. We probably started four years too early, to your point about finding product market fit, just when the financial crisis happened. Interestingly, we were doing research on semantic technologies. We called it web 3.0 at the time as well.
It was all about the semantic web, the open web and linked data, not anything crypto related. But that’s when we started. And then we had to fight for four years to find product market fit. Actually, the financial crisis helped us. That’s where we found product market fit: in the financial services industry, around compliance and data governance, because all of the large banks had to comply with a lot of new regulations after the financial crisis.
They basically had to prove to the regulators that they were in control of their data. Like, okay, you give me a report that shows a number, explain to me where that number comes from. That seems like a simple question, but it’s actually a really, really difficult one. That’s what we helped all of the banks answer. Then we’ve seen this trajectory of data changing and evolving, and we’ve been able to ride this.
So we found product market fit around data governance, and then came this rise around data. It’s interesting if you look at a proxy, the rise of the chief data officer. When we started, I think there was one chief data officer, at Capital One. Now I think there are 3,000, 4,000, 5,000. So this rise of the chief data officer is almost a proxy for the rise of our revenue. So it’s been interesting that we’ve been able to ride this wave. I don’t want to say we’re just getting started, Matt…
Of all people, you probably can.
14 years in and we’re still looking at so many new companies being started, doing what we do and it’s just an exciting place to be in.
How do you navigate a roadmap over such a long period of time? Because in 2008 the world was in a certain state and it was largely pre-cloud, effectively. And sort of pre-big data, and certainly before the resurgence of machine learning and AI, and today we’re in a completely different world. How do you build a product or platform and make sure that the older parts are not completely antiquated while you build the new stuff?
It’s not easy. I mean, start with our vision. We ultimately believed that data was important. I think that clearly has been shown to be true and is accepted by everyone now. To your point, continuing to innovate as you see these architectural and technological shifts is not always easy. It sometimes requires hard decisions. I’m really happy that four years ago we made the decision that cloud is going to be the future. We started this cloud transition, both architecturally and from a business model perspective. It’s obvious right now, but four years ago most of the data products weren’t in the cloud. They were still Hadoop, still Tableau on prem. So doing data in the cloud wasn’t as obvious.
Yeah, it’s actually something people don’t really appreciate. Because there’s such a lag or a difference between what you read in the press and on Twitter and the reality, especially for the large, Global 2000 type companies. I completely second the point. That’s what I’ve seen with the companies I work with as well. The demand for data in the cloud four or five years ago in those companies, especially the regulated ones, was like zero. Nobody wanted it. So it’s amazing that you guys did the transition at that point.
Just keep adjusting and always listen. I mean, these are all clichés, but keep listening to your customers, and also stay true to your vision and where you think they will be going.
Did you have an internal fight about it or were people saying, “Hey, no we shouldn’t do that. Nobody wants it.”
I think the fight was mostly with the salespeople, who said, “No, we’re never going to sell this. None of our customers will want to buy it.” We lost a few deals, but it’s important to just draw a line and say, “Hey, they will come back.” But it’s also important to have the right architecture. Again, we don’t capture any data ourselves. A lot of our customers still manage their data on prem, and of course we’re fine with that. Because we only capture the metadata, we’ve built an architecture that is able to do hybrid, which is important. But ultimately the experience needs to be [inaudible]. That’s the bar nowadays. But there were lots of fights. The biggest fight was all the way at the beginning. Going back to finding product market fit, we actually started with semantic data integration. We wanted to do data integration better. We tried for two years and had zero customers. That’s when we pivoted into data governance. We were just way too early. It’s great to see the innovation now around data integration and ETL. But that was a big fight.
How do you scale the team? In particular there’s always this really interesting tension between promoting people from within, especially the people that were early in the company and then bringing experienced management that have seen the next level of scale you are in. What’s your philosophy on that?
It’s hard. I use this quote that I’ve stolen from somewhere, I don’t remember where: in the beginning you’re like a pirate ship. The only thing that matters is to preserve cash and to build product or sell product. Over time, you need to build something more like a Navy ship, where it needs to be more structured, with more ownership, more repeatable, with more processes. Put a pirate on a Navy ship, and that’s not going to work. Put a Navy captain on a pirate ship, and that’s not going to work either. But that’s what you’re going through. That’s a massive change management exercise and you’re going to make mistakes. You’re going to bring in people too early. You’re going to lose people too soon. Unfortunately I think it’s part of the journey, and I’ve just learned that every year is different.
You have to explain what you’re doing, why you’re doing it, and why it’s important. But what I would tell all founder CEOs is that building a leadership team is probably one of the most important jobs that you have. I remember the first leadership team I built. I thought, these are all amazing people, and they all are amazing people. I thought that was going to be our leadership team for the foreseeable future. Two or three years later, it was a different leadership team. You just go through that really quickly when you’re growing and changing really quickly. I think it’s one of the harder things to manage as a founder CEO.
A little bit to that point, and maybe as a last question: how do you personally, as a founder and CEO, navigate this? There’s such a difference, and a tension, between the skills that you need as an early-stage founder, which is all about being visionary and selling your vision when you have nothing to show, and the stage where you are now, which is effectively pre-IPO, soon IPO hopefully, where you need to be this super-efficient manager. You said, which I hadn’t realized, that this was your first job, so you have never done this before. At a very personal level, do you have mentors? Do you read all the time? How does one learn the job on the job?
Again, it’s a cliché: surround yourself. The reason that you need to build a great leadership team is exactly that. Find great investors, great board members, mentors; again, surround yourself. But I think it’s also about wanting to do it. I often hear, “I’m an entrepreneur. I don’t want to be a manager.” But if you don’t want to be a manager, then it’s not going to work out. And that’s the decision that you have to make yourself.
So you have to want to be, have to want to do it. Also, I don’t think it’s all rocket science. It’s not that super complex. It’s all pretty logical. If you surround yourself well, and almost think of it – I have a product background. Software engineer. So initially you build a product and now you build a company and you have to think about communication, just like you have APIs and SLAs in a product you need to do the same thing on the company level.
You have to modularize, you have the components. So anyway, there’s actually a lot of analogies. And so it’s all not that super complex, but you have to want to do it. You have to, I think be super humble and always want to learn, having that growth mindset I think is super important. And then just surround yourself with great people that you can learn from.
One last question from the group since we’re just talking about this. How do you surround yourself with mentors in a remote environment? I guess to expand the question, any lessons learned in this pandemic? You obviously operate on a couple of continents and grew a global company. Any lessons learned in making a distributed team work well together and make sure that everybody learns and finds mentors and all those things?
I don’t think I have any silver bullets or secrets to share. But I hired a few new executives as part of the executive team without having seen them in person ever. So you just have to get over this very uncomfortable idea that you would hire a leader in your company without ever seeing them. But I think it worked out really, really well.
Continuing to invest in bringing the team together while still being distributed continues to be super important, for building that trust. And on the mentors, in a way it’s easier as well. The pool in which you can fish, so to speak, gets bigger. You’re not constrained anymore to a 50-mile radius or something like that. It’s all remote, plus travel. So I think you can actually cast a wider net. In that sense there’s a benefit: you can find mentors more globally.
Thank you so much for joining us tonight and sharing all of this, including the journey, which is always fascinating and congratulations on everything you guys have done.
Thank you. Thank you. Thank you.
I realize you are just getting started. I am looking forward to seeing all the success compounding over the next few years. You’ve clearly built a very important company, so thanks again. Appreciate it, and thanks to everyone who joined us tonight.