Databricks is an enterprise software giant in the making. Most recently valued at $28B in a $1B fundraise announced in February 2021, the company has global ambitions in the data and AI space.
An unlikely story – a company started by seven co-founders, most of them academics, and built around the Spark open source project – Databricks is heading towards a monster IPO that will accelerate its rivalry with its chief competitor, Snowflake.
I had a chance to interview co-founder and then-CEO Ion Stoica at Data Driven NYC back in 2015, when Databricks was a company very aggressively courted by VCs, but still very early in commercial traction.
It was a real treat to catch up with Ali Ghodsi, who took over as CEO in 2015.
Below is the video and below that, the transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)
TRANSCRIPT (edited for brevity and clarity)
[Matt Turck] I’d love to take a quick trip down memory lane and go back to the origin story of Databricks. So AMPLab, Spark, and Databricks, how did it all start?
[Ali Ghodsi] [0:23] It was fascinating. We were just at that cusp where AI was just about to get revolutionized. We were getting funding from the early startups at that time. Uber had just started, Airbnb, Twitter was in its early phases. There were smaller companies. Facebook was also a smaller company at the time, and we got to see what they were doing. And they were claiming that they were getting fantastic results using 1970s machine learning algorithms.
[0:52] Most of us knew that that couldn’t be true, that those algorithms didn’t work, but they said, “No, we’re getting superhuman results.” And when we started looking closer, it was true. They were getting amazing results, beating anything that we’d seen before. And when we looked closer, it turned out that what they were doing is they were taking those algorithms from the ’70s that do not work, but they were applying orders of magnitude more data to it. So a lot of data on modern hardware and they were getting superhuman results, and we were kind of blown away by that. And we said, “We need to democratize this.” At Facebook for example, they could detect couples breaking up in advance, and we were like, “This is really powerful technology.” Imagine if this existed in every enterprise on the planet, what this could do to the business problems that people have. So that’s how the journey started, that’s what we started with in 2009 at AMPLab.
[1:44] What was Spark at the time at the AMPLab? How did it all come about? I read some story of the engineers on one side of the lab and then machine learning folks on the other side, how did that all start?
[1:56] Yeah, it’s actually interesting. So the Nobel Prize in computer science is called the Turing Award. And one of the Turing Award winners recently, his name is Dave Patterson, he was a professor at Berkeley at the time, and he was a big believer that we should get people together, we should break down silos. And the professors at Berkeley gave up their rooms and put all the students in one big, giant open area, open desk area. So we had mathematicians, we had computer scientists, we had the machine learning folks, and they were sitting next to each other. And this all was going on around that time.
[2:28] The machine learning problems they were trying to solve were just very hard to do with the technology stack at the time. And at the same time, we were seeing Facebook, Uber, these guys do it amazingly well. The people in the AMPLab that were doing machine learning, the math folks, had to use this thing called Hadoop, which was just terrible; it was not possible. They were complaining that it takes forever: every iteration over the data has to run a MapReduce job, which can take 20, 30 minutes just to do one iteration, and they needed it to be fast. So that's when we decided, "Let's join forces, let's see if we can replicate what the FAANG companies have and build a framework that's really, really fast. We're doing lots of iterations over the data. So not just doing one pass, not just a SQL engine, but something that can do recursive machine learning and find patterns in the data at extremely fast speed."
[3:22] And by the way, the AMPLab has created some amazing things, as have Berkeley in general and Stanford. What is it about a place like Berkeley that takes you from academia to a startup like Databricks, which is a resounding success?
[3:41] I think most environments on the planet have certain structures and you get institutionalized into them. And these are the rules: "This is how we do things here at this company, or at this university. And we follow these, we're this way, we dress this way." At Cambridge you're not allowed to walk on the grass if you're not a professor, and so on. Berkeley's unique. Berkeley's kind of like, "Anything goes. You can change the world. Why not?" That's what it instills in everybody who lives in that city and attends that university, which leads to some interesting research that sometimes is kind of outlandish and not necessarily useful. There's this research on delay tolerant networks, which was about how we would communicate if we colonized the universe and wanted an internet spanning all the planets. Someone went off and did five years of research on that.
[4:28] But it also lets you think what’s wrong with the current data ecosystem? And if we want it to make machine learning really work, how would we do it? So, I think Berkeley definitely has the spirit of thinking outside of the box and doing anything you’d like. The joke is, Berkeley will come up with an innovation that’s groundbreaking, MIT will ignore it. Stanford will realize that it can be monetized, so it will monetize it. And then once it’s well-established, MIT will come in with the best last optimal solution to the problem.
[5:01] One of the peculiarities of the founding story of Databricks is that you guys have seven or eight co-founders, which is very unusual. In retrospect, what were the pros and the cons of having a large group like this?
[5:23] There are pros and cons for sure. If you know how to actually get a tight-knit group of seven people to really trust each other and work well together, amazing things can happen. I think a lot of the success of Databricks was getting all these seven people to really trust each other and do great innovation. Very few companies have the pleasure of having that kind of critical mass of thought leaders together. The downside can be, oftentimes founders, this didn’t happen at Databricks, but you see it all the time in other companies.
[5:55] The early founders, even if there’s two of them, they fight and then they split up early on, within a year or two. That’s the problem. So, it might be too many cooks in the kitchen could be a problem. We found a way where we really know each other’s strengths and weaknesses, and it’s made this journey an absolute pleasure for me. They always say the CEO job is the longest job on the planet. I never felt that way. I had lots of early co-founders with me that were always there. So, for us, it’s been an absolute strength. We wouldn’t be where we are if we didn’t have those folks.
[6:28] How did you go from this academic, very popular, open source project (Spark) to a company… And then from zero to $10 million in ARR. Were there any defining moments, perhaps any hacks, any growth levers that you guys used to go from zero to one, or zero to 10 in that case?
[7:06] I think the zero to 10 [million dollars] journey is very special. It’s very different from the rest of the journey. So, we’ve been through three phases and I can explain each of the three. But the first phase is really the product market fit phase. So, you have a product, can you find fit between the product and some audience that really loves that product? Can you make that happen? And there were challenges around that. I’m happy to explain what they were. And then once you found them, you’ve got to figure out what’s the channel that can connect that product with that market? So, you have product market fit, but what’s the channel to sell to them? There are different ways you set up the channel. And we actually got it wrong initially, so it took us a couple of years to actually figure out what the right tweaks to it were. So, those were definitely very special years where it’s a lot of experimentation to figure out what the right model for Databricks is.
[7:54] Actually, I would love to take you up on this and double click on this.
[8:01] Let’s start with the product and then let me talk about the channel. On the product side, we had an open source technology that we had built at UC Berkeley. That’s not necessarily what big enterprises needed. Big enterprises, they didn’t have PhDs from Berkeley working on this stuff. So, we need to significantly simplify this for them. So, we started hosting it in the cloud, but it turned out even the cloud version was too complicated for them to use. So, we started iterating with the users. And when you start interacting with them, you start realizing, “Wow, okay, we’ve gotten some things wrong. We need to significantly simplify it.” So, we actually started cutting away a lot of the features and functionalities. And actually, at some point, we actually reinvented it again. We said, “If we go back and we do it again, how would we do it if we knew everything we know now?”
[8:47] We came up with this technology called Delta, which is another open source project, which you can think of as Spark made really, really simple and automated for the large enterprise. So, that was one learning, right? When we were at UC Berkeley, we were thinking, "Well, you probably have a PhD if you're using this. We should probably give you every knob you need so that you can tweak and twist it, and do whatever research you need to do with this," right? And then when you start spreading this across enterprises, you realize not everybody has a PhD, and they don't know all the knobs. So, that's when we developed this technology called Delta. On the channel side, the mistake was that, really early on, we were big believers in product-led growth.
[9:26] We said, "We're going to build this beautiful simplified product that we now have. We'll put it online and it's going to be cloud-based. People will swipe their credit cards and come use it, and we'll be very successful." And for sales, we could hire inside salespeople that just hit the phones – young kids making calls. We weren't going to get the enterprise sellers. And we liked that model better, and who doesn't, right? It's cheaper and it's less complicated. So, that was a mistake. You don't get to pick your channel. You don't get to say, "Oh, I want my ASP to be 50K or 60." That's not your choice. You have a product. You have a market. If it has fit, you have to find the right channel to connect the two.
[10:05] The right channel: if your solution is a big data processing system that can provide artificial intelligence that's really strategic to big enterprises, then who makes the decision at those enterprises to say, "I will double down on Databricks and buy that"? It's some executive high up in that organization. The data scientist swiping a credit card doesn't have a say; they're five levels down in the organization. So, you need enterprise sellers that can actually connect there and speak their language. And you need to be able to talk to procurement to do that $5 million deal or whatever it is. So, we needed to change our channel to become much more enterprise focused. Those were the two big changes we made. Otherwise, it wouldn't have worked.
[10:50] We’ll come back to go to market, I’m very interested in this bottoms up versus top down motion. But let’s talk about product. One of the fascinating things to observe at Databricks has been the pace at which you guys have released new products and morphed it all into a platform. I had the pleasure and the honor of hosting your co-founder and CEO at the time, Ion Stoica, in 2015 and the conversation, I re-watched it before this, and the conversation at the time was all about the advantages of Spark over MapReduce, right? And that was very muchDatabricks through 2015. And fast forward to today, this seems, and correct me if I’m wrong, but that went from Spark to machine learning and AI workbench to the Lakehouse, which we’re going to talk about to now adding SQL analytics on top. Walk us through the product thinking – how one product led to the other.
[12:01] We started with Spark. It lets you get access to all your data sets, right? But with this, people were starting to create these data lakes in the enterprise, which means a place where they could store all their data cheaply. And towards that goal, they were extremely successful. So, people were amassing huge amounts of data in the data lakes. But after a while, the business leaders were saying, “Well, I don’t care how much data you have there, what can you do for me with that data?” And that’s when we were trying to build these applications on top of it. The machine learning use cases and real-time use cases and they were struggling. So, they would bring us in for professional services in 2015. And in 2015, we looked at why are we doing so much professional services to help these folks?
[12:43] Our revenue was tiny, and we started looking at the use cases and realized it was too complicated. There was too much configuration. That's why they were pulling us in. So, that's when we said, "If we had to redo it, and we had to simplify, what would we do?" Delta was the first thing we started working on. We didn't open source it initially – that's why people get the timing of the different labels confused. The first innovation was really Delta, which was redoing Spark in a way that's really enterprise friendly, super simplified, that enables you to get all these use cases right on top of it. So, that was number one. Once that had broad adoption – I think Apple on stage talked about how they built a SIEM security system on top of it, and so on.
[13:24] We started looking at what are people doing with that data? And it was very natural for us to then go downstream and say, a lot of people were excited about data science and machine learning. But the problem was the ecosystem of machine learning was too spread out. Every university was coming up with a new thing. Every company was coming up with the next thing. And the data scientists wanted to use this, and the IT departments were freaked out saying, “We can’t support all of this.” So we built MLflow, which basically was the idea that, “How do we get all these projects together? What would be the glue in machine learning needed to get all the ecosystem together?” So that was a mouthful, right? So, now we have covered the data science and machine learning use cases.
[14:01] That’s when we set our sights on, “Okay. If we want to broaden databases to even reach bigger audiences, not just the data scientist and machine learning and data engineers, how do we reach, really broad mass?” That’s when we started targeting business analysts. Business analysts, they were used to dashboarding like Tableau or Power BI. And they wanted to use just SQL at best if they wanted to do something advanced. So that’s when we started a few years back, I would say three years back, working on our basically data warehousing capabilities, but building it into the core infrastructure which we call the Lakehouse and we announced that last year. So, our secret sauce is: look at the enterprise problem, figure out what that is, understand it deeply by being really customer obsessed, bring the problem back, have the innovators, the PhDs that know how to solve these problems, solve the problem, iterate quickly in the cloud with the customer. Once it has product market fit, open source it. Build huge open-source momentum, almost like a B2C viral thing. And then, monetize that with a SaaS version in the cloud.
[15:04] This was inspired by AWS. We thought AWS was the best open source company on the planet when we started Databricks, right? They had all this open source software. They hadn't developed it; other people had developed it. But still, the monetization model was: pick up open source, host it, and make a lot of money on it. And we just tweaked that. We evolved it. We said, "That's a great business model. We're going to have the AWS business model. We're going to host open source software in the cloud. But the difference is, we'll create the open source software. That way, we get a competitive advantage with respect to anyone else who would want to do the same thing." Otherwise, anyone can pick up any open source software and host it in the cloud.
[15:39] That’s great. Fantastic. So much to unpack here. So, let’s start with the Lakehouse and maybe walk us through the evolution of data lakes and data warehouses and how the Lakehouse is the best of both worlds.
[15:54] It’s pretty simple, actually. People have data lakes where they’re storing all their data, video, audio, random texts. Anything they find, they just dump it there, right? It’s quickly, cheaply, it’s distributed around the world. Everybody’s doing it. Every enterprise is doing it. Then, they can do machine learning and those kinds of things. Those datasets, those variety of datasets, you typically do machine learning. So data lakes, you do AI on them. So, AI, data lakes. Okay. You want to do BI, not AI, you use data warehouses. So, there’s a separate technology stack for data warehousing and BI. If you think of it, both of them are the same thing, same datasets, right? But some are video, audio, just more advanced, but a lot of it is similar. And then BI is used to ask questions about the past. What was my revenue last quarter?
[16:36] AI is used to ask questions about the future: which of my customers will return in the future? So, today, you have two separate stacks for this, and you have to have two copies of the data, right? And you have to manage this, and it creates a lot of complexity. That's not how the FAANGs were doing it back in the day. They had one unified platform for it. So, the idea is to unify these two into one platform – the lakehouse: the data lake for the AI piece, asking questions about the future, and the house, the warehouse, as the structured part, asking questions about the past. The combination of the two will enable enterprises to move faster. And it's one platform for data engineers, data scientists, and also business analysts, so that they can work together across the enterprise. So pretty simple: it's just one data platform for AI and BI.
[17:21] What was the big technical breakthrough that enabled that layer of structure on top of the data lake – was it Delta Lake? I think Iceberg came out around the same time. How does that work?
[17:38] Yeah. There were four technological breakthroughs that happened at the same time, 2016, '17. The one we contributed was Delta Lake. There was Hudi, there was Hive ACID, and there was Iceberg. So, four technologies started at the same time. As with a lot of breakthroughs in science, that's kind of what happens: several groups crack it at the same time, like DNA being cracked in the US and in the UK. So, the problem was this: you had all this data in the data lakes that people had collected. It was super valuable, but it was very hard to do structured queries on it – basically SQL, basically BI. For that, you needed a separate data warehouse. Why was it so hard? Because the data lakes were built for big data, large data sets.
[18:21] They weren’t built for really fast queries. So, they were just simply too slow and they didn’t have any way to structure the data and give it tabular form. That was the problem. So, how do you take something like a big blob storage for data and turn it into a data warehouse? So, that’s the secret sauce was these projects. We basically figured out ways to work around the inefficiencies of these data lakes and enable you to get the same value you would get out of data warehouse directly there on your data lake. So, that was those projects. And, they were published in academic conferences around the same time and immediately they got a lot of attention from enterprises because enterprises had so much data in the data lakes. So, it was a really bad option for them if they had to move it out and put it in a data warehouse or move it to some other system, because this data has gravity.
[19:18] Are there any trade-offs to this approach?
[19:21] Not really. You can have your cake and eat it too. I know it sounds crazy, but you can. It's reusing a lot of the techniques that were invented in the eighties and nineties by data warehousing vendors, and adapting them to work on the data lake. You could ask, "Why did this not happen 10 or 15 years ago?" The ecosystem of open standards didn't exist. It slowly emerged over time. It started with the data lakes, and then there was a big technological precursor to the breakthrough we're talking about here, which was standardized formats for the data. They're called Parquet and ORC – data formats that the industry standardized all their data sets on.
[20:04] Those kinds of standardization steps were needed to get this breakthrough of the lakehouse. It's kind of like USB: once you had it, you could connect any two devices with each other. That was what was needed for the industry. So, slowly, what's happening is that in the open source realm, an ecosystem is emerging where you can do all of your analytics in this lakehouse paradigm. And eventually, it will be the case that you will not need all these other proprietary old systems that people have had since the eighties – the data warehouses and other systems like that.
[20:33] Actually, to this point, that was going to be my question. There’s a lot of industry chatter about the big upcoming clash between Snowflake and Databricks, as two gigantic companies in the space. So, is your vision of the future that the lake house eventually becomes the paradigm and then everything else over time gets absorbed, or do you view a future that’s more hybrid where you have data warehouses to do certain things and lake houses to do other things?
[21:04] I’ll answer it in two ways. And I really do mean both of the ways. I’ll start by saying, it’s kind of like people make it about zero sum. But if you answer it like this, do you think Google cloud will eliminate the Amazon cloud and Microsoft cloud, or do you think Amazon cloud will eliminate the other clouds? Nobody thinks that, right. They’re going to be around. They’re all going to be successful. The data space is huge. There’s going to be lots of vendors in it. I think Snowflake will be successful. I think they, right now, have a great data warehouse. It might be the best data warehouse in the market, maybe BigQuery would give them a run for their money. But it’s a great data warehouse. It’s certainly going to co-exist and it already co-exists with Databricks in probably 70% of the accounts we’re in.
[21:47] I think that’s going to continue to be the case, and people are going to use data warehouses for BI. But if you asked me long-term, the answer is yes, to your question. Long-term, I think the lake house paradigm will win. Now, it might be that the other vendors like Snowflake completely embrace it and revamp what they have to become that, or other players come along in that space. But in the long run, this is going to be the architecture that wins. Why? Because the data has so much gravity. All of it is sitting in these data lakes and more of it is getting into the data lakes. And the cloud vendors have a vested interest to drive more data to their data lakes. Therefore, any solution that makes that really valuable, is going to be the future. So, yes, I think in the long run, more and more will gravitate towards this lake house approach.
[22:32] Can you double click on SQL analytics, which is the most recent major release and major product addition, and including how you work with the existing ecosystem of BI solutions?
[22:46] That’s really our business analytics, business analyst, warehousing offering directly on the Lakehouse So, it has all the classic pieces of a data warehouse engine. So, in the past, when someone wanted to do SQL or warehousing on Databricks, we would offer them Spark. Spark has SQL, but Spark was written in Java. It couldn’t have the performance of the best in class data warehouses.
[23:11] So, two or three years ago, we set out to re-implement Spark in C++, as what's called an MPP engine – a massively parallel processing engine that's really, really fast. Basically, a modern data warehousing engine written in C++ for modern hardware, using what are called SIMD instructions: modern hardware can do lots of instructions in parallel on the same data, right? So, it's perfect. It's really building warehousing capabilities straight into the Lakehouse. That's what we announced last year. We're excited about it. We're seeing huge performance improvements, and we're actually going to reveal a lot of them next week, or in two weeks, at our Data and AI Summit. So, that's really exciting.
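The "vectorized" execution he's describing processes a whole chunk of a column per operation instead of interpreting values one row at a time; the tight loop over contiguous data is what lets a C++ compiler emit SIMD instructions. A rough sketch of the row-at-a-time versus batch-at-a-time difference (Python as a stand-in for the C++ engine, so the speedup itself isn't reproduced here; the function names are invented):

```python
from array import array

# Row-at-a-time ("Volcano-style") evaluation: interpreter dispatch
# overhead is paid on every single row.
def row_at_a_time(prices, quantities):
    total = 0.0
    for i in range(len(prices)):  # one dispatch per row
        total += prices[i] * quantities[i]
    return total

# Batch-at-a-time ("vectorized") evaluation: each operator consumes a
# contiguous column chunk, so the inner loop runs over dense data --
# the shape a C++ compiler can turn into SIMD instructions.
def vectorized(prices, quantities, chunk=1024):
    total = 0.0
    for start in range(0, len(prices), chunk):
        p = prices[start:start + chunk]      # contiguous chunk of one column
        q = quantities[start:start + chunk]
        total += sum(x * y for x, y in zip(p, q))
    return total

prices = array("d", [1.5] * 10_000)     # column of 10,000 doubles
quantities = array("d", [2.0] * 10_000)
assert row_at_a_time(prices, quantities) == vectorized(prices, quantities)
print(vectorized(prices, quantities))  # 30000.0
```

Both functions compute the same aggregate; the design difference is purely in memory-access shape, which is why vectorized engines pair naturally with columnar formats like Parquet.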
[23:52] Right. Which is on May 26 through 28, I believe. That used to be the Spark Summit, right?
[24:02] It went from Spark Summit to Spark and AI Summit because lots of people want to do AI. And then, our customers and the attendees pushed us, now it’s Data + AI Summit. It’s much broader, and I think we had 60,000 or 70,000 people attend last year. So, I encourage you to check it out.
[24:19] What’s on the roadmap?
[24:23] I think this Lakehouse vision and paradigm is very ambitious. So, continuing to build that out and moving up the stack is where we're headed next. That's going to take a lot of resources and effort, which is why we've raised so much funding. I think also, more and more, people want visualization layers, so that's something that's in the works at Databricks. We're doubling down a lot on that aspect. People want to be able to visualize and understand the data. Low-code, no-code – there are more and more asks for, "What if I don't want to code at all? What if SQL is too complicated?" So, those are all areas where we're exploring and thinking about the best way to build them out. But yeah, we're definitely going to continue to move up the stack, and then commoditize the stuff below by open-sourcing it, releasing it to the market, making it the standard, and then moving up the stack with innovations.
[25:19] Still on the product front from an organizational perspective, I’d love to better understand how your product and engineering team is organized. And again, put this in context for people. It’s very rare for a company to be able to do a second product on top of a successful first one. But here, we’re talking, and maybe that’s not the right way to think about it, we’re talking about three, four, five different products. So, how does that work? Do you have a product and engineering team assigned to a product and another one is sent to another one, or is it more horizontal?
[25:55] This was deliberate in how we built Databricks from the beginning. We didn't want to be a one-trick pony. When we had Spark and the founders were discussing what the name of the company should be, a lot of us said, "Maybe it should be Spark, or Spark-something," right? Just like the Docker company was called Docker. But that's when we said, "No, no, no. We're going to lay one brick at a time. It starts with Spark, but eventually Spark becomes too old and we get rid of it. Then we move on to the next thing. It's going to be lots of data bricks that we lay over time." So, that was the whole thinking from the very get-go when we started the company. How do you actually do that effectively? I think it's really important that you separate the innovations from the existing cash cows.
[26:34] There’s a great book on this called Zone To Win. In Zone To Win, they talk about how almost you need to configure your company to be the opposite. When you’re coming up with something new, you need to iterate quickly. You need to have the people, the engineers directly talk to your customers, not necessarily even have product management doing that. Innovate fast, iterate and almost have a new startup. On the other side, you need enterprise readiness and you need a much slower cycle to iterate, a different type of marketing messaging to resonate with business leaders instead of the people using the technology. So, we actually configured a company that way and we tell them, I’ll tell them, “Are you in the disruptive innovation or are you just in the maintaining the existing innovation?” which is a concept from that book. So, we set them up that way.
[27:21] Also, all of engineering and product is separated into two pieces. One focuses on the things that large enterprises need: encryption, security, authentication, stability, and so on. The other piece focuses on these innovations. You should actually separate those out in the org chart, because otherwise, what happens when you're successful is that the former gets all of the resources, because the big enterprises have infinite demand for the things that you're doing. So you keep on building the things that expand your TAM. A customer says, "I need that security feature. Otherwise, I can't even look at your product."
[27:58] Okay, we have to add that – that's a TAM expansion you're doing. But actually, that's a security capability; it doesn't have any innovation in itself, typically, unless you're a security company. So, separate these two out, make sure they're operating differently, and fund both over time. There are companies that have done it well. If you look at Amazon, it's not a one-trick pony, right? It keeps coming up with new innovations like AWS. So we wanted the company to be that way – therefore the name, Databricks.
[28:27] And to add one more layer of complexity, there's the whole open source to commercial dimension, right? MLflow, Delta Lake, Koalas, which we haven't mentioned yet. Do those fall in the innovation camp, or are they the sub-layer of the commercial camp?
[28:44] No, these are all in the innovation camp. Of course, some of these projects, when they get older, like Spark, move into the maintenance side, and we typically also move the people around. So it's the same people that do the innovations over and over. We try to grow more of those innovators, but we try to move the people that really have a knack for cracking the zero-to-one onto the next problem, and then hand over the existing projects to other people who want a chance and a career to run, let's say, Spark, which is a hugely successful project, right?
[29:13] It’s a big career step-up for someone to get that responsibility. When we moved the person that created it to something else to create the next thing. And we also find who are the ones that are good at zero to one things. And we actually experiment. We give people in R&D a chance to go experiment with the zero to one things and they don’t always succeed. It takes a couple of tries until they become really good at it. So you have to think deliberately about this kind of high failure strategy.
[29:42] If you were going to start another enterprise software company today, would you go open source first?
[29:48] Yeah, I think it's superior. If you think of it from an evolutionary standpoint, it's evolutionarily superior to the previous business models. Why do I say that? Because any proprietary software company out there is ripe for disruption by an open-source competitor. Anything that's proprietary can immediately be disrupted, just like Windows got disrupted by Linux. And that's as advanced as it gets, right? Operating systems are really complicated technology, low-level software for different types of hardware. You wouldn't think some guy out of a university would invent one that would become the standard in industry. Any proprietary software is ripe for disruption like that. The question is, can you make money on it? And that was really hard, through Red Hat and all these companies that were doing support and services, until Amazon Web Services cracked the code on the business model.
[30:42] The business model is: we run the software for you, and you rent it from us. That's a superior business model because you can then have a lot of IP that's very hard to replicate. So the next company I start would be that. And if your next question is going to be which area I would start it in, I would do it in AI. I'm just so pumped up, because these are early days. We're just scratching the surface of AI, especially operational AI. It's going to get embedded everywhere. I know it's a cliché: Marc Andreessen said software is eating the world. We really believe that AI will eat all of the software. Any software you have, AI will creep into it, just like software crept into your car and your refrigerator and your thermostat. The same thing will happen here. These are really early days. Anyone who joins or starts companies in the AI space, they're early; they could be the next Google. So that's what I would do.
[31:41] Music to the ears of this group, for sure. We talked about open source, we talked about the go-to-market. At this stage, as a very late-stage startup, if people can still call you a startup, where does open source fit in the go-to-market motion? And coming back to the earlier part about bottom-up versus top-down, who does what? Do you still have a BDR group alongside the AEs? How do they all work together without stepping on each other's toes?
[32:16] Databricks is a hybrid model: top-down and bottom-up at the same time, combined. We started, as I said, with bottom-up, but we've kept it. So yes, we have BDRs and SDRs. They create opportunities that they then hand over, right? It's a funnel that starts with marketing and feeds from marketing into the SDRs. The SDRs get some of the leads from marketing and generate some directly through outbound, and then it goes to the sales team.
[32:43] There is also a very interesting bottom-up, completely free, freemium-tier funnel. Databricks Community Edition is a completely free, use-it-all-you-want, never-pay-us funnel where you can use all of Databricks. You only get a slice of a small machine, so you get a taste of the real big thing, but you can use it forever. That generates leads that also feed into the SDRs. So that pipeline is really important: half of the leads that come to sales come from it. That's why open source is an important engine for us. And if we were just doing Spark, like when we were on this show in 2015, that would probably have been 25%, because over time these technologies mature and the excitement around them wanes. So that's super important to us.
[33:48] Now, we also have the classic enterprise sales motion where you might use your Rolodex and go talk directly to the CIO. But what happens is that the developers are becoming more and more powerful in those organizations. So the CIO says, "I had a great conversation with the CEO of Databricks. I'm exploring this technology, but I'm worried: is this the right choice for us?" And there'll be people inside that company who say, "Yeah, I use Community Edition. We don't need to do a six-month POC. I know these people, they're really, really good. They're from Berkeley, I'm a big fan, I've used the tech, I went to some meetup, I follow them." So that helps corroborate the use case, and you can eliminate the whole POC phase, because they already know what it is. Compare that to 10-20 years ago, where a sales guy comes in and explains how awesome the software is, but you can't trust him, so you have to launch a POC and actually set up the software on-prem. We can cut through all of those layers. So we combine top-down and bottom-up, and both are really necessary for Databricks to succeed.
[34:41] One last question from me, and then we'll have a bunch of other questions in the Q&A, and we know you need to run sometime soon. Completely switching tack, a question for the entrepreneurs and founders in the audience, almost at a personal level: as you've grown from CEO of a, by definition, small startup to now a mega startup and, soon enough, a large public company, how do you scale yourself? What have you learned along the way, and how have you switched from the job of being the visionary storyteller to running a global organization?
[35:30] Yeah, it really boils down to finding the right leaders that you can trust and building trust with them. It's as simple as that: can you find the right leaders that you can trust? I could spend all my days on shows like this and the company would continue to run itself. Why? I have a great sales team that's well-functioning, so I don't have to be directly involved in it. I have great marketing. I have great engineering. And why do I have those great departments? Because I have great leaders in those departments, and I trust them, and we built this trust over many years. So it really boils down to that. I know it sounds simple or silly, but that's the problem you have to solve. A lot of early-stage founders, and I certainly had this problem as well in the early stages, have this situation where you think the people running your departments don't know what they're doing, so you have to do it yourself.

[36:15] It's about me, me, me, and then you go in and you have your fingers in the pie all over the place. That doesn't scale, because as your organization gets to 150, 200, 250 people, past Dunbar's number, you can't even remember what's going on anymore. So you feel completely inundated, behind all the time, and frustrated. And when you hit a thousand people, it's a whole other deal, with a Japan office that might not even speak English. So you just have to find the right leaders that you can trust, and they have to repeat this all the way down. Then you have to find ways to connect with the organization and communicate with them that are not direct communication; it's indirect communication through the leadership. You have to cascade it down.
[37:00] How do you find them? Do you have a bias towards promoting people internally, or do you think it works better if you bring in sort of snipers from outside who have done that stage of the company, how do you approach it?
[37:14] It's so hard to find great leaders who work with your culture and with whom you can build amazing trust that I think you shouldn't exclude any options. If you can promote people from within, great. But if you only promote from within, you're probably not getting the experience that exists in the market, and that experience can be super valuable. People who have seen the movie, you need to bet on them as well. One of the things we look for is people who have actually built. The joke I tell is: do you have a driver's license? Yes, I have one. Are you good at driving a car? Yes, very good. Why are you asking? Well, can you build a car with your bare hands? And people say, okay, I get it. So can they build it, not just drive it and maintain it?

[37:55] They have to have built the phase we're in now. I'm not saying they have to have built a $28 billion company from zero; that's not what we're looking for. But they have to have, in our case, taken a company to a few billion dollars of revenue, or seen that kind of phase in engineering or marketing or wherever it is. So that's what we're looking for: did they build it? And then we look at whether they had first-principles thinking when they built it. Were they just along for the ride as those companies were going through that phase? Or are they actually first-principles thinkers who can reason about how to build this? Are they the artists who actually create these things? That's really important. I think IQ is very important as well, the smarts to figure this stuff out.
[38:37] Then culture is this complicated thing that people talk about. You have to have culture, but for me, a lot of it is: "Can I get along with this person? Do I want to spend 10 hours a day with them? When things get really rough, with difficult problems, is this a person I can solve problems with and get along with?" That's going to be really critical. So what you do is you just spend a lot of time with the person; it's really not that hard. How do you decide who to marry? You spend time with them. Do you like them or not? It's the same thing here. It's sort of a marriage, right? You're going to work with this exact person for so many hours over the next five years, and it's going to be so difficult. So spend a bunch of time with them, off work and at work, and try out the problems you have with them. Like, "I'm really thinking about doing this, but I don't like this aspect. What do you think?" Hear them out, argue with them, and see: do I think the two of us could do great things together? Then hire them. If you see that it's not really working, then that's probably a cultural mismatch.
[39:36] So, rapid-fire questions from the group. A question from Danny: where does Databricks fit in the emerging data mesh architecture and the productization of data?
[39:50] Let me explain what a data mesh is as I understand it, or as I see it. As an organization gets really large, I'm talking about a hundred thousand or a million employees, does it make sense to have a centralized data team where everybody goes to that team and says, "Hey, can I get this data set added to the centralized data repository", whether it's a lakehouse or a data warehouse or whatever it is, "and can you prioritize it, and can you clean it this way for me, and can you make it available that way, and I want this tool and I want that tool"? As you can imagine, that will never scale in a larger organization. That team would become a bottleneck. They would not understand how to prioritize the different projects because they don't understand the asks of marketing, sales, and customer success, and everything would end up on the back burner.
[40:34] They would not sympathize with the departments. The departments would get frustrated and, over time, start building up shadow data teams internally. So the data mesh is about how we organizationally decentralize this to empower the different organizations to go ahead and build what they need themselves. Internally, Databricks runs like a data mesh today. Finance has its own Databricks, and they use it to do revenue recognition: they predict revenue using machine learning and AI. Customer Success runs its own lakehouse on Databricks, and they figure out which customers are churning. Product runs Databricks themselves: they develop Databricks but also run it, and they use it to figure out which features are being asked for by our customers. So it's a way in which you decentralize the organization while still having centralized governance, auditing, and oversight.
[41:29] And you might have some really core data sets that a centralized team runs, but you enable the others. The lakehouse paradigm actually enables this. So it's an organizational structure more than a technology, and we absolutely embrace it, but you also need technology support. That's actually why the lakehouse is so important: you can't tell the whole organization, "Hey, everybody, buy this data warehouse, otherwise this isn't going to work." If you want a real data mesh, you have to have some flexibility, some openness in what the departments can use, but you want to be able to centrally govern it. The lakehouse is perfect because it is open, it's based on open source, and it works with the ecosystem of tools. So you can allow for the variety and diversity that you need in the different departments without ending up with anarchy, where everything is different and there's no centralized schema, no discovery, no security model. So we believe in the data mesh, and a lot of people are using Databricks to build data meshes.
[42:28] One slightly more technical question from Danny about managed cloud containers for compute. Danny is a customer of Databricks and says a huge reason Databricks was a value-add was that it let machine learning engineers self-serve: scheduling jobs and hosting data engineering pipelines for incremental data warehouse loads. So, what's your perspective on managed cloud containers? Are they part of your long-term strategy to support multi-cloud in the future?
[43:08] If the question is around Kubernetes and containers and Docker and things like that, I think it's great. Again, like USB, it's another standardization layer that makes it easier to move things across different environments. So we support it, we think it's great, and we're going to offer it wherever you want to go. However, our experience is that you need much more today. We're huge fans of Kubernetes and Docker; we standardize everything under the hood on Kubernetes, and it enables us to move between the clouds. But in the data space, you need more. You need a catalog, you need data discovery tools, you need ways to search for your different data assets, you need ways to secure your data assets, you need ways to dashboard it, and you need ways to query it.
[43:50] So, in some sense, we're trying to build Databricks such that it becomes that open, standardized layer that lets you move between the different clouds. But it will absolutely also have plugins so that you can bring your own containers and run your own Kubernetes apps or operators on it as well. This is what I mean by the ecosystem of open infrastructure that's being built up; that's what's gone on in the last decade or two, and we're excited about it and want to be part of it.
[44:25] No IPO in the immediate future? You raised a lot of money recently, so you don't have to go public right now?
[44:33] We haven't set a particular date for anyone. We're going to be IPO-ready this year; we've been marching towards that and are pretty far along in terms of the readiness of the business everywhere. But exactly when we're going to go public, we haven't shared. It's also not something that I obsess over. A lot of people ask about this, but the way I think about it, Databricks is on a long journey that's going to take many decades, and the IPO will be an initial public offering that happens at some point, and then nobody looks back at it, right? Nobody looks back at the Facebook IPO right now and obsesses: should Mark have done it six months earlier, or should he have done it later, and what exactly was the price? Let's go back and analyze that decision? It doesn't really matter for what happened to Facebook in the decades that came later.
[45:26] I think you said, when you raised the billion, that it gave you a lot of the advantages of an IPO without having to be public just yet, right? That was the thinking?
[45:35] It was great to get that kind of capital so that you can really invest in R&D, do the innovations you want to do, and also double down on the go-to-market. It's expensive to set up. When you set up your Japan team or your China team or your Korea team, it's like starting another company again from scratch: you need an HR and legal team, you need partners, you need marketing, you need all of that. So it's almost like starting all over. That's costly, and you don't necessarily see the return on investment immediately; it takes a few years.
[46:06] All right. This was absolutely wonderful. Thank you so much, on behalf of the entire Data Driven NYC audience, for sharing all of this so candidly. Very fun and incredibly exciting to hear about the journey, and best of luck for what's obviously going to be an incredible future for the company. Really appreciate it; I look forward to seeing what you guys do over the next few months.
[46:39] Thanks, Matt. I love your questions. Thank you.