The last couple of years have seen a dramatic acceleration in the adoption of graph databases, a category of databases that stores nodes and relationships instead of tables or documents. That acceleration has clearly benefited Neo4j, which had a banner year in 2021, surpassing $100M in ARR and closing a $325M Series F financing round at over $2B valuation, which it calls “the largest funding round in database history”.
That would make Neo4j an overnight success, except for the fact that Neo4j started in 2007, pioneered the space and literally coined the term “graph database”.
Neo4j’s CEO, Emil Eifrem, had spoken at Data Driven NYC back in 2015 (the same night as the CEO of Snowflake and the CEO of Airtable, a pretty stacked line up considering those three startups combined went on to represent many billions of market cap/valuations).
So it was particularly fun to have Emil back at the event and exciting to hear about the major progress the company has experienced over the last few years. Emil spoke from Sweden at around midnight his time, bringing impressive energy despite the late hour and it was a great conversation.
Below is the video and full transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen, Karissa Domondon and Diego Guttierez)
Transcript [edited for clarity and brevity]
[Emil Eifrem] We raised this round last summer and it’s the first time that we went out with some numbers like the valuation, which was north of $2B. It is actually the largest round in database history. Mongo, for example, who’s kind of the early mover in the broader, modern, non-relational database space, raised a total of about $300 million cumulatively, so [ours] was the largest round in database history.
That’s exciting because graph databases, the category we helped define and evangelize at events such as Data Driven back in 2015, has kind of by and large been seen as a really valuable corner of the database market, but also a niche market. It used to be that people would say, “Yeah, great technology, kick-ass CEO.” Okay. Maybe not too much, but really just useful for social networks.
That used to be the thing back in the day. Then fast forward five years, it’s like, “Well, great technology but really just useful for a few use cases.” Then every year, those use cases start expanding. Of course we have the privilege of firsthand information, so we see the breadth of use cases, and the perception is always lagging that, naturally. The fact that we then went out and raised this big round was one of the signals that the category is truly taking off.
How big a company is Neo4j?
We’re just north of 600 people, I have no idea how many we were back in 2015. We actually just earlier today went out with a momentum release where we talked about how we crossed $100 million ARR last year. Just to give a flavor, I think there’s five database companies that have crossed a hundred million, let’s call it the NoSQL crowd, or modern operational database companies. It’s MongoDB, and then it’s us and Redis, who are on that kind of MongoDB path, and then there’s Couchbase and DataStax that have traditionally been on maybe a little bit of a different path right now. They are growing maybe at a slower pace and plateauing. Maybe they’ll turn around and become amazing again, but it’s really down to Mongo and us and Redis, who are in that cohort at the moment.
Why is this space accelerating, going from niche to much broader acceptance? I’ve seen the chart, that famous chart on DB-Engines which showed that graph databases is by far the fastest growing category in databases. And I read somewhere that Gartner calls graph databases the foundation of modern data analytics, so what’s happening?
There’s a lot of factors that I think are contributing to and accelerating and enabling the broader shift towards alternative databases that aren’t specific to graph databases. Things like the platform shift to the cloud and then there’s advancements in architecture like microservices and containers that enable you to more easily swap in a new type of database, stuff like that, that is as applicable to any database as to graph databases. The thing that’s specific to us is this broader trend around the world becoming increasingly connected and the fundamental premise behind what we do is super simple. Actually, in fact, today people might even call it simplistic, right? Which is what I just said. Everything is increasingly connected, hardly a controversial statement on a Zoom call from New York. I’m in Malmo, Sweden, right now, a bunch of people are, I’m sure, calling in from New York, but also elsewhere probably on the planet.
So everything is becoming more connected. We all know that intuitively. But the consequence of that – that is a little bit more subtle. What is data? This is Data Driven NYC. Well, data, information, describes the real world. As the real world is becoming more connected, data is becoming more connected and that’s neither good nor bad, that’s just an objective observation of what’s going on. But what that implies, and the consequence of that, is that connected data exerts this massive amount of pressure on the traditional relational database, because the normal relational database works with tables. You can model connected data in tables: you call them foreign keys, and you have a record with an identifier and then you have another record with another identifier. So Matt, you have ID 3 and I have ID 7 and we’re connected, so then it’s a three and a seven showing that we’re connected. You can do it, but it’s really awkward, and if you want to query along it, if you want to find patterns, how do things fit together? It completely starts breaking down. So what we did a hundred thousand years ago, when dinosaurs ruled the earth, is we came up with this concept of what’s called a native graph database: we’ve optimized every layer in the stack of the database architecture completely around connected data.
We’re not built on top of a different database running in the back; it’s a native architecture. That means that if you want to query along how things are connecting, or want to find patterns in that, we are frequently not 50% faster or 100% faster, we’re a thousand times faster. Our customers frequently tell us that we’re a million times faster. So when you want to do a recommendation engine, you want to find patterns in, “Wait, who is Matt similar to and what have they purchased and how are they connected and connected to the product hierarchy?” That’s typically 5, 10, 12, 15 hops in a connected data structure. The graph database is freaking amazing at that.
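To make the “ID 3 and ID 7” example concrete, here is a minimal Python sketch. It is an editorial illustration, not a description of how Neo4j stores data: the names, IDs, and the helper functions `build_adjacency` and `within_hops` are all made up. It stores connections as rows of ID pairs, the way a relational join table would, then indexes them so that following a connection is a direct lookup. In SQL, each extra hop over such a table would typically mean another self-join or a recursive query.

```python
from collections import deque

# Relational-style storage: one row per connection, as in the "ID 3 and ID 7" example.
# The names and IDs are illustrative only.
people = {3: "Matt", 7: "Emil", 9: "Ana", 12: "Raj"}
knows_rows = [(3, 7), (7, 9), (9, 12)]  # (person_id, other_person_id) pairs


def build_adjacency(rows):
    """Index the connection rows once so each hop is a direct lookup."""
    adjacency = {}
    for a, b in rows:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def within_hops(adjacency, start, max_hops):
    """Breadth-first traversal returning {node: hop distance} up to max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]  # the start node is not part of the result
    return distance


adjacency = build_adjacency(knows_rows)
reachable = within_hops(adjacency, 3, 2)
print({people[i]: hops for i, hops in reachable.items()})
```

Raising `max_hops` to 5, 10 or 15 changes nothing in the code; only the amount of the graph that gets visited grows.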
You coined the term “graph database”, if I remember correctly, when you started the company in 2007. You were literally at the origin of the space which was just your idea and has now become a whole space with different companies and competitors. To recap, a graph database is a database that elevates relationships as first class citizens, as opposed to just rows and columns. The product understands how things/entities are connected to one another. In the simplest layman’s terms, is that correct?
That’s spot on.
What are the use cases? You just mentioned recommendation engines and I think Airbnb is a classic example of that, but give us a range of the different use cases including how Neo4j customers use the products.
Recommendation is one example, fraud detection is another one.
Traditionally you wouldn’t think of it in that way, but what all fraud detection software is doing is it’s trying to find anomalies. It would chart out, let’s say it’s credit card fraud, you would have two dimensions. One is the number of transactions, the other one is dollar value per transaction. Then you would create a scatter plot of that and you would find the band of what’s normal, and then everything that’s outside of that is an anomaly. So – “Dear fraud detection analyst, investigate that anomaly”. Basically like that, except it’s not two dimensional, it’s like 19 dimensions or something like that, but conceptually it’s the same. That’s great, it’ll capture a bunch of different things. What it won’t capture is: what if you have a number of transactions that are all within this band of what’s normal, but they’re connected in fraudulent ways, like a fraud ring? The only way you can find that is if you can operate on connected data, and that’s what graph databases do.
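The fraud ring idea can be sketched in a few lines of Python. This is a hedged illustration under invented assumptions: the account records, the choice of shared phone and address as the linking attributes, and the `find_rings` helper are hypothetical, and a real system would use far richer signals. The point is only that each account can look normal on its own while the connections between accounts reveal a cluster.

```python
from collections import defaultdict

# Hypothetical accounts; assume each one's transactions look normal on their own.
accounts = {
    "acct_a": {"phone": "555-0101", "address": "1 Elm St"},
    "acct_b": {"phone": "555-0101", "address": "9 Oak Ave"},
    "acct_c": {"phone": "555-0199", "address": "9 Oak Ave"},
    "acct_d": {"phone": "555-0142", "address": "4 Pine Rd"},
}


def find_rings(accounts, min_size=3):
    """Group accounts that are linked, directly or indirectly, by a shared attribute."""
    parent = {name: name for name in accounts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Accounts sharing the same (attribute, value) pair are connected.
    owners = defaultdict(list)
    for name, attrs in accounts.items():
        for key, value in attrs.items():
            owners[(key, value)].append(name)
    for linked in owners.values():
        for other in linked[1:]:
            union(linked[0], other)

    # Collect connected groups and keep only those large enough to be suspicious.
    groups = defaultdict(set)
    for name in accounts:
        groups[find(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) >= min_size]


rings = find_rings(accounts)
print(rings)
```

Here `acct_a` and `acct_c` share nothing directly; they end up in the same group only because both are linked to `acct_b`, which is the kind of indirect connection a per-transaction anomaly check would not see.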
That’s another classic use case, but then you have a bunch of other things like: customer 360 (how’s my individual customer connected to external social media but also all of my internal systems), or data lineage, very important in regulated industries. How does an individual data item evolve over time? For GDPR and compliance reasons, you might need to do that. Then entitlement or identity and access management, KYC, you go down the list. It turns out that there’s a lot of use cases where the value is in how things fit together.
Then coming back to your original question, why is the category taking off? I say well, it’s because everything is becoming more connected. I’ll give you an example of this. When you and I first met in 2015, supply chain was not a use case for Neo4j. Why? Because most companies that produce physical goods, that produce stuff, they might have a supply chain that is two, three levels deep. If you want to digitalize that and analyze it, you can shove that into a classic relational database, a little bit awkward, your engineers will have to compute some joins and whatever but doable. Fast forward to today, in 2020 in particular, and the start of the pandemic for sure. Today in 2022, any company that is producing physical goods is tapping into this global supply chain, spanning continent to continent. That is frequently 20, 30 hops deep and all of a sudden, if you recall last year, the Suez Canal was blocked for a week. Then how does that cascade across my supply chain? Well, the only way you can figure that out is by digitalizing your supply chain and then all of a sudden you’re dealing with this deeply connected data structure. If we abstract that and we figure out what’s actually happened here. What’s happened is that it’s actually the same use case as back in 2015 when I was on stage in New York. It’s just that it’s exactly the same use case, but the world is so deeply more connected now and therefore data becomes more connected, therefore it’s now a kickass use case for Neo4j and graph databases. This is just happening across use cases, across industries, across verticals and that’s the wind behind our back.
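The “how does that cascade across my supply chain?” question is a reachability query. Below is a small Python sketch of it, an editorial illustration only: the supply graph, its node names, and the `downstream_impact` helper are invented, and a real supply chain would be far larger and deeper.

```python
from collections import deque

# Hypothetical supply graph: an entry "X": [Y, ...] means each Y depends on X.
supplies = {
    "suez_shipment": ["chip_fab"],
    "chip_fab": ["board_assembler"],
    "board_assembler": ["laptop_plant", "router_plant"],
    "laptop_plant": ["retailer"],
    "router_plant": [],
    "retailer": [],
    "local_steel": ["router_plant"],
}


def downstream_impact(supplies, disrupted):
    """Return every node reachable from the disrupted one, with its hop depth."""
    depth = {disrupted: 0}
    queue = deque([disrupted])
    while queue:
        node = queue.popleft()
        for dependent in supplies.get(node, []):
            if dependent not in depth:
                depth[dependent] = depth[node] + 1
                queue.append(dependent)
    depth.pop(disrupted)  # report only the affected dependents
    return depth


impact = downstream_impact(supplies, "suez_shipment")
print(impact)
```

With two or three levels this is easy to express as a handful of joins; at 20 or 30 hops the traversal form stays the same length while the join form does not.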
So you have key value stores, you have document databases, you have relational databases, you have graph databases. How do I choose the right tool and how does it all work together?
It’s actually pretty simple. You start with the shape of the data and you look at the workloads that you want to run on that data. If that data is very tabular, if it’s a payroll system and you want to record all the individuals and they’re all well structured, all of them have exactly the same schema and you want to calculate average salary and stuff like that. Awesome, relational database, go. Or if you have a bunch of JSON documents sitting around and you don’t really care how they’re connected. Document database, go. Or if you have a data set that is highly complex, that is evolving, where the business requirements change, where the value is in how things fit together, like a shopping cart which is connected to order items. Those order items are connected to products that sit in a product hierarchy, and how things fit together, a graph database is your best fit. That ends up being the first go-to move – look at the shape of the data and then the queries you want to run on that, then that’ll clue you in very rapidly where you should try to evaluate first.
To be a Neo4j user, you require people to use a different language called Cypher and I’m just curious how that compares to SQL, which is really the language that everybody knows for databases. Why is that a different language and how steep is the learning curve, if you know SQL already, to know Cypher?
The big comparison is probably something like the following. SQL is old and boring, Cypher is new and sexy. That’s it. [laughter] No, it’s actually spiritually very similar. It’s a declarative query language, which basically means that you don’t have to write imperative code in a programming language, depending on how technical the audience is. But you can type it in a very simple… you can describe what pattern you’re looking for and you draw it, and some of the people who are older in the audience will recall this, with something called ASCII art. Which is basically you end up drawing, like you draw nodes using parentheses and then with arrows, you describe the little pattern and then you throw that to the graph database and it’s going to find that pattern and return it back to you.
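As a rough illustration of the ASCII-art idea, the Python sketch below holds a small Cypher pattern as a string (parentheses for nodes, arrows for relationships) and then finds the same pattern by hand over an in-memory list of edges. Everything here is hypothetical: the labels, relationship types, sample names and the `match_knows_bought` helper are invented for the example, the query string is not executed against any database, and the hand-written loop ignores node labels for brevity.

```python
# The ASCII-art idea: parentheses are nodes, arrows are relationships.
# This Cypher text is shown only as a string; it is not executed here.
CYPHER_PATTERN = """
MATCH (a:Person)-[:KNOWS]->(b:Person)-[:BOUGHT]->(p:Product)
RETURN a.name, p.name
"""

# Tiny in-memory stand-in for a graph: (source, relationship type, target) triples.
# All names below are made up for illustration.
edges = [
    ("Matt", "KNOWS", "Emil"),
    ("Emil", "BOUGHT", "Coffee"),
    ("Emil", "KNOWS", "Ana"),
    ("Ana", "BOUGHT", "Red Bull"),
]


def match_knows_bought(edges):
    """Find every (a, p) where a KNOWS b and b BOUGHT p, mirroring the pattern above."""
    results = []
    for a, first_rel, b in edges:
        if first_rel != "KNOWS":
            continue
        for source, second_rel, p in edges:
            if second_rel == "BOUGHT" and source == b:
                results.append((a, p))
    return results


matches = match_knows_bought(edges)
print(matches)
```

The contrast is the declarative point made above: the query string says what shape to look for, while the Python function spells out how to look for it.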
So spiritually very similar to SQL but, the really pretty astounding, one of the biggest things that have happened since 2015 – it’s probably a good thing for us to contrast to what it was like last time we spoke – is that Cypher is the most popular graph database query language, but what we’ve ended up doing is that we went to the SQL Committee, the committee that is standardizing SQL, and we said, “You know what? We don’t want Cypher to be proprietary just to Neo4j”. Yes, today it is one of our key competitive advantages over other graph databases out there, but the entire space is better served if there’s a unified standard query language for all graph databases. Just as a little bit of background here, every single new database paradigm since the mid-90s has gone to the SQL Committee and said they want to standardize their query language.
Object databases tried that in the mid ’90s, the SQL Committee said, “You know what, object databases, you’re just a feature of SQL. So we’re going to incorporate some of your functionality into SQL, but that’s it”. XML databases in the early 2000s, they went to the SQL Committee and said, “You know what? We can just sprinkle some XML syntax into SQL.” Document databases in the early 2010s, mid 2010s actually, went to the SQL committee and said, “We want to standardize how you query document databases.” The SQL Committee, “No, you’re just a dialect of SQL. We’re going to spray some JSON into SQL, it’s not needed.” For the first time ever in the history of databases, the SQL Committee looked at Cypher, looked at graph databases and then said, “You know what? This category is here to last, this is an actual sibling to SQL.” And they created the GQL language which is at this point 98% identical to Cypher, our query language. It’s, again, the first time in 40 years this has happened. I think that’s a pretty stark blessing around the future and the value of graph databases as a category.
A couple of questions from the audience that very much cover where I was going to go next, so let’s use those. First question from Balaji, “There has been a flood of investments in the graph DB space, how does Neo4j differentiate itself and more broadly, is there opportunity for more than one player to exist?”
It’s a great question. A couple of things on that in terms of differentiation. We’re kind of the OG graph database. We’ve been around the longest. If you attend Data Driven New York, you are probably somewhat clueful about data so you’ll know that in many product categories, you kind of want to be the new kid in many ways. For databases, maturity, robustness, stability is actually a key part of the value proposition. The fact that we’ve been around, we were the OG, the one that defined it and so on and so forth is actually a massive advantage because what this means is that we have by far the most robust product, by far the biggest developer community, and by far the biggest reference account base. So most customers by far of all graph databases out there.
We also have this modern, which maybe sounds a little bit weird, this native graph architecture, where a lot of the more recent – as the graph space has become hotter and hotter – the more recent entrants, what they try to do is they try to layer graph functionality on top of their existing core. They don’t take the native approach, which takes forever to build but that’s ultimately the only way to get to the scalability and the performance. So that speaks to the first question. In terms of is there room for more? I absolutely believe so. I think this is an absolutely massive market. Databases is the biggest market in all of enterprise software. It’ll soon be a hundred billion market. I think graph databases can be a significant chunk of that: 20, 30, 40 billion dollars. So obviously there’s room for more than one company.
And one of Balaji’s questions was precisely to your point about the established customer base. If you could share a customer growth profile, like how many customers, how fast are you acquiring, in what space, what industries, what verticals – anything you can share?
We have over a thousand customers in production right now and hundreds of thousands of active developers in our community. Just to give you some quantifiable things. Over 75% of the Fortune 100 are using Neo4j today. All 20 of the biggest banks in North America, all 20 of them are using Neo4j, 7 of the 10 biggest retailers in the world are using Neo4j, 4 of the 5 biggest telcos. So that gives you a little bit of a flavor. 99% of this will be a data thing, because we’re still in the I guess in the pandemic era. But I guess Matt you were just on a plane, right? Anyone who’s ever ordered a flight ticket – 99% of all flight ticket calculations – so which route should I go from point A to point B when I fly from Paris to New York? Is that a direct flight? Do I connect in Heathrow, how do I get there? It’s done with Neo4j. 99% of all airfares.
That’s a crazy stat, that’s amazing.
Then every single room you’ve ever booked in Marriott or any kind of hotel that is owned by Marriott, the Ritz Carlton and all that kind of stuff, all of that is calculated with Neo4j. So very likely you’ve actually used Neo4j if not today at the very least this business week so it gives you a little bit of flavor.
Very cool. Couple of questions from Gaurav. First question is ‘Emil, who is your favorite Indian American board member of all time?’
[laughter] I assume Gaurav is Gaurav Tuli, who was on my board for the longest time and he’s with a firm called F-Prime Capital and he was for sure the MVP of my board, which I’ve been saying both publicly and privately. Any chance, no offense to any particular VCs on this call, but if you have any opportunity to raise money from F-Prime or for that matter FirstMark, I have to add, you should go ahead and do it.
A second question, “Although graph theory as a math concept is not new, you’ve evangelized a new category of graph databases for a long time. That must have been lonely – can you talk about some of the highs and lows of the journey and now that Neo and the category have made it – quote end of quote – can you talk about any secrets to category creation in the data world?”
I’m obviously an engineer by background and training but I’m a student of and a lover of marketing. I think marketing is very, very interesting and category creation happens to be one of the areas that I really love in marketing. One of the reasons that I love category creation is that it’s so counterintuitive. For example, when you start out, we coined the term graph databases, right, back in the day. When we did that, we started thinking, “What does success look like, 10 years down the line, what does success look like?” Well, success looks like we have a bunch of big companies that are competing against us, that’s what success looks like.
You look at today and you see who’s participating in the graph space? It’s Amazon, it’s Microsoft, it’s Oracle, it’s SAP, it’s like the entire axis of evil enterprise software companies are in the space. Along with a cohort, I mean one of the previous questions alluded to, around a cohort of younger startups. That’s what success looks like when you do category creation. You have a thriving category because if not, then you’re probably not doing something that is valuable enough. That’s kind of one of the things that in the early days you’re just talking to everyone and you’re evangelizing and all of us, like you, every single person that you talk to that know graph databases and understand the value of them, you either talk to them directly or like one hop away. Then all of a sudden there’s a tipping point where like, “Wait, I have no idea how this person heard about graph databases.” So it’s starting to truly resonate in the market and so I think that that was the huge tipping point for us and part of that is honestly getting a bunch of competitors in the space which is a net positive thing for us as the leaders.
To get there was persistence, lots of talks, lots of content creation?
There’s a ton of that and then a deep focus on practitioners. We go to market by winning the hearts and minds of developers. And yes, we love to monetize the companies where they work. But we’re open source, we give it away for free, we have a free tier in our cloud service, AuraDB. We have a free tier of that one, and we win the hearts and minds, and then they wake up and they realize that they work at one of those top 20 biggest banks in North America and they have a problem and they have a bunch of connected data and they realize “You know what, a graph database would be a great fit for this. I played around with it over the weekend or in evenings and whatnot and this would be a great fit for it.” And that’s when we engage commercially.
The other piece that we haven’t talked about, like a real high order bit that has changed since we last spoke back in 2015, is that what I just told you is absolutely accurate. The fact that we are so developer centric but today, and this happened just in the last 12 to 18, to maybe 20, at most 24 months, data scientists are an equally as big of a persona for us as the developer. So if you look at kind of our top line metrics around awareness or visits to neo4j.com or leads or engagement, or whichever way you want to slice and dice it. Data scientists are as prevalent today as developers because it turns out that the initial value prop for developers to build applications on connected data, is as true as it ever was and it’s a massively growing thing and so on and so forth.
But data scientists, they’re increasingly realizing that if I can extract how things are connected and use that as a signal, the relationships between data points, as a signal into my machine learning, all of a sudden I can increase my level of predictiveness. That didn’t used to… Google moved there five, seven years ago and they spoke publicly about it – graph based machine learning. It’s kind of true – where Google was 10 years ago is where the rest of the enterprise is today, and Neo4j is by far the best engine for that.
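One way to picture “relationships as a signal into machine learning” is to compute simple graph features per record and append them to the usual tabular features before training a model. The Python sketch below does that for two such features, degree and triangle count. It is a hedged, self-contained illustration: the customer data, the links, and the `graph_features` helper are invented, and it does not use or depict Neo4j’s Graph Data Science library.

```python
# Hypothetical customer records with ordinary tabular features.
customers = {
    "c1": {"age": 34, "orders": 5},
    "c2": {"age": 51, "orders": 2},
    "c3": {"age": 27, "orders": 9},
    "c4": {"age": 45, "orders": 1},
}
# Undirected links between customers (for example, referrals); purely illustrative.
links = [("c1", "c2"), ("c1", "c3"), ("c2", "c3"), ("c3", "c4")]


def graph_features(customers, links):
    """Add degree and triangle count to each customer's feature row."""
    neighbors = {name: set() for name in customers}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    rows = {}
    for name, base in customers.items():
        nbrs = neighbors[name]
        # Each triangle through `name` is a linked pair among its neighbors;
        # every such pair is counted twice in the sum below, hence the // 2.
        triangles = sum(len(neighbors[n] & nbrs) for n in nbrs) // 2
        rows[name] = {**base, "degree": len(nbrs), "triangles": triangles}
    return rows


features = graph_features(customers, links)
for name, row in features.items():
    print(name, row)
```

The resulting rows can be fed to any ordinary model; the extra columns carry information about how each record is connected that the original columns do not contain.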
Balaji was asking if you were leveraging graph neural networks.
Awesome. Yeah, that’s fun. That’s exactly what I’m talking about here and this is an area where Neo4j is very unique amongst databases. You mentioned the site DB-Engines, DB-Engines today tracks over 350 databases which is kind of crazy. When I grew up as a developer in the mid ’90s, there were like four or five databases to choose from and they were all the same, they were all relational databases. Now there’s one with 350. There’s also, I think, a great landscape thing that some guy’s posting every year. That’s a great way to make sense of that. I don’t know if you’ve heard of that, Matt.
Yeah, I don’t know why one would do that. [laughter]
That sounds like a crazy thing to keep track of. This is a pretty powerful thing – out of those 350 databases, developers use them and get value from them – data scientists, they don’t want to use a database. The only reason a data scientist goes to a database is to get data out of it. They go to the database, not for value, but to get the data out of it and put it in their normal machine learning tool chain. With exactly one exception out of the 350 one exception, Neo4j. They go to Neo4j to put data into Neo4j to be able to use relationships as a signal into their machine learning. So we built out an entire new stack called GDS, Graph Data Science, that is built on top of the graph database that is targeting machine learning and AI, driven by data science.
This is an entire new motion and persona for us and it’s a very unique thing if you think about us fast forward a couple of years, public company, we have a deep developer adoption, an OLTP system of record for these core use cases in the enterprise, as well as being this essential must have ingredient for any machine learning pipeline out there. In a deep developer community and data science community, that’s a really powerful combination in one company.
That’s a good place to be. Let’s finish the conversation with go-to-market motion. A lot of companies that we speak with, a lot of people want to do that open source sort of bottoms up effort and in many ways it feels like you’re wandering through the desert for a long time because you talk to individual developers that may or may not want, or may or may not have any budget to buy your product. At what point did you switch to targeting the larger enterprises? At what point did you get a sense that this was working and what did you do? Did you build a sales force to go after the larger enterprises? At what point do you go from bottoms up to tops down, if ever?
I was going to say “if ever”. On some level we had a bifurcated approach, where we built the community, and that is the long term focus and the right thing to do and so on and so forth, but then we also went out and had hand-to-hand combat with enterprise sales. And we tried to identify these core use cases where people have a lot of connected data today, not where we’ll have connected data five years from now because everything is becoming connected, but today, which are really valuable inside of the enterprise, where they’re willing to pay hundreds of thousands of dollars. Then we tried to identify them, we knocked on doors through our own personal network, or our graph, as we like to call it, and sold into that.
That’s much more to kind of seed the community, to get some of those anchor lighthouse accounts. We had a bifurcated approach like this in the early days. About five years ago, probably around the time we were at Data Driven New York, at that point we had shifted so over 85% of our ARR back then, and still true today, originates with an individual practitioner. Used to be an individual developer, now it’s an individual developer or a data scientist who found us through one of the free SKUs, be it the on-prem Community Edition or the free tier in the cloud, played around with it and then over time realized, “Oh, I want to put this in production.” Then there’s like an entire monetization funnel and a path for them, a PLG path for them in the cloud, and then all kinds of monetization triggers to shift mode to the Enterprise Edition on the on-prem.
That’s all like a bottom up motion and then we have some air cover. We don’t sell top down ever, we don’t go in and knock on a CIO door and sell top down. We do provide air cover there through GSI, through some of the Gartner quotes. There’s an endless list these days of massive validation for the category as a really deeply strategic investment for any Fortune 500 company. That really helps but the bottom up way of going to market is still the fundamental way that we take it to market.
One last question since we’re over time, but this is fun. A question from Tony, “Has the cloud changed your addressable customer base compared to the on-prem days?”.
Oh, totally right. If you think a little bit about what we did in the early days we broadcasted the value proposition of graph databases towards developers initially, and then more recently to data scientists. Where? Data scientists and developers everywhere, any geography, any size company, hobbyists, professional, wherever they are. And then, because we had in the on-prem world, because I think that was the question, how’s the cloud changed things? In the on-prem world, we then monetize a very thin slice of that, which is specifically you’re at an enterprise company, global 2000 company, you have a use case that is worth hundreds of thousands of dollars, you have access to that type of budget, you’re in North America and Europe. That’s where we monetize on the on-prem world.
So a very thin slice of this broader awareness that we had created. With the cloud product of course, all of a sudden we have a free tier, we have a really cheap tens of dollars per month type, low end offering, and then the entire kind of spectrum all the way up to million dollar mission critical deals for an enterprise, that is globally available. Now all of a sudden none of those constraints are true. It’s all geographies, it’s all sizes of companies, not just global 2000 but mid-market and small all the way down to individual developers. That’s a massive TAM expansion just on the developer side and then you add data scientists on top of that and that’s a really big slice of the overall data pie.
Well, it’s quarter past midnight, your time, you’re remarkably awake and energetic.
It’s called coffee, my friend.
Well, that seems to be working. This conversation is brought to you by Red Bull and coffee.
This was wonderful, I mean it’s so cool to see the journey over the last few years.
It’s only just begun my friend.
It feels like it. It feels like you are tackling a market that was already super large and that’s in the process of becoming gigantic. If it becomes the cornerstone for machine learning, that’s as big a mega trend as it gets. So fantastic progress. Thanks for coming back and telling us your story and we’ll continue to root for you and maybe by the next Data Driven you’ll come back as a public company CEO, that would be a lot of fun.
Sounds like a plan my friend.