2021 MAD Landscape: The Top 10 Trends

For anyone interested in a quick overview of our long-form 2021 Machine Learning, AI and Data (MAD) Landscape, here are the Cliffs Notes! My co-author John and I did a presentation at our most recent Data Driven NYC, focused on the top 10 trends in this year’s landscape.

As a preview, here they are:

  • Every company is a data company
  • The big unlock: data warehouses and lakehouses
  • Consolidation vs data mesh: the future is hybrid
  • An explosive funding environment
  • A busy year in DataOps
  • It’s time for real time
  • The action moves to the right side of the warehouse
  • The rise of AI generated content
  • From MLOps to ModelOps
  • The continued emergence of a separate Chinese AI stack

Below is the video from the event, and below that, the transcript.

(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)

VIDEO:

TRANSCRIPT (lightly edited for clarity and brevity)

[Matt Turck]

Welcome! John and I are very excited to talk about our annual labor of love. This year we called it the MAD Landscape.

A quick word about us: John and I are venture capitalists. We work at FirstMark, which is an early stage venture firm based in New York that invests in all sorts of early stage companies, like Pinterest or Shopify or Airbnb, Discord, a lot of great companies.

We have a particular affinity for this world of data, machine learning and AI. We have invested in a number of companies that we’re going to mention today and that also appear in the landscape, some of which are on this slide, including a bunch of unicorn-type companies like Ada, Cockroach Labs, or Dataiku.

This event (Data Driven NYC) is also another labor of love of ours that started as a little meetup in New York almost 10 years ago now, and has grown into the largest community of its kind in the country, with 19,000 members. Plus over the years we’ve been running a lot of events. We really, really, really hope we can start doing in-person events soon. We’ll continue doing a little bit of hybrid and in person, but that would happen soon, in particular so that we can celebrate the 10 year anniversary of the event together.

So stay tuned for that. That should be hopefully sometime soon and will be a lot of fun.

If you sort of randomly ended up on this Zoom and didn’t see the landscape itself and the blog post, this is the link to find it: it’s on my blog at mattturck.com/data2021. We obviously encourage you to go take a look and, if you like it, share it with friends and on social media and all the things. It’s always much appreciated, and it helps accelerate the discussion and the learning for all of us.

This landscape is something we’ve been doing for many years now, starting in 2012. At the time this was the Big Data Landscape: all the cool kids talked about big data, and how Hadoop was going to conquer the world. And so we called it the Big Data Landscape for like five years, and then as ML and AI had their own resurgence and acceleration, it became very clear that AI was the way big data was going to be operationalized into use cases. So this became the Big Data and AI Landscape for one year, but then big data stopped being cool, the cool kids no longer talked about big data, and that became very old school. So in line with the trend, I removed the term big data, and that became the Data and AI Landscape. And this leads us to today, where we came up with the acronym MAD, for Machine Learning, AI and Data, which is the new name of this new data landscape.

We also have a MAD Index of Public Companies and a MAD Index of Emerging Companies as well. The acronym MAD made us laugh, and we decided to use it as it’s very apt for what’s going on in the ecosystem. So going back in time to 2012, this is sort of what it looked like at the time. And the reaction when we put that out, for the handful of people that saw the blog post at the time, was: oh my God, this is crowded, there are so many companies, surely consolidation is around the corner, this cannot go on like this. But lo and behold, nine years later, this is what it looks like.

We tried to shrink the logos as much as we could, but this is still what this looks like. And, for anybody that read the blog post, you may have noticed that this is actually just a subsection of the entire list. We have a separate Airtable list of companies in the data, machine learning and AI landscape that, for one reason or another, we did not include in the graphic itself, but that’s searchable. There are a lot of great companies in that list, so we’re talking about literally thousands of companies in this ecosystem. So what we’re doing today is a quick run through what we call the top 10. It’s a little bit of a CliffsNotes for anybody that fell asleep before they managed to read the whole blog post, and sort of everything you need to know about it. We’ll try to keep it simple and quick; hopefully that still conveys the main ideas.

The 2021 TOP 10 TRENDS

Every company is a data company

To succeed in the future in the next few years and in the next few decades, we think that every company is going to need to be not just a software company, but also a data company. It’s the mega trend that propels this entire ecosystem that we’ve been excited about for many years. There’s been a bunch of buzzwords, people used to call it big data, as we mentioned. And then people went crazy about AI. When I say people that includes us very much.

And in 2020, 21, 22, people talked about automation a lot more, but ultimately it’s all the same thing. It’s all the same trend, which is basically turning companies from analog companies that run on sending Excel spreadsheets around with a lot of manual processes, into companies that are intelligent and automated and run on data, which effectively means that companies get all sorts of inputs from customers, from the supply chain, from the market, and are able to adapt increasingly in real time, and optimize for the best next action and proactive reaction that enables it to increase revenue and be the best company that it can be.

The big unlock: data warehouses and lakehouses

So that’s our number two trend, and it’s sort of fascinating for anybody that has been following the ecosystem for a while. People talked about Hadoop with bated breath many years ago, and there was this whole thing about how we could store and process big data, and that was going to be the solution. And then Spark came along, and that was like another whole thing that took several years. But it’s actually only very recently, with the rise of cloud data warehouses and their lakehouse cousins, that it feels like we can finally store and process big data at scale, in a way that is easy enough to do, that doesn’t require an army of super technical people, and that is sort of affordable, in particular because these types of data warehouses have a pay-as-you-go type of model.

With the Snowflakes of the world and Databricks, and then the hyperscale cloud players, it feels like we can finally store and process big data, which has unleashed lots of different things. First of all, you can become a data driven company much more easily because anybody can have their Snowflake instance. We at FirstMark, which is ultimately a 20 person venture capital firm, have our own Snowflake instance, for example. And so we’ve become a data driven company. And once you have that in place, that unlocks a lot of things around higher level projects that you can do around machine learning and AI, so that sort of frees you to do more things. And then the rise of the cloud data warehouses has created a whole ecosystem around them, whether it’s companies that put data into the data warehouses or companies that extract data out of them. We’re going to talk about this, what a lot of people refer to as the modern data stack, which is basically a whole new generation of startups that have come up in the ecosystem.

Consolidation vs data mesh: The future is hybrid

All right, trend number three is the question that is on everybody’s mind, that people keep asking. All right, this is crazy busy, is consolidation coming? Surely this cannot go on forever, there are just too many companies. The market is confusing for everyone. Companies are going to disappear, and companies are going to merge and acquire each other. So, the answer to this from our perspective is that it’s a bit complicated and the jury is out, and there’s no real answer one way or the other. So this could be its whole own discussion. The quick summary is that we certainly see a trend towards centralization and big platforms. So certainly the Snowflakes of the world and the Databricks of the world are doing a lot of things and have a lot of capabilities. And there’s a whole ecosystem of companies that are centering around them, even in different corners of the market.

We are investors, for example, in a company called Dataiku, which is an enterprise AI platform that effectively bundles a lot of the functionality that you need to be able to deploy enterprise AI, whether that’s data prep or ModelOps or transparency, fairness, analysis, governance, all the things. So there’s certainly a trend towards this. We’re seeing that the primary consumers of those centralized platforms tend to be larger companies that don’t have access, or as much access, to technical talent and machine learning talent as some of the smaller tech native startups or potentially larger tech companies. So that’s happening in one corner of the market. In another corner of the market, there is a trend towards decentralization. The conversation around the data mesh, which we just had with Zhamak, was a perfect illustration of this.

And we’re seeing plenty of smaller, younger, very nimble startups building point solutions in lots of different corners of the market. And there is certainly demand for all these point solutions from the sort of people that are looking for best of breed and trying to stitch together those solutions, whether they are startups or open source projects, so there’s a lot of activity there. Now, with all the venture capital money that has been raised, there are tons of companies out there that are looking for acquisition opportunities, for sure. So there’s tremendous pent up demand for acquisitions, from all sorts of companies. At the same time, as far as we can see, given the market, there’s not as much supply of companies that actually want to be snapped up and acquired. There’s certainly a bunch happening, but not as much as you’d think.

So there is this tension between lots of money, lots of companies trying to acquire, and then lots of companies starting, but not yet quite ready to get acquired. And sometimes the deal happens, sometimes it doesn’t. So the future that we see is this: as much as the Databricks and the Snowflakes of the world would want to be the one platform where all things data and all things AI happen, we actually, from our vantage point, see much more of a polyglot persistence type of situation, where you have lots of different vendors, lots of different companies and lots of different ways of configuring data stacks. That’s what we see, at least for the immediate future.

Explosive funding environment

[John Wu] Well, so moving on, as we all know, the last year has been pretty crazy in terms of funding and funding pacing, creating a very frothy venture ecosystem. So just to put some figures behind that: in the first half of 2021 alone, there were 42 newly-minted AI unicorns, versus just 11 in all of 2020. On the round side, in terms of the amount of funding, it’s about the same: $36 billion was invested in AI and ML startups in the first half of 2021, versus $38 billion for all of last year combined. So almost twice as fast in terms of pacing. If you look at the bottom, you see monster rounds to companies like Databricks, which raised $1.6 billion at a $38 billion valuation. In addition, Celonis raised $1 billion at $11 billion.

On the acquisition side, there was also major M&A activity going on, with Nuance in the speech and text recognition world going to Microsoft for $20 billion, Blue Yonder in supply chain logistics going to Panasonic for $8.5 billion, and Segment going to Twilio for $2.3 billion. The explosion of funding and of companies at scale, particularly on the infrastructure side, is largely driven by the shifts Matt mentioned before.
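As a quick check on the "almost twice as fast" pacing remark, the figures quoted above can be annualized with a few lines of Python. This is only a back-of-the-envelope sketch: the helper name `annualized_ratio` is made up for illustration, and it assumes, naively, that the second half of 2021 matches the first half.

```python
# Figures quoted in the talk: funding in USD billions, and unicorn counts.
h1_2021_funding = 36.0
fy_2020_funding = 38.0
h1_2021_unicorns = 42
fy_2020_unicorns = 11


def annualized_ratio(half_year_value: float, prior_full_year: float) -> float:
    """Ratio of a naively annualized half-year figure to the prior full year.

    Assumption: the second half of the year looks like the first half.
    """
    return (half_year_value * 2) / prior_full_year


funding_pace = annualized_ratio(h1_2021_funding, fy_2020_funding)
unicorn_pace = annualized_ratio(h1_2021_unicorns, fy_2020_unicorns)

print(f"Funding pace vs 2020: {funding_pace:.2f}x")
print(f"Unicorn pace vs 2020: {unicorn_pace:.2f}x")
```

On these numbers the funding pace comes out a little under 2x, which matches the remark, while new unicorns are being minted at a much faster relative rate.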

A busy year in DataOps

Moving on to our fifth trend. So, in this case, we’re going to talk a little bit more about DataOps, similar to the rise of MLOps, which we talked about last year. This year saw the rise of DataOps and the tooling in the DataOps ecosystem, that seeks to bring best practices from more established DevOps and software engineering practices into the data world. Particularly from our vantage point, we saw an explosion of point solutions, and tooling around several key areas. The first of which is data observability, which a lot of people have designated as Datadog for the data stack. Here it’s automated monitoring of the data pipeline, alerting of things that go wrong and things like that. And within that, there are two sub-components.

The first of which is data lineage. Data lineage is tooling that reconstructs the journey of how a particular set of data arrives at the end point, along with all the changes that were made up to that point. This is useful for monitoring and auditing, in terms of understanding what goes on, and also for identifying and fixing any issues that might happen within the data pipeline. The second piece of observability revolves around data quality, or monitoring of the data at certain points in the journey, for example when the data hits the warehouse. This can come in a bunch of different formats, including assertions (for example, expect a column in a table not to be null), which is what companies like Superconductive with Great Expectations are doing. Or it can be done in an automated fashion, leveraging machine learning to monitor and validate the data, which is done by vendors like Anomalo.
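To make the assertion style of data quality check concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations library or any vendor API; the function name `expect_column_values_not_null`, the result shape and the sample rows are all assumptions made up for illustration.

```python
from typing import Any

Row = dict[str, Any]


def expect_column_values_not_null(rows: list[Row], column: str) -> dict[str, Any]:
    """Hypothetical assertion: every row must carry a non-null value in `column`."""
    failing = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "check": f"{column} is not null",
        "success": not failing,
        "failing_row_indexes": failing,
    }


# Made-up rows standing in for a table that has just landed in the warehouse.
users = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": None},
    {"user_id": 3, "email": "c@example.com"},
]

email_result = expect_column_values_not_null(users, "email")
id_result = expect_column_values_not_null(users, "user_id")

print(email_result)
print(id_result)
```

In a pipeline, a failed check like the one on `email` would typically raise an alert or block downstream jobs rather than just print a result.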

On the data access and governance side, we’ve seen a couple of pretty large companies that were built historically, companies like Alation or Collibra, but we’re seeing new challengers enter the space as well, offering next generation tooling in things like data catalogs, better process-oriented tooling, better data access tools and things of that nature. So DataOps as a whole, I think, is starting to mature much more in terms of the process and the culture, as well as the tooling across the board. So we’re excited to have seen this mature over the last year or so.

It’s time for real time 

[Matt] Moving on to slide six. Real time. Real time is really interesting because it’s a trend that people have been waiting for, for a very long time. For the last few years, every year was going to be the year where real time would finally become a major trend. And it seems that we are here now. So Confluent, which is the company behind Kafka, which is the real time sort of message bus that moves data around in real time, had a wonderful IPO. I just checked today and they are at a $17 billion market cap. So they certainly proved wrong any naysayer around the demand for real time data software, and clearly showed that there’s strong market demand for it.

And then in different corners of the market, we are seeing a lot of stuff happening. So real time analytics and real time databases certainly had a big year. Imply raised money, and ClickHouse raised money as well – ClickHouse being a very popular open source project that was started at Yandex, and just got spun out as a commercial for-profit company. We had Materialize speak at these events in the past as well, so lots of activity around real time analytics. And then the rest of the ecosystem is also adapting to the real time world. In particular, there is a swell of activity around real time data pipelines, with companies like Meroxa or Estuary, of which we are very proud investors, that are trying to unify the paradigms between batch and real time.

The right side of the warehouse: Metrics stores, Reverse ETL 

[John] So we all know that the data warehouse has been coming around for a couple of years with Snowflake and the IPO and the rapid adoption, across everybody from small companies all the way through the enterprise. That’s been pretty clear cut. The thing that we find really interesting in the last year, is the proliferation of tooling on the right side of the data warehouse, or what happens after data hits the warehouse and how do you deal with it, how do you use it and so forth. In particular, there are two examples here, the first one of which is the metric store. So to explain how the metric store works, I’m going to talk through a quick example.

So let’s take a user table that we have. If Matt and I are both looking to figure out DAUs, or daily active users, over a certain period of time, historically we’d probably look in our code base, and we’d look at snippets of SQL that were shared amongst different members of the team. We’d manipulate them a little bit to get the window that we want to work with, and maybe the segment of the DAUs that we want to look at, either by geography or by categorization, by product users, or something like that, and then we’d run the query. What happens here is that there’s going to be some amount of adjustment that we’re each doing on our own machines. That means that, between the time periods and the characteristics, we might have queries that look pretty similar from the top level but have small adjustments, so that ultimately the data sets that we’re using and interpreting are not exactly the same. So this is where the metric store comes in.

The metric store sits on top of the data warehouse and is a unification layer that allows companies to define their key business metrics, and also to control the different dimensions of the metrics that can be served to end users in a self-service manner. So if we think about the DAU example, DAUs will be defined as a verified, validated metric. Anytime the backend changes, whether it’s the schema, the pipeline or so forth, the people who own the metric will go to the metric store, validate it, and make sure that it’s still up to date. Therefore anybody that is consuming that metric is going to be consuming the same definition and the same data set on the other end, whether it’s through a BI tool, through a query or anything else.
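The idea can be sketched in a few lines of Python: one owned definition of DAU, with only approved dimensions exposed, from which every consumer’s SQL is rendered. This is a toy illustration, not how any particular metric store works; the `Metric` class, the `build_query` helper and the `events` schema are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str                     # SQL aggregate that defines the metric
    table: str
    date_column: str
    allowed_dimensions: tuple[str, ...]


# The single, owner-validated definition of DAU (hypothetical schema).
DAU = Metric(
    name="dau",
    expression="COUNT(DISTINCT user_id)",
    table="events",
    date_column="event_date",
    allowed_dimensions=("country",),
)


def build_query(metric: Metric, dimensions: tuple[str, ...] = ()) -> str:
    """Render SQL for a metric; only approved dimensions are accepted."""
    for dim in dimensions:
        if dim not in metric.allowed_dimensions:
            raise ValueError(f"{dim!r} is not an approved dimension for {metric.name}")
    cols = ", ".join((metric.date_column, *dimensions))
    return (
        f"SELECT {cols}, {metric.expression} AS {metric.name} "
        f"FROM {metric.table} "
        f"WHERE {metric.date_column} BETWEEN ? AND ? "
        f"GROUP BY {cols} ORDER BY {cols}"
    )


# In-memory SQLite stands in for the warehouse; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2021-10-01", "US"),
        (1, "2021-10-01", "US"),  # same user, same day: counted once
        (2, "2021-10-01", "FR"),
        (2, "2021-10-02", "FR"),
    ],
)

window = ("2021-10-01", "2021-10-02")
daily = conn.execute(build_query(DAU), window).fetchall()
by_country = conn.execute(build_query(DAU, ("country",)), window).fetchall()

print(daily)
print(by_country)
```

Because both callers go through `build_query`, two analysts asking for DAU over the same window get the same definition, and a request for an unapproved dimension fails loudly instead of silently producing a slightly different number.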

In the last year, we saw a lot of early stage funding going into metric stores, including Trace within our portfolio, amongst others. And we’re excited to see this transition from being a staple only at big companies, like the Airbnbs of the world with Minerva, to being broadly available to anybody who’s using a data warehouse.

The second piece that we’re excited about here is the reverse ETL. So historically on the warehouse side of the world, the warehouse was seen as a place to centralize operational data, whether it’s user data or anything around that, from upstream business applications, Salesforce, Zendesk, Gainsight and so forth. Data teams historically would conduct analysis here on that unified data, and pass insights onto the relevant operational teams. Reverse ETLs change this paradigm by closing the loop. So these tools enable companies to pipe the transformed unified data from the warehouse back into the upstream business systems from which the data was generated.

What this means is that now Salesforce can hold customer ticketing data from Zendesk, billing information from Stripe, customer interaction data from Gainsight and so forth. And then business users can work on this unified scope of data and trigger workflows based on a holistic view of everything that’s happening with the customer across all touch points. This allows for a much broader shared view of customer data for everybody in the org, not just the data user and the data consumer. A lot of these companies have gotten a lot of funding within the last year, and we think this is a space to watch more closely.
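The closing-the-loop pattern described above can be sketched as follows. This is a toy example under stated assumptions: in-memory SQLite stands in for the warehouse, `FakeCRM` is a made-up stand-in for an upstream business tool (not a real Salesforce or vendor API), and the table and field names are invented.

```python
import sqlite3


class FakeCRM:
    """In-memory stand-in for an upstream business tool (hypothetical, not a real API)."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def upsert(self, customer_id: str, fields: dict) -> None:
        self.records.setdefault(customer_id, {}).update(fields)


# "Warehouse" tables already loaded from ticketing and billing systems (made-up data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tickets (customer_id TEXT, status TEXT);
    CREATE TABLE invoices (customer_id TEXT, amount_usd REAL);
    INSERT INTO tickets VALUES ('c1', 'open'), ('c1', 'closed'), ('c2', 'open');
    INSERT INTO invoices VALUES ('c1', 100.0), ('c1', 50.0), ('c2', 20.0);
    """
)

# The transformed, unified view of each customer, computed in the warehouse.
UNIFIED_VIEW = """
    SELECT t.customer_id, t.open_tickets, i.total_billed_usd
    FROM (SELECT customer_id, SUM(status = 'open') AS open_tickets
          FROM tickets GROUP BY customer_id) AS t
    JOIN (SELECT customer_id, SUM(amount_usd) AS total_billed_usd
          FROM invoices GROUP BY customer_id) AS i
      ON i.customer_id = t.customer_id
    ORDER BY t.customer_id
"""


def sync_to_crm(connection: sqlite3.Connection, crm: FakeCRM) -> int:
    """Push the unified rows back into the business tool; returns rows synced."""
    rows = connection.execute(UNIFIED_VIEW).fetchall()
    for customer_id, open_tickets, total_billed in rows:
        crm.upsert(
            customer_id,
            {"open_tickets": open_tickets, "total_billed_usd": total_billed},
        )
    return len(rows)


crm = FakeCRM()
synced = sync_to_crm(conn, crm)
print(synced, crm.records)
```

A real reverse ETL tool would add scheduling, incremental syncs and API rate limiting on top of this, but the direction of data flow is the same: out of the warehouse and back into the operational system.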

The rise of AI generated content 

Going on to the eighth trend, we want to talk a little bit more about AI generated content. So, whether it’s synthetic video, or text, or anything in between, we’re starting to see real adoption, with the first batch of real business use cases within synthetic media and synthetic content. There were a couple of pretty prominent examples of this within the last year. Luke Skywalker in Disney’s The Mandalorian was Mark Hamill de-aged by about 25 to 30 years or so. Val Kilmer, who lost his voice after a battle with cancer, had his voice cloned to give him a platform to speak from, with a synthetic voice. Synthesia, a portfolio company in which we are very proud investors, recently partnered with Lay’s and Lionel Messi to create user-generated messages, which are customizable by name, by language and by a bunch of different things.

GitHub Copilot is still in very early stages, but it took a bunch of code, leveraged GPT-3, and created this automatic pair programming partner. The efficacy there can be debated, and the ethical concerns and so forth are open questions, but it is a very early example of using NLP to generate code at scale. On the model side, we saw a bunch of early NLP models, which were released within the last couple of years, start to see commercial adoption. So OpenAI and GPT-3 aside, with GitHub Copilot and a couple of Microsoft products embedding it as well, the 10X larger, 1.75 trillion parameter WuDao 2.0, which was launched by the Beijing Academy of Artificial Intelligence, is a multimodal language and image model that is very much Chinese based, but is arguably the largest AI model in existence today.

And then on the European side, Aleph Alpha, which raised just over $33 million, is out to build its own set of large operationalized AI systems, but with a European centric bent: conforming to EU privacy laws, focusing on EU languages and so forth. It’s very exciting to see the commercialization of the last couple of years of research start to come to fruition there. On the startup ecosystem side, we’re seeing good momentum within synthetic media, particularly within voice and video, with Synthesia, mentioned earlier, in which we are proud investors.

ML Ops → ModelOps

The ninth trend here is the maturation of ML Ops and the appearance and early adoption of ModelOps. So ML Ops, which we wrote about in last year’s landscape, is all about applying DevOps best practices to machine learning, whether that’s standardizing processes, standardizing the infrastructure on which model training, deployment and serving run, or putting more rigor in place around testing and deployment.

ML Ops is something that has been around for a couple of years and a couple of cycles, but the tooling around it is starting to mature outside of just big tech-centric early adopters. And what we’re seeing this year is the early emergence of ModelOps, which is a superset of ML Ops. So rather than focusing on just machine learning models, ModelOps is looking to operationalize all AI models. It looks to centralize all the models in one location and to standardize end to end processes around those models in the same way that ML Ops does for machine learning models. What this does is help provide comprehensive governance and auditing capabilities for the models. And it brings in better explainability, not just for data scientists, machine learning engineers and that part of the org, but also for business users: why are certain models producing the things they are, what’s causing shifts in models and so forth, bringing that into the business user’s scope of the world.

Within ModelOps and MLOps, tooling is still in pretty early days in terms of commercialization, but from our perspective, we’re seeing a lot more point solutions pop up around different parts of the machine learning and broader AI model training process, whether it’s around model monitoring, model deployment, feature store and things of that nature. Definitely seeing a lot more proliferation and early adoption there. Moving on to the 10th and final trend.

The continued emergence of a separate Chinese AI stack

We want to shift to something a little bit different from the tools and infrastructure that we talked about before, given that within AI and ML, there is an important geopolitical context to be aware of as well. We talked a little bit about this with the data privacy and regulation that’s starting to show up in Europe, but as you may all know, there is somewhat of an arms race going on between China and the US within the AI world. What we’re seeing in practice here is that China has continued to mature as a global AI powerhouse, arguably with a large market advantage as the world’s largest producer of data. But that’s largely been contained within China until pretty recently. In late 2019 and 2020, we started to see the first major proliferation of Chinese consumer tech into the Western sphere. And, as you all know, TikTok is that piece, with an aggressive marketing push followed by rapid adoption, driven by what is arguably one of the best AI recommendation algorithms ever created.

On the infrastructure side, we’re starting to see some push as well. Historically with infrastructure in China, there’s been a little bit of a barrier to entry to US adoption due to geopolitical concerns, and also due to language barriers for Chinese engineers in terms of working on Western projects. What that’s meant is that there was a light layer of homegrown infrastructure as of a couple of years ago, but that has really escalated with the government push, coupled with a number of escalations including the Huawei spat in 2018, towards building fully internalized infrastructure from the ground up, from hardware all the way through to the software layer. This broader movement is referred to in China as 国产化替代 (guóchǎn huà tìdài), or localization more broadly.

And for a long time, this whole industry was pretty nascent, and only in the last year or so are we starting to really see momentum come through. One example of that is in the 信创 (xìn chuàng) industry, which was originally seen as a little bit of a joke and hadn’t really matured for a long time. But in the past year it started to see launches by notable Chinese cloud companies like Huayun (华云) and EasyStack in particular, along with CEC and a couple of others. This is really supported by a government driven approach as well as a strong bent of nationalist pride. So the government has continued to pour money into research, both in terms of institutions and in terms of commercial support, notably with a couple of AI research parks: the $2.1 billion park in Beijing, and also parks in Shanghai and a couple of other places.

This long term support and research has started to pay off with large scale model releases coming through. So PanGu-α, which was released earlier last year, was overshadowed by WuDao 2.0, as mentioned earlier, which is the largest multimodal model released to date at 1.75 trillion parameters, which is 10X the scale of GPT-3. So, that sums up our broad trends. Happy to take any questions. So let us know if there’s anything top of mind or any questions that we can answer.

[Matt] I don’t know what impressed me most, John, whether that’s your Chinese pronunciation or your ability to say multi model, multimodal model.

[John] As a semi native Chinese speaker, probably the latter.

[Matt] Very, very impressive. I need to find a French word to say before this is over. All right. So we have a bunch of questions in the chat. Do you want to go maybe to like, John, to the slide with the trends, like maybe that’ll be a good background while we take some of those, some of those questions. So there are longer ones and smaller ones. I’ll start with a longer one from Samir.

Could you please explain how the data has changed today compared to the old big data of years ago: new attributes and characteristics, or is it our view on data that has changed? So it’s an interesting question. I don’t think the data itself has changed much. I think people used to geek out about the three V’s: Variety, Velocity, and something else that basically conveyed the idea that the data was all over the place and hard to harness. There was this dichotomy in the data world around structured data and unstructured data, right? And that sort of led to the whole Snowflake versus Databricks kind of world. Snowflake, being a cloud data warehouse, effectively was like a big database, for lack of a better analogy, that you could query using standard SQL, but that was very focused on structured data, right? So there was this world of cloud data warehouses and structured data leading to a type of analysis known as Business Intelligence, which is looking at existing data and prior history and trying to make sense of it. So that was one world.

And then separately, there was the world of the data lakes, which is more where companies like Databricks are coming from, which is all about unstructured data, right? It is one big repository where you dump data, whether that’s audio files or video or anything that’s sort of unstructured. And those data lakes were used historically to train machine learning models, so for predictive analytics. So you had on one side the world of structured data, data warehouses and SQL. And then on the other side, you had the world of unstructured data and data lakes. The story over the last few years has been the great convergence of those two stacks. In particular, Databricks has worked on making their data lake more structured and look more like a data warehouse. And Snowflake is still in the early days of taking their structured data store and turning it into something that can work very well with unstructured data. So ultimately Snowflake is becoming a bit more like Databricks, and Databricks is becoming a little more like Snowflake. Ultimately the idea is to be the repository that enables people to work with data regardless of what it is, both for retroactive analysis, which is Business Intelligence, and for predictive analytics, which is the world of data science and machine learning. So that was a little bit of a mouthful in answer to the question, but it’s a really important trend in the landscape and something we discuss in the writeup.

So on to another question: it would be great if you had a list of the OSS projects that the companies in the 2021 ML/AI data landscape are centered around. So we actually have on the landscape, if you look towards the bottom, this whole open source section – open source is absolutely fundamental to software in general, but certainly to this world. And I think we have a bunch of open source projects also listed in the spreadsheet, which you can find linked somewhere in the writeup. Absolutely agree, open source is super important.

Let’s see, a question from Philip: do higher than normal valuations for a lot of those companies accelerate consolidation, particularly when funded primarily with stock? It’s a really interesting question and it sort of cuts both ways, right? Higher valuations, in a way, make it easier to acquire a company when you’re the acquirer, because your stock is worth more money, but equally they make it harder because the acquired company tends to be very expensive as well. So you hope that some kind of market equilibrium happens, but certainly if that happens, you’ll find a bunch of companies getting acquired for a lot more money than you would think, to reflect the valuation. We’ve seen anecdotally a number of acquisitions of seed stage companies for somewhere between 50 and a hundred million, for companies that were very early on, but because everything is more expensive, that’s where the market equilibrium seems to be.

And then at the other end of the spectrum, there’s the question of whether some companies at some point become just too expensive to acquire. And that’s certainly the case for Databricks. For the last few years there have been rumors, or people trying to hypothesize, about whether Databricks will be acquired by Microsoft. They’ve historically been a very close partner to Microsoft. And at this stage, the question is whether even Microsoft could afford to acquire Databricks, considering Databricks’ latest valuation was $38 billion, right? So that would mean, and I have no idea, I’m just guessing, that it would have to be north of a hundred billion dollar acquisition price. So can anybody afford that high a price? So there’s a very interesting dynamic around valuations and their impact on acquisitions, and all of us in the startup world are trying to figure it out as we go.

[John] Question from Kyong here: “I saw a few references to AI/ML evaluation and validation on the slides, for example, Arthur. Where do you see that space going? Will it become part of other areas, e.g. AI/ML development platforms, or will it be a standalone offering?” From my perspective, monitoring by nature tends to be end to end, whether it’s model monitoring or data monitoring or anything related to that. There are issues that can arise and cause models to shift, whether it’s on the performance side, the data, the shape of the data, the inputs, or anything else related to that. Those things happen across the data life cycle.

So by nature, I think if you’re going to build a monitoring product, you need to be comprehensive in terms of coverage across the whole life cycle, whether it’s the data life cycle or the machine learning life cycle as a whole. Inevitably that piece is going to be a platform to some extent. I think point solutions that only cover, say, post-deployment, or only training, or whatever it is, will find it hard to compete with more comprehensive platforms in that piece.

But I think it’s a broader question for AI/ML platforms as a whole. There are a lot of point solutions that have come up within the last, call it, year and a half to two years. There’s going to be some consolidation as they scale, and different solutions are going to try to get into adjacent parts of the model training and model deployment process and so forth. I think we’ll see much more fully fledged solutions in a couple of years compared to what we have in the market today.

[Matt] All right. Maybe one or two more, and then we’ll wrap. Question from Saket: “What is your research mechanism for identifying the companies here? How up to date is it?” In terms of how up to date it is, it’s one of those things where the ecosystem is so vibrant that the second we publish this, it’s out of date, right? There’s literally new stuff happening every day: new financings, new companies, new projects being released and so on. So it’s reasonably up to date, but it’s not up to date to the minute.

We have this project that we’ll do at some point where we’ll have much more of a live, clickable website, which is something we’ve been meaning to do for years. At some point we’ll do it, and that will be meant to be up to date. In terms of the research mechanism, it’s a combination of different things. By virtue of being VCs and very passionate about this world, we spend tons of time in it, so we come across a lot of different companies. We certainly do a lot of research in all the existing databases. Some of them are more venture-centric databases, and some of them are more about open source or software projects and so on. So there’s a lot of pounding the pavement and elbow grease involved, which by the way is part of the reason why we do this. It is great to put it out there and share it with people, but the process of doing it is extremely helpful for ourselves because it forces us to learn.

But I’ll just close on this by mentioning that it is meant to be an opinionated document. It’s not a hundred percent rigorous, in that this year in particular we made the editorial decision to include a lot of early stage companies. And that’s because of what I mentioned earlier in the remarks, which is that there is a whole generation of companies appearing around the modern data stack and other parts of the ecosystem. A lot of those companies are seed or Series A companies, in VC parlance, and historically we would not have included them because we tended to prioritize companies that were a little further down the path and more mature. But since so much of the action is happening at the early stage these days, we made the editorial decision to include them in the landscape.

[John] Here’s an interesting one from Pianpian: “How do you define AI startups? What do you think are the differences between startups that use AI and data versus a startup that focuses on infrastructure for other companies to use AI and data?” This is actually a question that Matt and I debated at length this year, especially given the number of companies involved here. If you look at the landscape, the left side is the infrastructure companies, and the right side is the companies that are using ML/AI in an applied form.

For us, for a company to be included on here, it either has to be a company that provides infrastructure for other companies to use, which is a little more clear cut, or it has to be a company where AI or ML is core to the product offering, in the sense that the product could not exist without that AI/ML. So whether that’s something like a Gong, which uses conversational intelligence to draw insights from conversations, or a company like Shein on the retail side, which aggregates data at large scale to close the loop quickly between consumer fashion retail and manufacturing, the secret sauce or core value proposition of these companies is around having that ML to lift them high above other alternatives. So in that case, we would classify those as AI/ML companies. If a company is first and foremost an operational company, or an old school retailer or something like that, where AI/ML is not core to its operations, we would classify that as out of scope for the purposes of this landscape.

[Matt] Great. Well, I think that’s a good place to wrap up. I hope that was interesting. That was the first time we did this, as I shared, and we certainly enjoyed doing it. I want to thank our previous speaker, who I think is one of the most interesting people in the data space these days. So thank you, John Mike, for joining us and talking about the data mesh. And on behalf of John, myself, and Katie and Jack at FirstMark, who play an absolutely essential role in putting this event together, I want to thank you for joining us tonight. We’ll have another exciting event next month, and we look forward to it. So thank you, and good night. We’ll see you next time.
