Every year, as part of our MAD project, we do a presentation at Data Driven NYC about the top trends we see across data and ML/AI (here’s the 2022 version for reference).
The presentation, done this year with my FirstMark colleague Kevin Zhang, is a whirlwind tour of top trends, as opposed to anything particularly in-depth, as we tried to keep it short. But hopefully it should provide a good overview of what’s been happening in those spaces, for anyone interested in a recap.
Cockroach Labs, the ambitious database company with a funny name, has gone from strength to strength over the last few years. Started with three ex-Googlers in 2014, it successfully navigated in its early years the perilous waters of being an early database company that customers need to trust for mission-critical applications. Over time, it’s gained tremendous momentum with a now long list of marquee customers, and was most recently valued at $5B.
In part because we at FirstMark are proud investors in the company, we’ve featured Cockroach Labs several times at Data Driven NYC over the years: in 2014 (video), 2018 (video) and 2020 (video), and it’s been really fun to see their tremendous progress.
It was great to host CEO Spencer Kimball once again and check in on the latest, as well as lessons learned building a successful open source enterprise software company.
We covered a bunch of really interesting things, including:
The origins of the company
The evolution of the database market from SQL to NoSQL to NewSQL to cloud
The current opportunity around serverless
Open source license questions
Go to market: community led, bottoms up, top down?
Who’s the perfect first sales hire for an enterprise software company
In addition to his role as co-founder and Chief Analytics Officer of Mode, a leading collaborative data platform, Benn Stancil is a prolific and thought-provoking writer about the broad data space. Over the last couple of years in particular, he’s produced a series of insightful and entertaining posts on his newsletter: https://benn.substack.com/
We had welcomed Benn at Data Driven NYC back in 2019 to talk about Mode (see the video, “The case for hiring more data analysts”), and it was great to have him back for a wide-ranging conversation where he addressed some of the “sacred cows” of the data world.
One of the most interesting conversations on the space we’ve had recently, highly recommended watch!
In a world where everything moves ever faster, it seems inevitable that data infrastructure will need to move sooner or later to a predominantly real-time paradigm. Yet the infrastructure for real-time data is still trailing far behind its batch processing cousin.
Enter Estuary, a real-time data ops platform, in which my firm FirstMark led a large seed round last year. Estuary enables you to synchronize your data products across all your systems (whether databases, SaaS, pub/sub, etc.) in real time, and also to aggregate, join, or otherwise take action on your data while in motion. Estuary is not a database – instead it makes your databases real time. It abstracts away the complexity of building real-time, data-intensive applications at scale.
It was a lot of fun to host Estuary’s co-founder and CTO Johnny Graettinger at Data Driven NYC for an approachable and educational talk about the company, its product and the real-time data world.
In the world of data infrastructure, dbt Labs has undoubtedly been one of the most exciting startups to watch. The company is the creator and maintainer of dbt, a data transformation tool that enables data analysts and engineers to transform, test and document data in the cloud data warehouse. Beyond this, the company is empowering a new generation of data analysts and enabling them to create and disseminate organizational knowledge.
dbt’s CEO, Tristan Handy, is also one of the most thoughtful and interesting CEOs in the space, having played a pivotal role in the emergence of what’s often referred to as the “Modern Data Stack”, a suite of tools and processes that leverage the power of cloud data warehouses to bring data processing to the modern era.
We had the pleasure of hosting Tristan once during the pandemic in 2021 for a great online chat with Jeremiah Lowin, CEO of Prefect. It was a particular treat to welcome back Tristan, this time for our first in-person event since 2020!
As enterprises around the world deploy machine learning and AI in actual production, it’s becoming increasingly critical that AI can be trusted to produce not just accurate, but also fair and ethical results. An interesting market opportunity has opened up to equip enterprises with the tools to address those issues.
At our most recent Data Driven NYC, we had a great chat with Krishna Gade, co-founder and CEO of Fiddler, a platform to “monitor, observe, analyze and explain your machine learning models in production with an overall mission to make AI trustworthy for all enterprises”. Fiddler has raised $45 million in venture capital to date, most recently a $32 million Series B just last year in 2021.
We got a chance to cover some great topics, including:
What does “explainability” mean, in the context of ML/AI? What is “bias detection”?
What are some examples of business impact of “models gone bad”?
A dive into the Fiddler product and how it addresses the above
Where are we in the cycle of actually deploying ML/AI in the enterprise? What’s the actual state of the market?
Below is the video and full transcript. As always, please subscribe to our YouTube channel to be notified when new videos are released, and give your favorite videos a “like”!
In the ever vibrant world of the “Modern Data Stack” (an ecosystem of mostly young tech startups that represent the rising generation of data software vendors, and integrate well with one another), Hex has been getting increasing visibility and momentum. At its core, Hex is a collaborative data platform where teams can explore, analyze, and share. It aims to bring together the best of notebooks, BI & docs into a seamless, collaborative UI.
The company was founded in 2019 and has raised a total of $73.5 million in venture capital to date, including most recently a $52 million Series B.
CEO Barry McCardel joined us at Data Driven NYC for a deep dive into the product, the company, the data space and his journey from doing “unholy things in Excel” as a young consultant to building a great startup.
As more and more companies around the world rely on data for competitive advantage and mission-critical needs, the stakes have increased tremendously, and data infrastructure needs to be utterly reliable.
In the applications world, the need to monitor and maintain infrastructure gave rise to an entire industry, and iconic leaders like Datadog. Who will be the Datadog of the data infrastructure world? A handful of data startups have thrown their hat in the ring, and Monte Carlo is certainly one of the most notable companies in that group.
Monte Carlo presents itself as an end-to-end data observability platform that aims to increase trust in data by eliminating data downtime, so engineers innovate more and fix less. Started in 2019, the company has already raised $101M in venture capital, most recently in a Series C announced in August 2021.
It was a real pleasure to welcome Monte Carlo’s co-founder and CEO, Barr Moses, for a fun and educational conversation about data observability and the data infrastructure world in general.
Our business lives are full of optimization problems – scheduling, time management, resource planning, pricing, routing, risk management, network optimization, financial engineering, etc. Simply defined, optimization is the science of making the best decision possible, given a set of constraints.
Historically, optimization has been the province of PhDs with deep backgrounds in mathematics, using a generation of software that was developed for academia and large defense contractors.
Enter Nextmv (pronounced “Next Move”), a company in which I’m a proud investor. Nextmv is reinventing the space for the cloud era, making optimization and simulation technologies available to every developer.
It was great to welcome Nextmv’s CEO, Carolyn Mooney, at our most recent Data Driven NYC to talk about the space and the company.
We covered:
What is decision intelligence, and how does it differ from business intelligence and data science?
What is the overlap with the area known as “operations research”?
How decision intelligence is a broadly horizontal area
How Nextmv is democratizing decision intelligence with its cloud product
Bonus: Nextmv’s policy of radical transparency on team compensation
The world of data governance is not the most visible part of the data revolution, yet it is of critical importance. As more and more data floats into the enterprise, and its role is ever more mission critical, one needs to be in full control of it – understand where data resides, who can have access to it, which datasets can be trusted or not, etc.
Enter Collibra, a startup that has had a long march towards success, as it was founded in 2008. Collibra has now become an impressive industry leader and raised a $250 million Series G at a post money valuation of $5.25 billion last year.
We had had the chance to host Stan Christiaens, the co-founder and CTO of Collibra, at Data Driven NYC in 2017 (video here), and this time we got a chance to chat with the company’s CEO, Felix Van de Maele.
We had a great conversation, starting with a round of definitions that should be interesting to anyone curious to better understand that side of the data world.
The last couple of years have seen a dramatic acceleration in the adoption of graph databases, a category of databases that stores nodes and relationships instead of tables or documents. That acceleration has clearly benefited Neo4j, which had a banner year in 2021, surpassing $100M in ARR and closing a $325M Series F financing round at a valuation of over $2B, which it calls “the largest funding round in database history”.
That would make Neo4j an overnight success, except for the fact that Neo4j started in 2007, pioneered the space and literally coined the term “graph database”.
Neo4j’s CEO, Emil Eifrem, had spoken at Data Driven NYC back in 2015 (the same night as the CEO of Snowflake and the CEO of Airtable, a pretty stacked line up considering those three startups combined went on to represent many billions of market cap/valuations).
So it was particularly fun to have Emil back at the event and exciting to hear about the major progress the company has experienced over the last few years. Emil spoke from Sweden at around midnight his time, bringing impressive energy despite the late hour and it was a great conversation.
I’ve been interested in the intersection of AI and crypto for a while (see AI & Blockchain: An introduction), and Numerai is one of the most exciting companies I came across in that world. Numerai is a new kind of crowdsourced quant hedge fund, which provides data for free and enables any data scientist around the planet to contribute models they believe will beat the stock market. Numerai offers its own token, called Numeraire, to incentivize participants.
For anyone interested in a quick overview of our long-form 2021 Machine Learning, AI and Data (MAD) Landscape, here are the Cliffs Notes! My co-author John and I did a presentation at our most recent Data Driven NYC, focused on the top 10 trends in this year’s landscape.
As a preview, here they are:
Every company is a data company
The big unlock: data warehouses and lakehouses
Consolidation vs data mesh: the future is hybrid
An explosive funding environment
A busy year in DataOps
It’s time for real time
The action moves to the right side of the warehouse
The rise of AI generated content
From MLOps to ModelOps
The continued emergence of a separate Chinese AI stack
Below is the video from the event, and below that, the transcript.
In the admittedly small world of people who obsess over data technologies, one of the hottest topics of the last year has been the “data mesh”.
Created by Zhamak Dehghani of ThoughtWorks, the concept struck a chord and made the rounds in countless conversations on Twitter and elsewhere.
As I highlighted in the 2021 MAD Landscape, the data mesh concept is both a technological and organizational idea. A standard approach to building data infrastructure and teams so far has been centralization: one big platform, managed by one data team, that serves the needs of business users. This has advantages, but also can create a number of issues (bottlenecks, etc). The general concept of the data mesh is decentralization – create independent data teams that are responsible for their own domain and provide data “as a product” to others within the organization. Conceptually, this is not entirely different from the concept of micro-services that has become familiar in software engineering, but applied to the data domain.
It was a real treat to get to chat with Zhamak at our most recent Data Driven NYC.
Below is the video and below that, the transcript.