In the hyper-frothy environment of 2019-2021, the world of data infrastructure (née Big Data) was one of the hottest areas for both founders and VCs.
It was dizzying and fun at the same time, and perhaps a little weird to see so much market enthusiasm for products and companies that are ultimately very technical in nature.
Regardless, as the market has cooled down, that moment is over. While good companies will continue to be created in any market cycle, and “hot” market segments will continue to pop up, the bar has certainly risen dramatically in terms of the differentiation and quality required for any new data infrastructure startup to earn real interest from potential customers and investors.
Here is our take on some of the key trends in the data infra market in 2023.
The first couple are higher-level and should be interesting to everyone; the others are more in the weeds:
- Brace for impact: bundling and consolidation
- The Modern Data Stack under pressure
- The end of ETL?
- Reverse ETL vs CDP
- Data mesh, products, contracts: dealing with organizational complexity
- Overall: A general trend towards convergence
- Bonus: What impact will AI have on data and analytics?
Brace for impact: bundling and consolidation
If there’s one thing the MAD landscape makes obvious year after year, it’s that the data/AI market is incredibly crowded.
In recent years, the data infrastructure market was very much in “let a thousand flowers bloom” mode.
The Snowflake IPO (the biggest software IPO ever) acted as a catalyst for this entire ecosystem. Founders started literally hundreds of companies, and VCs happily funded them (and again, and again) within a few months. New categories (e.g. reverse ETL, metrics stores, data observability) appeared and became immediately crowded with a number of hopefuls.
On the customer side, discerning buyers of technology, often found in scale ups or public tech companies, were willing to experiment and try the new thing, with little oversight from the CFO office. This resulted in many tools being tried and purchased in parallel.
Now, the music has stopped.
On the customer side, buyers of technology are under increasing budget pressure and CFO control. While data/AI will remain a priority for many even during a recessionary period, they have too many tools as it is, and they’re being asked to do more with less. They also have fewer resources to engineer, customize or stitch together anything. They’re less likely to be experimental, or to work with immature tools and unproven startups. They’re more likely to pick established vendors that offer tightly integrated suites of products, stuff that “just works.”
This leaves the market with too many early stage data infrastructure companies doing too many overlapping things.
In particular, there’s an ocean of “single feature” data infrastructure (or MLOps) startups (perhaps too harsh a term, as they’re just at an early stage) that are going to struggle to meet this new bar. Those companies typically share a profile:
- They are young (1-4 years in existence) and, due to their limited time on earth, their product is still largely a single feature, although every company hopes to grow into a platform.
- They have some good customers, but not resounding product-market fit just yet.
- Their ARR is low, often below $5M.
- They are venture-backed, often having raised at 50x-200x ARR in the last couple of years.
- They compete with a group of other VC-backed startups led by smart founders who are more or less at the same stage.
- They are unprofitable, with a cash runway ranging from 6 months to 3 years.
This class of companies has an uphill battle in front of them – a tremendous amount of growing to do, in a context where buyers are going to be wary and VC cash scarce.
Expect the beginning of a Darwinian period ahead. The best (or luckiest, or best funded) of those companies will find a way to grow, expand from a single feature to a platform (say, from data quality to a full data observability platform), and deepen their customer relationships.
Others will be part of an inevitable wave of consolidation, either as a tuck-in acquisition for a bigger platform, or as a startup-on-startup private combination. Those transactions will be small, and unlikely to produce the kind of returns founders and investors were hoping for. (We are not ruling out the possibility of multi-billion dollar deals in the next 12-18 months, especially in anything that has to do with AI, but those are likely to be few and far between, at least until potential public acquirers see the light at the end of the tunnel in terms of the recessionary market).
Still, small acquisitions and startup mergers will be better than simply going out of business. Bankruptcy, an inevitable part of the startup world, will be much more common than in the last few years, as companies cannot raise their next round or find a home. As many startups are still sitting on the cash they raised in the last year or two, that wave has not even really started yet.
At the top of the market, the larger players have already been in full product expansion mode. It’s been the cloud hyperscalers’ strategy all along to keep adding products to their platforms. Now Snowflake and Databricks, the rivals in a titanic clash to become the default platform for all things data and AI (see the 2021 MAD landscape), are doing the same.
Databricks seems to be on a mission to release a product in just about every box of the MAD landscape. It offers a data lake(house), streaming capabilities, a data catalog (Unity Catalog, now with lineage), a query engine (Photon), a whole series of data engineering tools, a data marketplace, data sharing capabilities, and a data science and enterprise ML platform. This product expansion has been done almost entirely organically, with a very small number of tuck-in acquisitions along the way – Datajoy and Cortex Labs in 2022.
Snowflake has also been releasing features at a rapid pace. It has become more acquisitive as well. It announced three acquisitions in the first couple of months of 2023 already: LeapYear, SnowConvert and Myst AI. And it made its first big acquisition when it picked up Streamlit for $800M.
Confluent, the public company built on top of the open-source streaming project Kafka, is also making interesting moves by expanding to Flink, a very popular stream processing engine. It just acquired Immerok. This was a quick acquisition, as Immerok was founded in May 2022 by a team of Flink committers and PMC members, funded with $17M in October and acquired in January 2023.
Well-funded, unicorn-type startups are also starting to expand aggressively, encroaching on each other’s territories in an attempt to grow into broader platforms.
As an example, transformation leader dbt Labs first announced a product expansion into the adjacent semantic layer area in October 2022. Then, in February 2023, it acquired an emerging player in the space, Transform (dbt’s blog post provides a nice overview of the semantic layer and metrics store concept). To learn more about dbt, see my conversation with Tristan Handy, CEO, dbt Labs at Data Driven NYC.
Some categories in data infrastructure feel particularly ripe for a consolidation of some sort – the MAD landscape provides a good visual aid for this, as potential for consolidation maps pretty closely with the fullest boxes:
“ETL” and “Reverse ETL”: Over the last three or four years, the market has funded a good number of ETL startups (to move data into the warehouse), as well as a separate group of reverse ETL startups (to move data out of the warehouse). It is unclear how many startups the market can sustain in either category. Reverse ETL companies are under pressure from different angles (see below), and it is possible that both categories may end up merging. ETL company Airbyte acquired reverse ETL startup Grouparoo. Several companies like Hevo Data position as end-to-end pipelines, delivering both ETL and reverse ETL (with some transformation too), as does data syncing specialist Segment. Could ETL market leader Fivetran acquire or (less likely) merge with one of its reverse ETL partners like Census or Hightouch?
“Data Quality & Observability”: The market has seen a glut of companies that all want to be the “Datadog of data”. What Datadog does for software (ensure reliability and minimize application downtime), those companies want to do for data – detect, analyze and fix all issues with respect to data pipelines. Those companies come at the problem from different angles – some do data quality (declaratively or through machine learning), others do data lineage, others do data reliability. Data orchestration companies also play in the space. Many of those companies have excellent founders, are backed by premier VCs and have built quality products. However, they are all converging in the same direction, in a context where demand for data observability is still comparatively nascent. To learn more about companies in the space: see this Data Driven NYC talk by Gleb Mezhanskiy, CEO of Datafold or my Data Driven NYC conversation with Barr Moses, CEO, Monte Carlo.
“Data Catalogs”: As data becomes more complex and widespread within the enterprise, there is a need for an organized inventory of all data assets. Enter data catalogs, which ideally also provide search, discovery and data management capabilities. While there is a clear need for the functionality, there are also many players in the category, with smart founders and strong VC backing, and here as well, it is unclear how many the market can sustain. It is also unclear whether data catalogs can be separate entities outside of broader data governance platforms long term. For a glimpse into interesting data catalog companies, see my Data Driven NYC conversation with Mark Grover, CEO of Stemma, and this great Data Driven NYC presentation by Shinji Kim, CEO of Select Star. Also, for a broader overview of Data Governance, see my Data Driven NYC conversation with Felix Van de Maele, CEO, Collibra.
“MLOps”: While MLOps sits in the ML/AI section of the MAD landscape, it is also infrastructure and it is likely to experience some of the same circumstances as the above. Like the other categories, MLOps plays an essential role in the overall stack, and it is propelled by the rising importance of ML/AI in the enterprise. However, there is a very large number of companies in the category, most of which are well funded but early on the revenue front. They started from different places (model building, feature stores, deployment, transparency, etc.) but as they try to go from single-feature to a broader platform, they are on a collision course with each other. Also, many of the current MLOps companies have primarily focused on selling to scale-ups and tech companies. As they go upmarket, they may start bumping into the enterprise AI platforms that have been selling to Global 2000 companies for a while, like Dataiku, Datarobot, H2O, as well as the cloud hyperscalers. For an interesting glimpse into MLOps, especially on the trust and explainability side, see my Data Driven NYC conversation with Krishna Gade, CEO of Fiddler.
The Modern Data Stack under pressure
A hallmark of the last few years has been the rise of the “Modern Data Stack” (MDS). Part architecture, part de facto marketing alliance amongst vendors, the MDS is a series of modern, cloud-based tools to collect, store, transform and analyze data. At the center of it, there’s the cloud data warehouse (Snowflake, etc.). Before the data warehouse, there are various tools (Fivetran, Matillion, Airbyte, Meltano, etc.) to extract data from their original sources and dump it into the data warehouse. At the warehouse level, there are other tools to transform data, the “T” in what used to be known as ETL (extract, transform, load) and has been reordered to ELT (here dbt Labs reigns largely supreme). After the data warehouse, there are other tools to analyze the data (that’s the world of BI, for business intelligence), or to extract the transformed data and plug it back into SaaS applications (a process known as “reverse ETL”).
In other words, a real assembly chain, with many tools handling different stages of the process:
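To make that assembly chain concrete, here is a minimal ELT sketch in Python. It is purely illustrative: an in-memory SQLite database stands in for the cloud warehouse, the loading step stands in for what a Fivetran- or Airbyte-style tool automates, and the SQL model stands in for a dbt-style transformation (table and column names are invented):

```python
import sqlite3

# Stand-in "warehouse" (a real stack would use Snowflake, BigQuery, etc.)
warehouse = sqlite3.connect(":memory:")

# 1. Extract + Load: dump raw source records as-is (the "EL" of ELT).
raw_orders = [("o1", "alice", 120.0), ("o2", "bob", 80.0), ("o3", "alice", 50.0)]
warehouse.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# 2. Transform inside the warehouse (the "T", where dbt-style SQL models live).
warehouse.execute("""
    CREATE TABLE customer_revenue AS
    SELECT customer, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY customer
""")

# 3. Downstream, a BI or reverse ETL tool reads the modeled table.
print(dict(warehouse.execute("SELECT * FROM customer_revenue ORDER BY customer")))
# → {'alice': 170.0, 'bob': 80.0}
```

The key point of the ELT ordering is step 2: raw data lands in the warehouse first, and transformation happens there in SQL, rather than in a pipeline before loading.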
Up until recently, the MDS was a growing and very cooperative world. As Snowflake’s fortunes kept rising, so would the entire ecosystem around it.
Now, the world has changed. As cost control becomes paramount, some may question the philosophy that has been at the heart of the modern approach to data management since the Hadoop days – keep all your data, dump it all somewhere (a data lake, lakehouse or warehouse) and figure out what to do with it later. This approach led to the rise of data warehouses, the centerpiece of the MDS, but it has turned out to be expensive, and not always that useful (read this good piece: “Big Data is Dead”). New technologies like DuckDB, which enable embedded interactive analytics, offer a possible new approach to OLAP (analytics).
The MDS is now under pressure. In a world of tight budgets and rationalization, it is almost too obvious a target. It’s complex (as customers need to stitch everything together and deal with multiple vendors). It’s expensive (lots of copying and moving data; every vendor in the chain wants their revenue and margin; customers often need an in-house team of data engineers to make it all work, etc.). And it is, arguably, elitist (as those are the most bleeding-edge, best-of-breed tools, serving the needs of the most sophisticated users with the most advanced use cases).
As pressure increases, what happens when MDS companies stop being friendly and start competing with one another for smaller customer budgets?
As an aside, the complexity of the MDS has given rise to a new category of vendors that “package” various products under one fully managed platform (as mentioned above, we created a new box in the 2023 MAD featuring companies like Y42 or Mozart Data). The underlying vendors are some of the usual suspects in MDS, the benefit of those platforms being that they abstract away both the business complexity of managing those vendors individually and the technical complexity of stitching together the various solutions. Worth noting that some fully managed platforms have built the whole suite of functionalities themselves and don’t package third party vendors.
The end of ETL?
As a twist on the above, there’s a parallel discussion in data circles as to whether ETL should even be part of data infrastructure going forward. ETL, even with modern tools, is a painful, expensive and time consuming part of data engineering.
At its re:Invent conference last November, Amazon asked: “What if we could eliminate ETL entirely? That would be a world we would all love. This is our vision, what we’re calling a zero ETL future. And in this future, data integration is no longer a manual effort,” announcing a “zero-ETL” solution that tightly integrates Amazon Aurora with Amazon Redshift. Under that integration, within seconds of transactional data being written into Aurora, the data is available in Amazon Redshift.
The benefits of an integration like this are obvious – no need to build and maintain complex data pipelines, no duplicate data storage (which can be expensive), and always up-to-date.
Now, an integration between two Amazon databases is not in itself enough to spell the end of ETL, and there are reasons to be skeptical that a zero-ETL future will happen anytime soon.
But then again, Salesforce and Snowflake also announced a partnership to share customer data in real time across systems without moving or copying data, which falls under the same general logic. Before that, Stripe had launched a data pipeline to help users sync payments data with Redshift and Snowflake.
The concept of change data capture is not new, but it’s gaining steam. Google already supports change data capture in BigQuery. Azure Synapse does the same by pre-integrating Azure Data Factory. There is a rising generation of startups in the space like Estuary* and Upsolver.
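The mechanics behind change data capture can be illustrated with a toy sketch: instead of periodically re-extracting whole tables, the source emits a log of row-level changes, and the replica applies them in order to stay continuously in sync. The record format here is invented; real systems tail the database’s write-ahead log (or use tools like Debezium):

```python
# Toy change-data-capture loop: the source emits row-level changes,
# and the replica applies them in order -- no periodic bulk extract needed.
source_log = [
    {"op": "insert", "id": 1, "row": {"name": "alice", "balance": 100}},
    {"op": "insert", "id": 2, "row": {"name": "bob", "balance": 50}},
    {"op": "update", "id": 1, "row": {"name": "alice", "balance": 75}},
    {"op": "delete", "id": 2},
]

replica = {}
for change in source_log:  # in practice, a continuous stream, not a list
    if change["op"] in ("insert", "update"):
        replica[change["id"]] = change["row"]
    elif change["op"] == "delete":
        replica.pop(change["id"], None)

print(replica)  # the replica converges to the source's latest state
# → {1: {'name': 'alice', 'balance': 75}}
```

Applied changes are idempotent per key and ordered, which is what lets the analytical copy stay within seconds of the transactional source.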
Our sense is that we’re a long way from ETL disappearing as a category, but the trend is noteworthy.
Reverse ETL vs CDP
Another somewhat-in-the-weeds, but fun to watch part of the landscape has been the tension between Reverse ETL (again, the process of taking data out of the warehouse and putting it back into SaaS and other applications) and Customer Data Platforms (products that aggregate customer data from multiple sources, run analytics on them like segmentation, and enable actions like marketing campaigns).
Over the last year or so, the two categories started converging into one another.
Reverse ETL companies presumably learned that “just” being a pipeline on top of a data warehouse (not an easy technical feat) wasn’t commanding enough wallet share from customers, and that they needed to go further in providing value around customer data. Many reverse ETL vendors now position themselves as CDPs from a marketing standpoint.
Meanwhile, CDP vendors learned that being yet another repository where customers needed to copy massive amounts of data was at odds with the general trend of centralizing data around the data warehouse (or lake or lakehouse). Therefore, CDP vendors started offering integrations with the main data warehouse and lakehouse providers. See for example ActionIQ* launching HybridCompute, mParticle launching Warehouse Sync, or Segment introducing reverse ETL capabilities. As they beef up their own reverse ETL capabilities, CDP companies are now starting to sell to a more technical audience of CIOs and analytics teams, in addition to their historical buyers (CMOs).
Where does this leave Reverse ETL companies? One way they could evolve is to become more deeply integrated with the ETL providers, which we discussed above. Another way would be to further evolve towards becoming a CDP by adding analytics and orchestration modules.
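Mechanically, reverse ETL is the mirror image of the ELT pipeline: read an already-modeled table out of the warehouse and push each row into a SaaS tool’s API. A minimal sketch, with an in-memory SQLite database standing in for the warehouse and a hypothetical `push_to_crm` function standing in for a CRM’s REST endpoint (all names invented):

```python
import sqlite3

# Stand-in warehouse holding an already-transformed, "modeled" table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer_segments (email TEXT, segment TEXT)")
warehouse.executemany(
    "INSERT INTO customer_segments VALUES (?, ?)",
    [("a@example.com", "vip"), ("b@example.com", "churn_risk")],
)

synced = []

def push_to_crm(record):
    # Hypothetical stand-in for a SaaS API call (e.g. updating a CRM contact).
    synced.append(record)

# The reverse ETL step: read from the warehouse, write back into the SaaS tool,
# so marketing and sales teams see warehouse-computed attributes in their apps.
for email, segment in warehouse.execute("SELECT email, segment FROM customer_segments"):
    push_to_crm({"email": email, "segment": segment})

print(len(synced))  # → 2
```

The hard parts in production (incremental syncs, rate limits, per-destination field mappings) are exactly where vendors differentiate.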
Data mesh, products, contracts: dealing with organizational complexity
As just about any data practitioner knows firsthand: success with data is certainly a technical and product effort, but it also very much revolves around process and organizational issues.
In many organizations, the data stack looks like a mini-version of the MAD landscape. You end up with a variety of teams working on a variety of products. So how does it all work together? Who’s in charge of what?
Debate has been raging in data circles about how best to go about it. There are a lot of nuances and a lot of discussions, with smart people disagreeing on, well, just about every part of it – but here’s a quick overview.
We had highlighted the data mesh as an emerging trend in the 2021 MAD landscape. It’s only been gaining traction since. The data mesh is a distributed, decentralized (not in the crypto sense) approach to managing data tools and teams. See our Data Driven NYC Fireside Chat: Zhamak Dehghani, the originator of the concept (and now CEO of NextData).
Note how it’s different from a data fabric – a more technical concept, basically a single framework to connect all data sources within the enterprise, regardless of where they’re physically located.
The data mesh leads to a concept of data products – which could be anything from a curated data set to an application or an API. The basic idea is that each team that creates the data product is fully responsible for it (including quality, uptime, etc). Business units within the enterprise then consume the data product on a self-service basis.
A related idea is data contracts – “API-like agreements between software engineers who own services and data consumers that understand how the business works in order to generate well-modeled, high-quality, trusted, real-time data” (read: “The Rise of Data Contracts”). There’s been all sorts of fun debates about the concept (watch: “Data Contract Battle Royale w/ Chad Sanderson vs Ethan Aaron”). The essence of the discussion is whether data contracts only make sense in very large, very decentralized organizations, as opposed to 90% of smaller companies.
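In practice, a data contract often boils down to a schema the producing team validates against before a record leaves their service. A minimal sketch of that idea, with invented field names, where the contract pins down both the set of fields and their types:

```python
# A minimal data contract: the producing team promises field names and types,
# and validates every record before publishing it (field names are invented).
CONTRACT = {"user_id": str, "signup_ts": int, "plan": str}

def satisfies_contract(record, contract=CONTRACT):
    """True if the record has exactly the contracted fields, correctly typed."""
    return (
        set(record) == set(contract)
        and all(isinstance(record[field], typ) for field, typ in contract.items())
    )

good = {"user_id": "u42", "signup_ts": 1700000000, "plan": "pro"}
bad = {"user_id": "u43", "plan": "free"}  # missing signup_ts: breaks the contract

print(satisfies_contract(good), satisfies_contract(bad))  # → True False
```

Real implementations use schema registries and richer type systems, but the core promise is the same: downstream consumers can rely on the shape of the data, not just its availability.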
Overall: A general trend towards convergence
Throughout this section, we’ve danced around the same theme – an overall need for simplification in data infrastructure, for the ultimate benefit of the customer.
Some of the simplification will be company-driven – companies adding more features and functionality to their product line.
Some of it will be market-driven – companies consolidating through acquisitions and mergers or, sadly, going out of business.
Lastly, some has been, and will continue to be, technology-driven. The convergence of streaming and batch processing is an evergreen and important theme. So is the convergence of transactional (OLTP) and analytical (OLAP) workloads. AlloyDB from Google is the latest entrant in that field, claiming to be 100x faster than standard PostgreSQL for analytical queries. And Snowflake launched Unistore, offering lightweight (for now) transaction processing capabilities, yet another step in an overall journey towards breaking down silos between transactional and analytical data.
Bonus: How will AI impact data infrastructure?
With the current explosive progress in AI, here’s a fun question: data infrastructure has certainly been powering AI, but will AI now in turn impact data infrastructure?
For sure, some data infrastructure providers have already been using AI for a while – see for example, Anomalo leveraging ML to identify data quality issues in the data warehouse. And many database vendors now embed auto-ML capabilities.
But with the rise of Large Language Models, there’s a new interesting angle. Just as LLMs can generate conventional programming code, they can also generate SQL, the language of data analysts. The idea of enabling non-technical users to query analytical systems is not new, and various providers already support variations of it – see ThoughtSpot, Power BI or Tableau. Here are some good pieces on the topic: LLM Implications on Analytics (and Analysts!) by Tristan Handy of dbt Labs and The Rapture and the Reckoning by Benn Stancil of Mode.
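A sketch of how text-to-SQL products typically assemble their request: the table schema plus the analyst’s question are combined into a prompt. The schema, question and prompt template here are all invented for illustration, and the actual model call is deliberately left out – a real product would send `prompt` to an LLM API, then validate and run the returned SQL:

```python
# Illustrative text-to-SQL prompt assembly (schema and wording are invented).
SCHEMA = "orders(order_id TEXT, customer TEXT, amount REAL, created_at DATE)"

def build_prompt(question, schema=SCHEMA):
    # Real products add richer context: sample rows, join paths, metric definitions.
    return (
        f"Given the table {schema},\n"
        f"write a SQL query answering: {question}\n"
        "Return only SQL."
    )

prompt = build_prompt("What was total revenue per customer last month?")
print(prompt)
```

Much of the engineering in this category is around what goes into that prompt (and how the output is checked), not the model itself – which is why semantic layers and metrics stores are often cited as the missing ingredient for trustworthy LLM-generated SQL.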
READ NEXT: MAD 2023, PART IV: TRENDS IN ML/AI