Part I of the 2019 Data & AI Landscape covered issues around the societal impact of data and AI, and included the landscape chart itself. In this Part II, we’re going to dive into some of the main industry trends in data and AI.
The data and AI ecosystem continues to be one of the most exciting areas of technology. Not only does it have its own explosive momentum, but it also powers and accelerates innovation in many other areas (consumer applications, gaming, transportation, etc). As such, its overall impact is immense, and goes well beyond the technical discussions below.
Of course, no meaningful trend unfolds over the course of just one year, and many of the following have been years in the making. We’ll focus the discussion on trends that we have seen particularly accelerating in 2019, or gaining rapid prominence in industry conversations.
We will loosely follow the order of the landscape, from left to right: infrastructure, analytics and applications.
We see three big trends in infrastructure:
- A third wave? From Hadoop to cloud services to Kubernetes
- Data governance, cataloging, lineage: data management is increasingly important
- The rise of an AI-specific infrastructure stack
The data infrastructure world continues its own rapid evolution. The main arc here, which has been playing out for years but seems to be accelerating, is a three-phase transition from Hadoop to cloud services to a hybrid/Kubernetes environment.
Hadoop is very much the “OG” of the Big Data world, with roots going back to Google’s October 2003 Google File System paper. A framework for distributed storage and processing of massive amounts of data using a network of computers, it played an absolutely central role in the explosion of the data ecosystem.
Over the last few years, however, it has become a bit of a sport among industry watchers to pronounce Hadoop dead. This trend accelerated further this year, as Hadoop vendors ran into all sorts of trouble. At the time of writing, MapR has been on the brink of shutting down and may have found a buyer. Cloudera and Hortonworks, fresh off their $5.2B merger, had a rough day in June when the stock plummeted 40% on disappointing quarterly earnings. Cloudera has announced a variety of cloud and hybrid products, but they have not launched yet.
Hadoop is running into increasing headwinds as a direct result of competition from cloud platforms. Hadoop was developed at a time when the cloud was not a serious option, most data was on-premise, network latency was a real bottleneck and therefore keeping data and compute co-located made a lot of sense. The world has now changed.
However, it is unlikely that Hadoop is going to go away anytime soon. Its adoption may slow down, but the sheer magnitude of its deployment across enterprises will give it inertia and staying power for years to come.
Regardless, the transition to the cloud is clearly accelerating. Anecdotally, in our conversations with Fortune 1000 executives, 2019 has felt like a real shift. Over the last few years, it was almost a dirty secret that, for all the talk about the cloud, the real action was on-prem, especially in regulated industries. Today, many of those same Fortune 1000 executives are actively moving to the cloud, with a particular segment of activity involving traditional Microsoft shops making the switch to Azure.
As a result, the cloud providers continue to grow rapidly, despite their already massive scale. AWS generated $25.7 billion in revenue in 2018, up 46.9% from $17.5 billion in 2017. Microsoft Azure’s revenues aren’t disclosed separately but grew 73% yoy for the quarter ended March 2019. Not a perfect comp, but AWS’ revenue grew 41% yoy for the same quarter.
While cloud usage deepens, customers are beginning to balk at costs. In board rooms all around the world, executives have suddenly taken notice of a line item that used to be small and has now snowballed very rapidly: their cloud bill. The cloud does offer agility, but it can often come at a high price, particularly if customers take their eye off the meter or fail to accurately forecast their computing needs. There are many stories of AWS customers like Adobe and Capital One that saw their bill grow 60%+ over just one year between 2017 and 2018, to well over $200M.
Costs, as well as concerns over vendor lock-in, have precipitated the evolution towards a hybrid approach, involving a combination of public cloud, private cloud and on-prem. Faced with a myriad of options, enterprises will increasingly select the best tool for the job to optimize performance and economics. As cloud providers more aggressively differentiate themselves, enterprises are adapting with multi-cloud strategies that leverage what each cloud provider is best at. And in some cases, the best approach is to keep (or even repatriate) some workloads on-premises in order to optimize economics, especially for non-dynamic workloads.
Interestingly, cloud providers are adapting to the reality that enterprise computing will occur in a mix of environments by providing tools such as AWS Outposts, which allows customers to run compute and storage on-premises and seamlessly integrate those on-premise workloads with the rest of their applications in the AWS cloud.
In this new multi-cloud and hybrid cloud era, the rising superstar is undoubtedly Kubernetes. A project for managing containerized workloads and services open sourced by Google in 2014, Kubernetes is experiencing the same fervor as Hadoop did a few years ago, with 8,000 attendees at its KubeCon event, and a never ending stream of blog posts and podcasts. Many analysts believe that Red Hat’s prominence in the Kubernetes world largely contributed to its massive acquisition by IBM for $34B. The promise of Kubernetes is very much to help enterprises run their workloads across their own datacenter and private cloud, as well as one or several public clouds.
As an orchestration framework that’s particularly apt at managing complex, hybrid environments, Kubernetes is also becoming an increasingly attractive option for machine learning. It gives data scientists the flexibility to choose whichever language, machine learning library or framework they prefer, and to train and scale models with comparatively rapid iteration and strong reproducibility, all without having to be infrastructure experts, and with the same infrastructure serving multiple users (more here). Kubeflow, a machine learning toolkit for Kubernetes, has been gaining rapid momentum.
Kubernetes is still relatively nascent, but interestingly, the above could signal an evolution away from the cloud machine learning services, as data scientists may prefer the overall flexibility and controllability of Kubernetes. We could be entering a third paradigm shift for data science and ML infrastructure, from Hadoop (up until 2017?) to data cloud services (2017-2019) to a world dominated by Kubernetes and next-generation data warehouses like Snowflake (2019-?).
The flipside of this evolution is increased complexity. There is certainly an opportunity to provide a full platform that would abstract away much of the underlying cloud infrastructure complexity and make this brave new world accessible to a broader group of data scientists and analysts.
Serverless is one attempt at such simplification, albeit with a different angle. This execution model enables users to write and deploy code without the hassle of worrying about the underlying infrastructure. The cloud provider handles all backend services and the customer is charged based on what they actually use. Serverless has certainly been a key emerging topic in the last couple of years, and this is another new category we’ve added to this year’s Data & AI Landscape. However, the applicability of serverless to machine learning and data science is still very much a work in progress, with companies like Algorithmia and Iguazio/Nuclio being early entrants.
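To make the execution model concrete: in a serverless setup, the developer writes only a handler function that the platform invokes on demand. Here is a minimal sketch in the style of an AWS Lambda Python handler; the event payload shape is invented for illustration.

```python
import json

def handler(event, context):
    """Entry point invoked by the serverless platform.

    The provider provisions, scales, and bills per invocation;
    the developer writes only this function.
    """
    # 'event' carries the request payload; its exact shape depends
    # on the trigger (HTTP gateway, queue, scheduled job, etc.).
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Invoked locally here for illustration; in production the platform calls it.
print(handler({"name": "data scientist"}, None))
```

The appeal for data workloads is the same as for web workloads: no clusters to size or patch, and cost scales to zero when nothing runs. The open question, as noted above, is whether long-running, resource-hungry ML training fits this model.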
Another consequence of the increasingly hybrid nature of the data environment in the enterprise is the need to ramp up efforts to gain control of one’s data.
In a world where some data lives in a data warehouse, some in a data lake, some in various other sources, across on-prem, private cloud and public cloud, how do you find, curate, control and trace data? Those efforts take various related forms and names, including data querying, data governance, data cataloging and data lineage, all of which are gaining increasing importance and prominence.
Querying data across a hybrid environment is its own challenge, with solutions that fall within the general trend of separating storage and compute (see this video from Starburst Data, a company offering an enterprise version of SQL query engine Presto, from our Data Driven NYC event).
Data governance is another area that’s rapidly becoming top of mind in the enterprise. The general idea of data governance is to manage one’s data and make sure that it is of high quality throughout its lifecycle. It touches on areas such as data availability, usability, consistency, integrity and security. Notably, in early 2019, Collibra raised a $100M round at over a $1B valuation.
Data catalogs are another increasingly important flavor of data management. Effectively data catalogs are dictionaries that synthesize an enterprise’s various data assets. They enable users, including data scientists, data analysts, developers and business users, to discover and consume data in a self-service context. See this good description by leading vendor Alation.
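The “dictionary of data assets” idea can be sketched in a few lines: a catalog is essentially a searchable registry of metadata, with enough structure to support self-service discovery. The field names below are purely illustrative, not any vendor’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataAsset:
    """Metadata about one data asset, not the data itself."""
    name: str
    location: str                  # e.g. a warehouse table or lake path
    owner: str
    tags: list = field(default_factory=list)

class DataCatalog:
    """Minimal registry enabling discovery by keyword or tag."""
    def __init__(self):
        self._assets = {}

    def register(self, asset: DataAsset):
        self._assets[asset.name] = asset

    def search(self, keyword: str):
        kw = keyword.lower()
        return [a for a in self._assets.values()
                if kw in a.name.lower()
                or any(kw in t.lower() for t in a.tags)]

catalog = DataCatalog()
catalog.register(DataAsset("orders_2019", "s3://lake/orders/", "finance",
                           tags=["sales", "pii"]))
print([a.name for a in catalog.search("sales")])  # -> ['orders_2019']
```

Real catalog products layer much more on top (lineage links, access policies, popularity signals, automated metadata harvesting), but the core abstraction is this registry.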
Finally, data lineage is perhaps the most recent category of data management to emerge. Data lineage is meant to capture the “journey of data” across the enterprise. It helps companies figure out how data was gathered, and how it was modified and shared across its lifecycle. The growth of this segment is driven by a number of factors including the increasing importance of compliance, privacy and ethics, as well as the need for reproducibility and transparency of machine learning pipelines and models. Here’s a good podcast on the topic from O’Reilly.
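Conceptually, lineage boils down to recording, for each derived dataset, which inputs and transformation produced it, so the journey can be traced backwards to original sources. A toy sketch, with made-up dataset and transformation names:

```python
# Lineage graph: each dataset maps to (transformation, input datasets).
lineage = {
    "raw_events":     (None, []),
    "clean_events":   ("dedupe_and_filter", ["raw_events"]),
    "daily_rollup":   ("aggregate_by_day", ["clean_events"]),
    "churn_features": ("feature_pipeline", ["daily_rollup", "clean_events"]),
}

def trace(dataset, graph=lineage):
    """Walk upstream to every source dataset that contributed to `dataset`."""
    transform, inputs = graph[dataset]
    if not inputs:
        return {dataset}
    sources = set()
    for parent in inputs:
        sources |= trace(parent, graph)
    return sources

print(trace("churn_features"))  # -> {'raw_events'}
```

This is exactly the kind of query a compliance officer (“where did this PII come from?”) or an ML engineer (“which raw inputs fed this model?”) needs to be able to run, which is why lineage is driven by both regulation and ML reproducibility.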
The final key trend that has been accelerating this year is the continued emergence of an AI-specific infrastructure stack.
The need to manage AI pipelines and models has given rise to the rapidly growing MLOps (or AIOps) category. To acknowledge this new-ish trend, we have added two new boxes to this year’s Landscape, one under Infrastructure (with various early stage startups including Algorithmia, Spell, Weights & Biases, etc.) and one under Open Source (with a variety of projects, typically fairly early as well, including Pachyderm, Seldon, Snorkel, MLeap, etc.).
ML engineers need to be able to run experiments and rapidly iterate, accessing resources such as GPUs when needed. At our Data Driven NYC event, we have featured a number of early stage startups providing such infrastructure including Spell (video), Comet (video), Paperspace (video).
AI is having a profound impact on infrastructure even at the lower levels of the stack, with the rise of GPU databases and the birth of a new generation of AI chips (Graphcore, Cerebras, etc.). AI may be forcing us to rethink the entire nature of compute.
In analytics, we’ll highlight a couple of key trends:
- Business Intelligence (BI) is consolidating
- The action is moving to Enterprise AI platforms
- Horizontal AI continues to be very vibrant
In business intelligence, the unmistakable trend of the last few months has been the burst of consolidation activity that we mentioned earlier in this post, with the acquisitions of Tableau, Looker, Zoomdata and Clearstory, as well as the merger between Sisense and Periscope Data (Harry Glaser, CEO of Periscope Data, had spoken at Data Driven NYC last year).
With the benefit of 20/20 hindsight, consolidation in BI was somewhat inevitable, as the data visualization and self-service analytics space had commoditized, with a plethora of pure-play vendors. Every vendor, big and small, was under pressure to diversify and expand capabilities. For cloud acquirers, those new product lines will certainly add revenue, but more importantly, they have attachment power, as yet another tool to help generate core platform revenue.
Will there be more consolidation in BI? Microsoft has a strong position with Power BI, but M&A markets can have their own dynamic when an entire segment consolidates and every company effectively is in play. AWS may have a stronger product need, considering its QuickSight BI is generally thought to be a bit behind.
As BI consolidates, the heat continues to increase in the data science and machine learning platform segments. The deployment of ML/AI in the enterprise is a mega-trend that is still in its early innings, and various players are rushing to build the platform of choice.
For most companies in the space, the clear goal is to facilitate the democratization of ML/AI, making its benefits accessible to larger groups of users and companies, in a context where the ongoing talent shortage in ML/AI continues to be a major bottleneck to broad adoption. However, different players have different strategies.
One approach is AutoML. It involves automating entire parts of the machine learning lifecycle, including some of the most tedious ones. Depending on the product, AutoML will handle anything from feature generation and engineering to algorithm selection, model training, deployment and monitoring. DataRobot, an AutoML specialist, raised a $100M Series D (and allegedly more since) following our 2018 Landscape.
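At its core, the AutoML idea is an automated search over candidate models, each scored on held-out data, with the best one kept. A deliberately tiny sketch (real products search vastly larger spaces and also automate feature engineering, deployment and monitoring):

```python
# Toy training data: y is roughly 2x, with held-out validation points.
train = [(1, 2.1), (2, 3.9), (3, 6.2)]
valid = [(4, 8.1), (5, 9.8)]

# Candidate "models": name -> prediction function of x.
candidates = {
    "constant_mean": lambda x: sum(y for _, y in train) / len(train),
    "linear_2x":     lambda x: 2 * x,
    "linear_3x":     lambda x: 3 * x,
}

def validation_error(predict):
    """Mean squared error on the held-out validation set."""
    return sum((predict(x) - y) ** 2 for x, y in valid) / len(valid)

# The AutoML-style loop: score every candidate, keep the best.
best_name = min(candidates, key=lambda n: validation_error(candidates[n]))
print(best_name)  # -> linear_2x
```

The point of products like DataRobot is that this loop, scaled up and combined with automated feature engineering, removes much of the tedious trial-and-error that otherwise requires a scarce ML expert.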
Other companies in the space, such as Dataiku, H2O and RapidMiner, offer platforms that include AutoML but provide broader capabilities as well. Dataiku, for example, raised a large $101M Series C since our 2018 Landscape, with an overall philosophy of empowering entire data teams (both data scientists and data analysts) and abstracting away much of the complexity and tediousness involved in handling the entire lifecycle of data (for a great overview, see this video of a presentation by Florian Douetteau, CEO of Dataiku) [Disclaimer: FirstMark is an investor in Dataiku].
The cloud providers are of course active, with Microsoft’s Azure Machine Learning Studio, Google’s Cloud AutoML and AWS SageMaker. Despite the might of the cloud providers, those products are still reasonably narrow in scope: generally hard to use and largely targeting very technical, advanced users. They’re also still very much nascent. SageMaker, Amazon’s cloud machine learning platform, reportedly had a slow start in 2018, with only $11M in sales to the commercial sector.
Some cloud providers are actively partnering with pure play players in the space: Microsoft participated in the $250M Series E of Databricks, perhaps a prelude to a future acquisition.
Beyond the enterprise AI platforms, the world of horizontal AI (including computer vision, NLP, voice, etc.) continues to be incredibly vibrant.
We had covered the world of AI research in a previous post: Frontier AI: How far are we from artificial “general” intelligence, really?.
Since that post, some of the key trends in AI include:
- Major improvements in NLP, particularly through the application of transfer learning (which involves training a model on a large amount of data, then porting it and fine-tuning it for the specific problem one is working on) to make models work with less data: see ELMo, ULMFiT and, most importantly, BERT from Google AI
- More efforts to make AI work with less data, including one-shot learning
- Combining deep learning with reinforcement learning
- Continued progress in GANs
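The transfer-learning idea in the first bullet can be illustrated in miniature: pretrain on plenty of data from a related task, then continue training the same weights on a small task-specific dataset. A one-dimensional toy, nothing like the scale of BERT, but the mechanism is the same:

```python
def sgd_fit(data, w=0.0, lr=0.01, epochs=200):
    """Fit y ~ w * x by stochastic gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

# "Pretraining": plenty of data from a related task (true slope 2.0).
pretrain_data = [(x, 2.0 * x) for x in range(1, 10)]
w_pretrained = sgd_fit(pretrain_data)

# "Fine-tuning": only two examples from the target task (slope 2.2).
# Starting from the pretrained weight, the model converges to the new
# slope despite the tiny dataset.
finetune_data = [(1, 2.2), (2, 4.4)]
w_finetuned = sgd_fit(finetune_data, w=w_pretrained, epochs=100)
print(round(w_finetuned, 2))  # -> 2.2
```

In modern NLP the “weight” being transferred is an entire language model trained on web-scale text, and the fine-tuning data is the small labeled dataset for the task at hand; that is what lets BERT-style models work with far less task-specific data.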
As we complete our journey through the 2019 landscape from the left to the right of the chart, a couple of key trends to highlight in applications:
- ML/AI hits the deployment phase in the enterprise
- The rise of enterprise automation and RPA
At this stage, we are probably 3 or 4 years into a journey of trying to build ML/AI applications for the enterprise.
There were certainly some awkward product attempts (first generation chatbots) and some big marketing claims well ahead of reality, especially from older companies trying to retrofit ML/AI into existing products.
But, bit by bit, we’ve entered the deployment phase of ML/AI in the enterprise, going from curiosity and experimentation to actual use in production. The trend for the next few years seems clear: take a given problem, see if ML/AI (more often than not, deep learning, or a variation thereof) can make a difference, and if so, build an AI application to address the problem more effectively.
This deployment phase will occur in a variety of ways. Some products will be built and deployed by internal teams using the enterprise AI platforms mentioned above. Others will be full-stack products with embedded AI, offered by various vendors, where the AI part might be largely invisible to the customer. Yet others will be provided by vendors offering a mix of products and services (for an example of this approach, see this talk by Jean-Francois Gagne, CEO of Element AI).
Certainly, it is still very much early days. Internal teams often started with discrete projects addressing one use case (e.g., churn prediction), and are starting to expand to other problems. Many startups building ML/AI applications are still learning about the challenges of going from R&D mode to a fully scaled out operation (I wrote a few thoughts on the topic in this earlier blog post: Scaling AI Startups).
However, maturity is coming. There’s been a tremendous amount of learning for anyone deploying ML/AI in real life applications in the last few years, about what the technology can and cannot do, and we are starting to get a better sense for the right allocation of tasks between the machine and the human. See this talk by Dennis Mortensen, CEO of x.ai, about lessons learned building one of the first AI applications out there. Next generation customer service chatbots, for example, offer a much smarter mix between ML/AI and configurability and transparency, for the ultimate benefit of end users. See this great talk on the topic by Mike Murchison, CEO of Ada, an emerging leader in Automated Customer Experience at Data Driven NYC. [Disclaimer: FirstMark is an investor in both x.ai and Ada]
Projecting into the future, as ML/AI gradually becomes pervasive with the support of an increasingly high performance data stack, are we seeing the dawn of the fully automated enterprise?
Since the dawn of Information Technology, enterprises have been plagued by siloization, with various systems and data spread across departments, unable to communicate with each other (which gave rise to the massive system integration services industry), and humans acting as the “glue” in between. In a world where data and systems become increasingly integrated, and ML/AI makes it possible to gradually remove humans from certain functions, it becomes more possible than ever to imagine enterprises functioning in an increasingly automated, systematic way.
For example, imagine an automated enterprise where an increase in demand (predicted via ML) automatically triggers an increase in order from suppliers, which would be automatically recorded in the finance system (which could automatically compute and pay compensation bonuses, etc.); or an anticipated decrease in demand could automatically trigger a corresponding increase in performance marketing spend, etc.
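That closed loop can be sketched as a simple event-driven rule engine sitting on top of an ML forecast. Everything here, the systems, thresholds and numbers, is hypothetical, purely to make the example above concrete:

```python
UNIT_COST = 10.0        # hypothetical cost per unit ordered
MARKETING_BOOST = 5000  # hypothetical extra performance-marketing spend

def forecast_demand(history):
    """Stand-in for an ML model: naive trend extrapolation."""
    return history[-1] + (history[-1] - history[-2])

def run_automated_loop(history, current_inventory):
    """If predicted demand exceeds inventory, auto-order the gap and
    record the purchase in the finance system; otherwise boost marketing."""
    predicted = forecast_demand(history)
    actions = []
    if predicted > current_inventory:
        gap = predicted - current_inventory
        actions.append(("supplier_order", gap))
        actions.append(("finance_record", gap * UNIT_COST))
    else:
        actions.append(("increase_marketing_spend", MARKETING_BOOST))
    return actions

# Demand trending up past inventory -> order and book the expense.
print(run_automated_loop([100, 120], current_inventory=110))
```

The hard part in reality is not this glue logic but the integration: each tuple above stands in for a call into a separate enterprise system that today often requires a human in the loop.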
There is a futuristic world where enterprises become not only fully automated organizations, but eventually also self-healing and autonomous, a topic which we had explored in our presentation on AI and blockchain last year.
However, we’re far from that stage, and today’s reality is largely focused on RPA. This is a red hot category, with leaders such as UiPath and Automation Anywhere growing very fast and raising mega-rounds, as mentioned above.
RPA, short for Robotic Process Automation (although, perhaps disappointingly, it does not involve any actual robots), consists of taking generally very simple workflows, typically manual (performed by humans) and repetitive, and replacing them with software. A lot of RPA takes place in back office functions (e.g., invoice processing).
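In code terms, RPA is essentially rules-based scripting of what a human clerk would do by hand. A toy invoice-processing sketch, with an invoice format invented for illustration:

```python
import re

# Rules-based extraction: read the same fields a clerk would read.
INVOICE_PATTERN = re.compile(
    r"Invoice #(?P<number>\d+)\s+Vendor: (?P<vendor>.+?)\s+Total: \$(?P<total>[\d.]+)"
)

def process_invoice(text):
    """Mimic the manual workflow: parse fields, then hand them downstream."""
    match = INVOICE_PATTERN.search(text)
    if not match:
        # Exactly the brittleness discussed below: unexpected formats
        # break the rules and fall back to a human.
        raise ValueError("Unrecognized format: escalate to a human")
    return {
        "number": match.group("number"),
        "vendor": match.group("vendor"),
        "total": float(match.group("total")),
    }  # in a real deployment, posted to the accounting system

print(process_invoice("Invoice #42 Vendor: Acme Corp Total: $199.99"))
```

Note what this is and isn’t: there is no learning here, just rules, which is both why RPA delivers fast, measurable ROI and why critics call it a band-aid, as discussed below.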
RPA is propelled by a very strong tailwind around digital transformation that has been accelerating over the last couple of years in particular. Several RPA leaders had been around for years (UiPath was founded in 2005), but “suddenly” hit hockey stick growth when digital transformation became the topic du jour. It also offers a strong ROI as its implementation can be directly compared to the cost of humans performing the same task. RPA is also very attractive to the tech services behemoths because it involves a large amount of implementation services (as the software needs to be configured for a myriad different workflows); therefore RPA startups have benefited from strong partnerships with those large services firms.
There are perhaps reasons to be cynical about RPA. Some consider it to be a largely unintelligent “band-aid”, or a stopgap measure of sorts: take an inefficient workflow performed by humans, and just have the machine do it. From that perspective, RPA may simply be creating the next level of technical debt, and it is unclear what happens to automated RPA functions as the environment around them changes, other than requiring yet more RPA to reconfigure the old task for its new environment. RPA, at this stage at least, is more about automation than intelligence, more about rules-based solutions than AI (although several RPA vendors tout their AI capabilities in marketing materials).
RPA should be distinguished from intelligent automation, which is a more emerging category centered around ML/AI. Intelligent automation also targets enterprise processes and workflows, but it is more data centric than it is process centric, and will ultimately be able to learn, improve and heal.
One example of intelligent automation is intelligent document processing (IDP), a category where ML/AI can be leveraged to understand documents (forms, invoices, contracts, etc.) at levels comparable to or better than humans, except at massive scale. See this talk by Hyperscience at Data Driven NYC for more context [disclaimer: FirstMark is an investor in HyperScience].
It will be particularly interesting to observe those spaces in the next few years, and it is possible that RPA and intelligent automation will merge, either through M&A or through the launch of new homegrown products, unless the latter progresses so rapidly that it limits the need for the former.
1) As in every year, we couldn’t possibly fit all the companies we wanted on the chart. While the general philosophy of the chart is to be as inclusive as possible, we ended up having to be somewhat selective. Our methodology is certainly imperfect, but in a nutshell, here are the main criteria:
- All things being equal, we gave priority to companies that have reached some level of market significance. This is a reasonably easy exercise for large tech companies. For growing startups, considering the limited amounts of data available, we often used venture capital financings as a proxy for underlying market traction (again, probably imperfect). So, everything else being equal, we tend to feature startups that have raised larger amounts, typically Series A and beyond.
- Occasionally, we made editorial decisions to include earlier stage startups when we thought they were particularly interesting.
- On the application front, we gave priority to companies that explicitly leverage Big Data, machine learning and AI as a key component or differentiator of their offering. It is a tricky exercise at a time when companies are increasingly crafting their marketing around an AI message, but we did our best.
- This year as in previous years, we removed a number of companies. One key reason for removal is that the company was acquired, and not run by the acquirer as an independent company. In some select cases, we left the acquired company as is in the chart when we felt that the brand would be preserved as a reasonably separate offering from that of the acquiring company.
2) As always, it is inevitable that we inadvertently missed some great companies in the process of putting this chart together. Did we miss yours? Feel free to add thoughts and suggestions in the comments.
3) As we get a lot of requests every year: feel free to use the chart in books, conferences, presentations, etc – two obvious asks: (i) do not alter/edit the chart and (ii) please provide clear attribution (Matt Turck, Lisa Xu and FirstMark Capital).
4) Disclaimer: I’m an investor through FirstMark in a number of companies mentioned on this 2019 Data & AI Landscape, specifically: ActionIQ, Ada, Cockroach Labs, Dataiku, Frame.ai, Helium, HyperScience, Kinsa, Text IQ, Timber, Sense360 and x.ai. Other FirstMark portfolio companies mentioned on this chart include Bluecore, Engagio, Graffiti, HowGood, Payoff, Knewton, Insikt, Optimus Ride, and Tubular. I’m a small personal shareholder in Datadog.