As the interest in all things big data continues to increase, I’ve had a few chats recently with executives and entrepreneurs looking to learn more about the space, where I was asked about the trends and opportunities I see. Wide topic obviously, but I figured I’d jot down a few notes about what I’ve been hearing, reading and thinking.
I see opportunities in the data space unfolding in several “waves.” Of course, reality often resists attempts at this type of categorization, and it is unlikely those waves will happen in a neatly organized sequential order; elements of each wave already exist, and it’s possible all of this will happen more or less at the same time. However, I still find it helpful to have this type of framework in mind to understand an industry that is rapidly changing.
First wave: Big data infrastructure
Right now, the whole “big data” discussion is very much about core technology. Look up the agendas for big data conferences like Strata or Structure:Data and you’ll see: it’s all about software and data science, fascinating stuff but very technical and generally hard to understand for anyone who’s not deeply versed in those topics. Core big data technologies may have originated from consumer internet companies, but at this stage there’s not much that feels “consumery” about big data.
The reason for this is that we’re still early in building the big data infrastructure, and there’s a lot to figure out, before much else can happen. If the fundamental premise of big data – that all current solutions break past a certain volume (or velocity or variety) of data – holds true, then a whole new ecosystem needs to be reinvented. We’ve made a lot of progress in the last few years, but there are still a lot of nuts to crack, for example: How do you process big data in real time? How do you clean up large data sets at scale? How do you transfer large volumes of data to the cloud and process it there? How do you simplify big data tools to make them approachable by a larger number of software engineers and business users?
As a result, much of the innovation has been happening at the infrastructure level. Note that I mean “infrastructure” in the broadest sense – basically all the pieces of “plumbing” necessary to process big data and derive insights from it. That includes infrastructure per se (for example, the Clouderas and Hortonworks of the world, the various NoSQL companies, etc.), but also analytics (Platfora, Continuuity, DataSift, etc.), data marketplaces (Factual, DataMarket, etc.), crowdsourcing players (Kaggle, CrowdControl, etc.) and even devices (sensors, personal data capture devices).
This is a time of tremendous opportunities for new entrants. Large technology vendors are going to struggle with big data, in part because the underlying technologies are very different, and in part because they’ve been making a lot of money so far selling expensive solutions to process comparatively smaller data sets – some of the new entrants claim to be up to an order of magnitude cheaper than the Oracles of the world. Large companies have made some interesting moves (Oracle partnering with Cloudera, Microsoft announcing support for Hadoop) but presumably, they will delay the inevitable for the most part, and this will lead to plenty of attractive acquisition opportunities for startups and their investors over the next few years.
Equally, it is also a time of confusion for anyone trying to figure out who the real success stories will be:
- There’s a lot of noise, and this is only going to accelerate as VC money continues to pour into the industry. Also, the fact that older, larger companies seem to be racing to rebrand as big data companies doesn’t help.
- There’s a fair number of “science projects” out there – companies that, at least for now, seem to be focused on solving an engineering issue but haven’t quite thought through their commercial applicability. At our recent NYC Data Business Meetup, Kirill Sheynkman of RTP Ventures made a powerful case that big data for big data’s sake does not a company make (“Big data… so what?”).
- It is going to take a while for winners to emerge – unlike consumer internet startups that can experience hockey stick growth from inception, software startups go through generally slower adoption cycles (consumerization of IT notwithstanding). Also, the abundantly documented (but presumably temporary) shortage of Hadoop engineers and data scientists may somewhat slow down the widespread adoption of those technologies.
- The surge in interest about all things big data will inevitably lead to some level of disillusionment (I assume Gartner has a nice hype cycle chart describing this), as projects turn out to be harder and more time consuming than expected, and sometimes underwhelm their sponsors. Startups will have to struggle through that phase, which may slow things down as well.
Sooner or later, of course, winners will emerge, and what seems to us like daunting technical challenges will become something that any qualified software engineer will be able to handle, equipped with reasonably simple and cheap tools. There’s always a slight irony to underlying technologies: their ultimate sign of success is that at some point they become a given, a starting point, a simple enabler. In a recent talk organized by the NYC Media Lab and held at Bloomberg, Hilary Mason mentioned that the future of data visualization is “boring”, meaning that it will eventually become commoditized and a simple tool. I believe this will probably be true eventually of the entire big data technology stack.
Second wave: “Big data enabled” applications and features
As core infrastructure issues are gradually being resolved, the next logical step is to focus on expanding the benefits of big data to a broader, non-technical audience within the enterprise, and to more consumers online.
Within the enterprise, we should see a lot of innovation around business applications. Enterprise software has always been to a large extent about enabling business end users to access and manipulate large amounts of data. “Big data enabled” enterprise applications will take this to the next level, offering business users unprecedented data mining and analysis opportunities, using larger volumes of internal data, in real time or close to it, and sometimes augmenting it with external data sets available through data marketplaces. This will happen across many different enterprise functions (finance, sales, marketing, HR, etc.) and across industries, from retail to healthcare to financial services.
The possibilities are intriguing: for example, what will a CRM application look like, when you can mine in real time all of your customer base, the interactions of your sales force with them, and combine the results with external data sets on industry and company news, geographic and demographic patterns, to determine which prospects are the most likely to buy in the next quarter? Enterprise marketing software is also likely to be profoundly impacted by big data.
On the downside, things may take a little while longer than one would like here as well. In enterprise software, it’s not just the quality of the software that counts. Business end users need to accept the new product, learn how to use it, and integrate it in their daily process and workflow. Big data applications will be no exception to this.
One thing big data vendors can do to speed up the adoption cycle is to focus on the simplicity of the end user experience. From that perspective, startups like Splunk and Datadog are showing the way in the IT data space: Splunk enables end users to search large amounts of data through a Google-like interface; Datadog enables users to monitor data through an experience that’s very reminiscent of the Facebook newsfeed.
On the consumer internet front, data-driven features should become commonplace on many websites. Internet startups led the way, in particular with their recommendation engines (Amazon, LinkedIn, Netflix, Facebook, iTunes). But so far those features have required having first-rate data scientists on board, and an ad hoc infrastructure. I would expect all of this to democratize considerably in the near future, as the infrastructure evolution mentioned above takes place. Retailers, financial services companies and health care providers will all use data-driven features to customize and personalize their users’ online experience, and accelerate their core business. As over time any company with a web presence will want to offer data-driven features, there is an interesting market opportunity for startups that could provide easy to use, out of the box tools to do this (“big data out of the box”).
Third wave: The emergence of “big data enabled” startups
The democratization of big data infrastructure tools will also open wide the opportunity for entrepreneurs, including those without a deep tech background, to dream up entire new businesses (and business models) based on data.
Just the way we were talking about “web enabled” businesses a few years ago, we’re likely to see more and more “big data enabled” businesses appearing. By that I mean companies that have the ability to process large amounts of data as their core DNA, and use it to deliver a product or service that could not exist otherwise.
Of course, there are already a number of startups that live and breathe data. I believe that Klout, for all the controversy around it, is a category-defining startup, and a great example of a company computing large amounts of data to come up with a unique product. BillGuard is a very interesting play that combines big data and crowdsourcing to deliver real value to consumers. Foursquare also comes to mind: a key insight of Dennis Crowley’s interview at SXSW a few days ago was how much he thinks of his company as a data play (gamification being “just an onboarding mechanism”).
This is only the beginning. As always, there are a number of tricky issues to deal with (privacy being one of them), but it’s going to be a lot of fun to see what ideas we all come up with. As an example, I’m fascinated by “big data enabled” startups that empower consumers, such as:
- Personal data companies: as the number of inputs of personal data increases (social network activity, personal health devices like Fitbit and Jawbone, etc.), I believe there are going to be exciting opportunities for startups that can aggregate and analyze one’s personal data, visualize it and compare it to peers in a simple and visually attractive way. Think of what Stephen Wolfram has been doing for years, but as a consumer friendly product available to all: self-quantification gone mainstream.
- “Consumer to business” (C2B) companies: Individual data capture will give people more power when it comes to obtaining customized treatment from businesses. Startups like Milesense capture your driving behavior through your iPhone so that you can obtain better insurance premiums if you’re a safer driver. Similarly, if you can capture your health and diet habits, and you are healthy, you should obtain better prices for health and life insurance. What else can the consumer obtain, once she is empowered with her own data?