Big Data 101 Presentation

A few weeks ago, I was invited to do a couple of guest lectures at NYU (as part of the excellent “Ready, Fire, Aim” entrepreneurship class that Lawrence Lenihan, now my partner at FirstMark, has been doing for a while there) and at The New School (as part of a Big Data course organized by Debra Anderson and Greta Knutzen).  Thought I’d share the slide deck I had prepared for those classes.  Very much a Big Data 101 class for a college-level audience that had had little or no exposure to the key concepts prior to the class.

Quantopian, Plaid and ZestFinance

Our February NYC Data Business Meetup was focused on the intersection of data and finance (both market and consumer finance).  Quantopian, Plaid and ZestFinance presented.

We also had a great panel presenting the customer perspective on Big Data (hype vs. reality), from a financial institution’s viewpoint, with the following speakers: Mike Simone (Global Head of CitiData Platform Engineering), Emile Werr (Head of Enterprise Data Architecture, NYSE Euronext) and Raj Patil (up until recently Data Innovation CTO at UBS, now an entrepreneur).  Unfortunately, due to standard policy at some of those institutions, we can’t publicly post the video of the panel.

The slides are here: 

Quantopian

Plaid

ZestFinance

Here are the videos, in order of appearance (we also had a great “customer panel

Bloomberg App Portal:

Quantopian:

Plaid:

ZestFinance:

Panel:

Joseph Turian, Sqrrl, Infochimps and MemSQL

The December NYC Data Business Meetup was focused on big data infrastructure companies, with the co-founders of Sqrrl, Infochimps and MemSQL presenting to a full house.  We started the evening with a presentation by prominent data scientist Joseph Turian.

The slides are here: Joseph Turian, Sqrrl, Infochimps and MemSQL.

Here are the videos:

Intro

 

Joseph Turian, “How to do AI in 2013”

 

Oren A. Falkowitz, Co-Founder & CEO, Sqrrl

 

Dhruv Bansal, Co-Founder & Chief Science Officer, Infochimps

 

Eric Frenkiel, Co-Founder & CEO, MemSQL

 

And here are a few pics (photo credit: Shivon Zilis):

 

Recorded Future, Lex Machina, DataMarket and numberFire

The November NYC Data Business Meetup was focused on “vertical-specific” applications of big data – startups leveraging the big data stack to offer new solutions to specific industries, such as finance and government (Recorded Future), the legal industry (Lex Machina), energy (DataMarket, although it offers data sets for other industries as well) and sports (numberFire).

The slides are here: Recorded Future, Lex Machina, DataMarket and numberFire.

Here are the videos:

Christopher Ahlberg, CEO, Recorded Future:

 

Josh Becker, CEO, Lex Machina:

 

Hjálmar Gíslason, CEO, DataMarket:

 

Nik Bonaddio, CEO, numberFire:

 

Panel discussion:

 

Some pics:

IA Ventures, Accel, Data Collective, Precog and CCS at the NYC Data Business Meetup

Here are the videos from the NYC Data Business Meetup that was held on October 23, 2012, in order of appearance:

Jeff Carr, COO, Precog

 

Max Yankelevich, co-founder, CrowdComputing Systems

 

Roger Ehrenberg, Founder and Managing Partner, IA Ventures; Ping Li, General Partner, Accel Partners; Matt Ocko, Co-Founder and Partner, Data Collective (from left to right):

 

A chart of the big data ecosystem, take 2

So here we are again.  My colleague Shivon and I made a first attempt at making sense of the rapidly evolving big data ecosystem back in June.  Based on some very helpful feedback from readers of this blog and others, a number of additional meetings with interesting startups and more in-depth research, we’ve come up with this second version.

Some thoughts:

  • It’s still a work in progress (and will presumably always be, that’s the nature of the beast)
  • It’s even more crowded than the first time around, which reflects the incredible vitality of the big data space
  • We’ve created some new subcategories such as NoSQL/NewSQL and analytics services (reflecting the reality that, for the time being, the last mile of data analysis is very much performed by humans)
  • We have the occasional company that appears in different categories (Infochimps or Autonomy for example)
  • We have learned more about companies that were already on the first version of the chart, and have positioned them differently.  For example, Metamarkets now falls in the “Cross Infrastructure/Analytics” category as they offer a stack that includes a data store (Druid), predictive analytics and visualization.  Another example is Collective[i] – they have built  an entire proprietary big data stack from the ground up, that includes infrastructure, analytics and applications – making the company a rare example of an “Application Service Provider”.

Our goal is to continue updating this chart from time to time, and perhaps make it evolve visually, as we’ve probably reached the limits of what we can reasonably fit on one slide.  It was suggested that we try to visually distinguish on-premise offerings from cloud-based solutions, which we may try to do.

To enlarge, click on the arrows at the bottom right of the chart.

Comments, thoughts, questions? Please add to the comments section.

10Gen, Mortar, Datadog & Rick Smolan at the NYC Data Meetup

Here are the videos and some pictures (scroll down) of the NYC Data Business Meetup held on September 25, 2012.

In order of appearance:

1) Rick Smolan told us about his fascinating new project, the “Human Face of Big Data” – see the NY Times coverage here: http://nyti.ms/TO5MDd.

 

2) Mortar (presenter: K Young, CEO). Mortar (www.mortardata.com) provides a platform-as-a-service for Hadoop.  They take care of all of the necessary infrastructure (via AWS) and allow any software engineer to run jobs on Hadoop using Apache Pig and Python without special training.

 

3)  Datadog (presenter: Alexis Le Quoc, co-founder). Datadog (www.datadoghq.com) is a service for IT, Operations and Development teams who write and run applications at scale, and want to turn the massive amounts of data produced by their apps, tools and services into actionable insight.  Datadog helps software developers and web ops understand their IT Data by putting it all in context.

 

4) We finished with a fireside chat with Dwight Merriman, CEO and co-founder, 10Gen. 10Gen (www.10gen.com) develops MongoDB, and offers production support, training, and consulting for the open source database. Dwight is one of the original authors of MongoDB. In 1995, Dwight co-founded DoubleClick (acquired by Google for $3.1 billion) and served as its CTO for ten years. Dwight was the architect of the DoubleClick ad serving infrastructure, DART, which serves tens of billions of ads per day. Dwight is co-founder, Chairman, and the original architect of Panther Express (now part of CDNetworks), a content distribution network (CDN) technology that serves hundreds of thousands of objects per second. Dwight is also a co-founder and investor in BusinessInsider.com and Gilt Groupe.

 

Continuuity, Sailthru & Visual Revenue at the NYC Data Meetup

Here are some videos, slides and pics from the most recent NYC Data Business Meetup.  The videos are unfortunately not of the greatest quality, but are good enough to watch.

Also, note to self: make sure that our audience of 200+ sits closer to the stage, so that the room doesn’t look tragically empty on camera (rookie mistake)!

In order of appearance:

1) Todd Papaioannou, CEO, Continuuity, a stealth big data startup based in Palo Alto, CA and backed by Andreessen Horowitz, Battery Ventures, Data Collective and a number of high-profile angels. Todd was previously Chief Cloud Architect for Yahoo.

2) Neil Capel, CEO, and Daniel Krasner, Chief Data Scientist, Sailthru, a New York based startup backed by RRE, AOL Ventures, Lerer Ventures, DFJ Gotham, Thrive Capital, Metamorphic, etc.  Sailthru provides fully automated, 1:1 email and onsite recommendations using a unique behavioral targeting platform. Sailthru helps brands cut through the clutter and build trust with their customers by recognizing and acting upon their individual interests. Sailthru’s technology creates individual user profiles associated with each person’s email address and online behavior. Sailthru’s algorithms gauge each individual user’s intent and match appropriate content and frequency of email communications such that every email is tailored to the unique user. That means they send as many permutations of an email as there are recipients. All simultaneously, all automated and all in real time.

Sailthru’s slides (PDF)

3) Dennis R. Mortensen, CEO, and Jeroen Janssens, Data Scientist, Visual Revenue, a New York based startup backed by Lerer Ventures, SV Angel, IA Ventures and Softbank. Visual Revenue increases front page performance for online media organizations.  Their platform provides editors with actionable, real-time recommendations on what content to place in what position right now and for how long. Visual Revenue’s predictive analytics technology allows media organizations to proactively manage the cost of exposing a piece of content on a front page, whilst maximizing the return they expect from promoting it.

Visual Revenue’s slides

4) Panel discussion and Q&A with the audience

 

Data-driven venture capital

I have been very intrigued by the recent emergence of “data driven” firms, aiming to use data to reinvent venture capital.

While they certainly review various data points and metrics before deciding to invest in a startup, as of today venture capital investors largely operate based on “pattern recognition” – the general idea being that, once you’ve heard thousands of pitches, sat on many boards and carefully studied industries for years, you become better than most at predicting who will make a strong founder/CEO, what business model will work and eventually, which startup will end up being a home run.  The trouble is, the model doesn’t always work, far from it, and many VCs end up making the wrong bets, resulting in disappointing overall industry results.  Could VCs be just like the baseball scouts described in Moneyball, who think they can spot future superstars because they’ve seen so many of them before, but end up being beaten by a cold, objective, statistics-based approach?

Enter several firms trying to do things differently:

  • Google Ventures has created various data-driven algorithms that inform their investment decisions – see the team discussing the concept at last year’s Web 2.0 Summit here.
  • Correlation Ventures raised $165M earlier this year for its first fund, which was reportedly oversubscribed (a rarity for a new fund).  Correlation says it has built the “world’s largest, most comprehensive database of U.S. venture capital financings”, which covers “the vast majority of venture financings that took place over the past two decades, tracking everything from key financing terms, investors, boards of directors, management backgrounds, industry sector dynamics and outcomes”.  Based on this data, Correlation has developed predictive analytics models which it uses to guide its investment decisions – as a result, it can make decisions very quickly (less than two weeks) and doesn’t require additional due diligence.
  • Earlier this week, E.ventures (which results from the relaunch of BV Capital) also emphasized its own data-driven approach to investment decisions.

Since I’m a big fan of anything data-driven (decisions, product, companies), the concept resonates strongly with me.  Predictive analytics have been successfully used in various industries, from retail to insurance to consumer finance.  Other asset classes are highly data driven – fundamental and technical analysis drive billions of dollars of trade; hedge fund quants spend their lives building complex models to price and trade securities; high-frequency trading bypasses human decision making altogether and invests gigantic amounts of money based solely on data.  In this world where everything gets quantified, why should venture capital be an exception?

However, as much as I like the idea, I believe venture capital doesn’t lend itself very well to a model-heavy, quasi “black box” approach.  The creation of a reliable, systematic predictive model is a particularly challenging task when you consider the following obstacles:

  • A relatively sparse data set: while by definition there’s not much data about early stage startups, you could argue that that amount is constantly increasing, as everything is moving online, and everything online can be measured.  You could also argue that, if you could have access to all historical data from all VC firms in the country, and efficiently normalize it, you would end up with a lot of data.  But still that amount of data would pale in comparison to what’s available to public market investors – Bloomberg processes up to 45 billion “ticks” (change in the price of a security)… daily.
  • Limited intermediary feedback points: Before getting to a final outcome (game lost or won), baseball is full of small binary outcomes (a player hits the ball or he doesn’t).  Similarly, in market finance, the eventual success of a strategy can typically be broken down into many different points with binary outcomes (you make money or you don’t).  In venture capital, before getting to a final outcome (a startup has a liquidity event), it’s unclear how many of those intermediary, measurable points you get that can enable you to build models – perhaps a few (the startup’s next round is an “up round” or a “down/flat round”) but certainly nothing compared to the above examples.
  • Extended time horizon: in baseball, the rules of the game do not change from game to game, or season to season.  In venture capital, the “game” can last for years, because investments are highly illiquid.  During that time, pretty much anything can change – regulatory framework, unforeseen disruptive forces in the industry, etc.
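To put a rough number on the sparse-data point, here is a small sketch of how uncertain an estimated "hit rate" is when you only have a handful of outcomes to learn from. The counts below are hypothetical and chosen purely for illustration, and the Wilson score interval is just one common way to express that uncertainty, not something any of the firms mentioned above is known to use.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial success rate."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return center - half_width, center + half_width

# Hypothetical counts, invented only to contrast sample sizes.
# Both cases have the same observed 10% success rate.
vc_low, vc_high = wilson_interval(successes=4, trials=40)
mkt_low, mkt_high = wilson_interval(successes=10_000, trials=100_000)

print(f"40 outcomes:      {vc_low:.3f} to {vc_high:.3f}")
print(f"100,000 outcomes: {mkt_low:.3f} to {mkt_high:.3f}")
```

With the same observed rate, the interval built from a few dozen outcomes is far wider than the one built from a hundred thousand, which is the practical meaning of "not enough feedback points" for a model.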

In addition, it would be interesting to see how startups react in the long run to investors who are interested in them mostly because they scored well on a model, as opposed to spending extended time getting to know them.  Unlike public stock markets, venture capital fundraising is a two-way dance, and startups often pick their investors as much as their investors pick them.

However, while I have my doubts about using data models as valid predictors of the overall success of an early stage startup, my guess is that there are still plenty of interesting insights to be gleaned from the data, and that forward-thinking VC firms could gain a competitive advantage by actively crunching it  – my sense is that very few firms have done so at this stage.

Interestingly, there are some good data sources and emerging technologies out there that could be leveraged as a first step, without engaging in a massive data gathering or technology development effort:

  • Public (and/or free) sources:  Crunchbase is a great source of data.  There are many directions you could go with mining it – as an example, see what Opani (an early stage NYC big data company) came up with here. I bumped into Semgel, a web app that has taken a stab at instantly gathering and analyzing Crunchbase data.  The Crunchbase data could be augmented with data from marketplaces such as Factual.  See also this intriguing article about how pre-money valuations of startups (typically not information that’s disclosed) could possibly be mined from publicly available Delaware certificates of incorporation and similar documents in other states.
  • Private databases: There are a few interesting databases that collect and organize more complex information flows around private companies, such as CB Insights (which also offers a data-driven tracking tool called Mosaic).
  • Technologies: In addition to the various open-source big data tools, there are some technologies/companies that could be leveraged to mine VC industry data, including for example Quid, co-founded by the talented Sean Gourley – “understanding co-investment relationships and deriving investment strategies” is one of the challenges they address.

If anyone is aware of other efforts around crunching data relevant to VCs, or other ways VCs have used a heavily data-driven approach, I’d love to hear about it in the comments.

A chart of the big data ecosystem

My colleague Shivon Zilis has been obsessed with the Terry Kawaja chart of the advertising ecosystem for a while, and a few weeks ago she came up with the great idea of creating a similar one for the big data ecosystem.  Initially, we were going to do this as an internal exercise to make sure we understood every part of the ecosystem, but we figured it would be fun to “open source” the project and get people’s thoughts and input.

So here is our first attempt.

A few things became apparent very quickly:

1) Many companies don’t fall neatly into a specific category

2) There are only so many companies we can fit on the chart: subcategories such as NoSQL or advertising applications, for example, would almost deserve their own chart.

3) The ecosystem is evolving so quickly that we’re going to need to update the chart often – companies evolve (e.g., Infochimps), large vendors make aggressive moves in the space (VMware with Serengeti and the Cetas acquisition).

What do you think? (click on the bottom right to expand)

“The business of data” panel with IA Ventures, Klout & Quid

Just got a copy of the video of a panel we did a few months ago at the Bloomberg Link Empowered Entrepreneur conference.  It features Roger Ehrenberg of IA Ventures, Joe Fernandez of Klout and Sean Gourley of Quid.  The speakers are terrific and it’s a solid introduction to the topic — since this panel was part of a broader entrepreneurial conference, it is slightly higher level than panel conversations you’d hear in specialized Big Data conferences.

The thriving data ecosystem in NYC

There’s a lot of interest in data-related businesses and products everywhere these days, but it’s been particularly fun to see things accelerating in New York (where I’m based).  Some purely anecdotal evidence: We had 50 very qualified data scientists show up at the recent hackathon we organized (as part of Big Data Week), despite the ungodly start time of 8am on a Saturday.   The Data Meetup I host monthly went from 0 to almost 1,300 members in barely 5 months.  General Assembly is starting a 10 week intensive program in data science.  Microsoft just announced it chose to locate in NYC its new research lab, which includes plenty of data science brainpower (including machine learning specialist John Langford and Jake Hofman, formerly of Yahoo Research).

NYC is becoming a real “hub” for data startups.  In fact, in my opinion data startups are becoming the next “layer” of the NYC tech scene — the way content and advertising startups (24/7, Doubleclick, Silicon Alley Reporter, etc.) were the foundational layer of “Silicon Alley” from 1995 to 2005, and the way social and e-commerce startups (Tumblr, Gilt, Foursquare, Etsy, Warby Parker, Rent the Runway, etc.) became the next building block that led to where we are today.

Due to their often intensely technical nature, data startups represent an interesting opportunity for NYC to develop more of a scientific and engineering-focused startup culture.

NYC has the key components of a thriving data startup ecosystem, including:

1) Customer demand: For those startups that sell to enterprises rather than consumers, NYC is where many of the key buyers are located – specifically, Wall Street and Madison Avenue, which have been among the most voracious and sophisticated users of data.  It’s no accident that some of the key conferences in the space, such as GigaOm’s Structure:Data or Strata, take place in NYC (or have an NYC event in addition to their CA event) – there’s no better place for emerging vendors to show off what they’ve built to potential purchasers.

2) A relevant talent pool: in addition to solid engineering talent, data-driven startups need data scientists, who come in various flavors: statisticians, mathematicians, machine learning experts, programmers, etc.  In part because there has been demand for this type of profile for a while in financial services, there’s a fair concentration of them in NYC, and I’m seeing an increasing number of them making the jump to startup land.  NYC has a number of prominent data scientists, including (but certainly not limited to) Drew Conway and Jake Porway (both of whom are co-founders of Datakind, f/k/a Data without Borders), Max Shron, Cathy O’Neil (who left D.E. Shaw for a startup, Intent Media), Gilad Lotan, etc.  And of course, we have our very own emerging media star (deservedly so) in the person of Hilary Mason, most recently profiled here.

3) A data community:  Whether it’s Data Drinks or meetups, there’s clearly appetite for data nerds to get together and geek out. Both the NYC Predictive Analytics meetup (organized by Alex Lin) and the NYC Machine Learning meetup (organized by Paul Dix and Max Khesin) have over 2,000 members, while the New York Open Statistical Programming Meetup has 1,700 members.

4) Investors with a deep interest in the space:  As far as I know, IA Ventures is the only VC firm in the country that has an exclusive focus on data as an investment thesis (Accel’s big data fund is a little different, in that it’s a dedicated pool out of a much larger fund).  Roger Ehrenberg and his talented team (Brad Gillespie, Ben Siscovick, Jesse Beyroutey) are having a tremendous impact on the data world in general, and in NYC in particular (about half of their portfolio is NYC-based). RTP Ventures is a new but very promising NYC investor in the space, with a focus on the infrastructure part of the big data world.  Many of the main NYC investors are also “data friendly”, and have interesting data plays in their portfolio, as part of a broader focus: Union Square Ventures, Betaworks (see John Borthwick’s “data is the new plastic”), RRE, Lerer Ventures, Thrive Capital, kbs+ Ventures, but I’m sure I’m forgetting a number of others.

5) Universities that are willing to get involved:  The key machine learning centers in the country may be Carnegie Mellon, MIT and Stanford, but Columbia is strong as well, and most importantly, there are some terrific professors who are both academically prominent and deeply involved in the NYC tech scene – in particular Chris Wiggins (in addition to being a prominent machine-learning expert, Chris is also the co-founder of HackNY and has mentored many of the data scientists currently employed in NYC startups) and Tony Jebara (who runs the Columbia Machine Learning Laboratory and has also founded and advised several startups including Sense Networks and Bookt).  NYU has some leading authorities in the data-intensive field of physical computing and Internet of Things: Tom Igoe and Dan O’Sullivan. Medium term, Cornell may be able to bring some additional academic expertise to NYC (for example, it is home to Thorsten Joachims, who is arguably one of the top SVM researchers).

6) A crop of promising data startups:

  • A growing number of NYC based startups offer data and predictive analytics solutions – starting perhaps with Opera Solutions, which very few people in the NYC tech scene had heard about until it raised a whopping $84 million in September 2011 from Silver Lake and Accel KKR (Opera Solutions employs some 150 data scientists, out of 400 employees).  In addition, NYC startups have been building all sorts of interesting data and analytics products for social media (Bitly, SocialFlow, Kno.des), news (Visual Revenue), finance (Dataminr), music (NextBigSound, which is moving to NYC), sports (Numberfire, and our own Bloomberg Sports) and of course advertising and marketing (Sailthru, Collective[i], Custora, PlaceIQ, YieldBot, Mediamath, m6d, 33across, Clickable, Buddy Media, etc.).
  • While we’re nowhere near Silicon Valley on this front, it’s great to see more big data infrastructure companies in NYC – some like 1010Data largely predate the whole big data craze; others have been appearing more recently, including FluidInfo, CrowdControl, Mortar Data (which is moving to NYC), Datadog, and of course 10Gen, whose MongoDB NoSQL database is quickly becoming a must-have for a number of data-driven companies.
  • Finally, several exciting NYC startups are focused on the application of data to create disruptive products in various industries, such as education (Knewton) or consumer finance (Billguard, Bundle).
  • The fact that NYC recently saw a couple of acquisitions of data startups – Chris Dixon’s Hunch and Jordan Cooper’s Hyperpublic – doesn’t hurt either.

7) A data-centric business culture: perhaps it is because some of the key historical entrepreneurial successes in NYC were data companies (Bloomberg LP, Nielsen); or perhaps it is a reflection of the demand of East Coast investors who arguably tend to be very focused on metrics and business models (as opposed to pure vision)… but somehow, as far as I can tell, there’s always been a real culture around data and analytics in NYC.  Now increasingly, I hear CEOs of NYC startups present their companies as data companies, even those you wouldn’t necessarily suspect (recent examples include Dennis Crowley of Foursquare and Yaron Galai of Outbrain).  In addition, NYC startups have been quick to build data science teams, including many that don’t explicitly position “data” as a key part of their value proposition: Etsy, Gilt, The Ladders, GetGlue, Foursquare, Tumblr all have data scientists on board.

All of this is just a start, and I’m excited to see how it all progresses in the next few months and years.

The three waves of opportunities in big data

As the interest in all things big data continues to increase, I’ve had a few chats recently with executives and entrepreneurs looking to learn more about the space, where I was asked about the trends and opportunities I see.  Wide topic obviously, but I figured I’d jot down a few notes about what I’ve been hearing, reading and thinking.

I see opportunities in the data space unfolding in several “waves.”  Of course, reality often resists attempts at this type of categorization, and it is unlikely those waves will happen in a neatly organized sequential order; elements of each wave already exist, and it’s possible all of this will happen more or less at the same time.  However, I still find it helpful to have this type of framework in mind to understand an industry that is rapidly changing.

First wave: Big data infrastructure

Right now, the whole “big data” discussion is very much about core technology. Look up the agendas for big data conferences like Strata or Structure:Data and you’ll see – it’s all about software and data science, fascinating stuff but very technical and generally hard to understand for anyone who’s not deeply versed in those topics.  Core big data technologies may have originated from consumer internet companies, but at this stage there’s not much that feels “consumery” about big data.

The reason for this is that we’re still early in building the big data infrastructure, and there’s a lot to figure out, before much else can happen.  If the fundamental premise of big data – that all current solutions break past a certain volume (or velocity or variety) of data – holds true, then a whole new ecosystem needs to be reinvented.  We’ve made a lot of progress in the last few years, but there are still a lot of nuts to crack, for example: How do you process big data in real time?  How do you clean up large data sets at scale? How do you transfer large volumes of data to the cloud and process it there? How do you simplify big data tools to make them approachable by a larger number of software engineers and business users?

As a result, much of the innovation has been happening at the infrastructure level.  Note that I mean “infrastructure” in the broadest sense – basically all the pieces of “plumbing” necessary to process big data and derive insights from it.  That includes infrastructure per se (for example, the Clouderas and Hortonworks of the world, the various NoSQL companies, etc.), but also analytics (Platfora, Continuuity, Datasift, etc.), data marketplaces (Factual, DataMarket, etc.), crowdsourcing players (Kaggle, CrowdControl, etc.) and even devices (sensors, personal data capture devices).

This is a time of tremendous opportunities for new entrants.  Large technology vendors are going to struggle with big data, in part because the underlying technologies are very different, and in part because they’ve been making a lot of money so far selling expensive solutions to process comparatively smaller data sets – some of the new entrants claim to be up to an order of magnitude cheaper than the Oracles of the world.  Large companies have made some interesting moves (Oracle partnering with Cloudera, Microsoft announcing support for Hadoop) but presumably, they will delay the inevitable for the most part, and this will lead to plenty of attractive acquisition opportunities for startups and their investors over the next few years.

Equally, it is also a time of confusion for anyone trying to figure out who the real success stories will be:

  • There’s a lot of noise, and this is only going to accelerate as VC money continues to pour into the industry. Also, the fact that older, larger companies seem to be racing to rebrand as big data companies doesn’t help.
  • There’s a fair number of “science projects” out there – companies that, at least for now, seem to be focused on solving an engineering issue but haven’t quite thought through their commercial applicability.  At our recent NYC Data Business Meetup, Kirill Sheynkman of RTP Ventures made a powerful case that big data for big data’s sake does not a company make (“Big data… so what?”).
  • It is going to take a while for winners to emerge – unlike consumer internet startups that can experience hockey stick growth from inception, software startups go through generally slower adoption cycles (consumerization of IT notwithstanding).  Also, the abundantly documented (but presumably temporary) shortage of Hadoop engineers and data scientists may somewhat slow down the widespread adoption of those technologies.
  • The surge in interest about all things big data will inevitably lead to some level of disillusionment (I assume Gartner has a nice hype cycle chart describing this), as projects turn out to be harder and more time consuming than expected, and sometimes underwhelm their sponsors.  Startups will have to struggle through that phase, which may slow things down as well.

Sooner or later, of course, winners will emerge, and what seem to us like daunting technical challenges will become something that any qualified software engineer will be able to handle, equipped with reasonably simple and cheap tools.   There’s always a slight irony to underlying technologies:  their ultimate sign of success is that at some point they become a given, a starting point, a simple enabler.  In a recent talk organized by the NYC Media Lab and held at Bloomberg, Hilary Mason mentioned that the future of data visualization is “boring”, meaning that it will eventually become commoditized and a simple tool.  I believe this will probably be true eventually of the entire big data technology stack.

Second wave: “Big data enabled” applications and features

As core infrastructure issues are gradually being resolved, the next logical step is to focus on expanding the benefits of big data to a broader, non-technical audience within the enterprise, and to more consumers online.

Within the enterprise, we should see a lot of innovation around business applications.  Enterprise software has always been to a large extent about enabling business end users to access and manipulate large amounts of data.  “Big data enabled” enterprise applications will take this to the next level, offering business users unprecedented data mining and analysis opportunities, using larger volumes of internal data, in real time or close, and sometimes augmenting it with external data sets available through data marketplaces.  This will happen across many different enterprise functions (finance, sales, marketing, HR, etc.) and across industries, from retail to healthcare to financial services.

The possibilities are intriguing: for example, what will a CRM application look like, when you can mine in real time all of your customer base, the interactions of your sales force with them, and combine the results with external data sets on industry and company news, geographic and demographic patterns, to determine which prospects are the most likely to buy in the next quarter? Enterprise marketing software is also likely to be profoundly impacted by big data.

On the downside, things may take a little while longer than one would like here as well.   In enterprise software, it’s not just the quality of the software that counts.  Business end users need to accept the new product, learn how to use it, and integrate it in their daily process and workflow.  Big data applications will be no exception to this.

One thing big data vendors can do to speed up the adoption cycle is to focus on the simplicity of the end user experience.  From that perspective, startups like Splunk and Datadog are showing the way in the IT data space: Splunk enables end users to search large amounts of data through a Google-like interface; Datadog enables users to monitor data through an experience that’s very reminiscent of the Facebook newsfeed.

On the consumer internet front, data-driven features should become commonplace on many websites.  Internet startups led the way, in particular with their recommendation engines (Amazon, LinkedIn, Netflix, Facebook, iTunes).  But so far those features have required having first-rate data scientists on board, and an ad hoc infrastructure.  I would expect all of this to democratize considerably in the near future, as the infrastructure evolution mentioned above takes place.  Retailers, financial services companies, and health care providers will all use data-driven features to customize and personalize their users’ online experience, and accelerate their core business.  As over time any company with a web presence will want to offer data-driven features, there is an interesting market opportunity for startups that could provide easy to use, out of the box tools to do this (“big data out of the box”).

Third wave:  The emergence of “big data enabled” startups

The democratization of big data infrastructure tools will also open wide the opportunity for entrepreneurs, including those without a deep tech background, to dream up entire new businesses (and business models) based on data.

Just the way we were talking about “web enabled” businesses a few years ago, we’re likely to see more and more “big data enabled” businesses appearing.  By that I mean companies that have the ability to process large amounts of data as their core DNA, and use it to deliver a product or service that could not exist otherwise.

Of course, there are already a number of startups that live and breathe data. I believe that Klout, for all the controversy around it, is a category-defining startup, and a great example of a company computing large amounts of data to come up with a unique product.  Billguard is a very interesting play that combines big data and crowdsourcing to deliver real value to consumers. Foursquare also comes to mind: a key insight of Dennis Crowley’s interview at SxSW a few days ago was how much he thinks of his company as a data play (gamification being “just an onboarding mechanism”).

This is only the beginning.  As always, there are a number of tricky issues to deal with (privacy being one of them), but it’s going to be a lot of fun to see what ideas we all come up with.  As an example, I’m fascinated by “big data enabled” startups that empower consumers, such as:

  • Personal data companies:  as the number of inputs of personal data increases (social network activity, personal health devices like Fitbit and Jawbone, etc.), I believe there are going to be exciting opportunities for startups that can aggregate and analyze one’s personal data, visualize it and compare it to peers in a simple and visually attractive way.  Think of what Stephen Wolfram has been doing for years, but as a consumer friendly product available to all: self-quantification gone mainstream.
  • “Consumer to business” (C2B) companies: Individual data capture will give people more power when it comes to obtaining customized treatment from businesses.  Startups like Milesense capture your driving behavior through your iPhone so that you can obtain better insurance premiums if you’re a safer driver.  Similarly, if you can capture your health and diet habits, and you are healthy, you should obtain better prices for health and life insurance. What else can the consumer obtain, once she is empowered with her own data?