Data-driven venture capital

I have been very intrigued by the recent emergence of “data-driven” firms aiming to use data to reinvent venture capital.

While they certainly review various data points and metrics before deciding to invest in a startup, as of today venture capital investors largely operate based on “pattern recognition” – the general idea being that, once you’ve heard thousands of pitches, sat on many boards and carefully studied industries for years, you become better than most at predicting who will make a strong founder/CEO, what business model will work and, eventually, which startup will end up being a home run.  The trouble is, the model doesn’t always work – far from it – and many VCs end up making the wrong bets, resulting in disappointing overall industry results.  Could VCs be just like the baseball scouts described in Moneyball, who think they can spot future superstars because they’ve seen so many of them before, but end up being beaten by a cold, objective, statistics-based approach?

Enter several firms trying to do things differently:

  • Google Ventures has created various data-driven algorithms that inform their investment decisions – see the team discussing the concept at last year’s Web 2.0 Summit here.
  • Correlation Ventures raised $165M earlier this year for its first fund, which was reportedly oversubscribed (a rarity for a new fund).  Correlation says it has built the “world’s largest, most comprehensive database of U.S. venture capital financings”, which covers “the vast majority of venture financings that took place over the past two decades, tracking everything from key financing terms, investors, boards of directors, management backgrounds, industry sector dynamics and outcomes”.  Based on this data, Correlation has developed predictive analytics models which it uses to guide its investment decisions – as a result, it can make decisions very quickly (in less than two weeks) and doesn’t require additional due diligence.  (A rough sketch of the general idea appears after this list.)
  • Just earlier this week, e.ventures (the relaunch of BV Capital) also emphasized its own data-driven approach to investment decisions.
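
None of these firms disclose how their models actually work, but to make the general idea concrete, here is a minimal sketch of a financing-outcomes classifier – the features, training data and library choice are all hypothetical:

```python
# A toy illustration of data-driven deal scoring: fit a classifier on
# features of historical financings, then score a new deal in seconds.
# Features and outcomes below are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical deals: [founder prior exits, round size ($M),
# number of co-investors]; 1 = eventual liquidity event, 0 = write-off.
X = np.array([
    [2, 5.0, 3],
    [0, 1.5, 1],
    [1, 8.0, 4],
    [0, 0.5, 1],
    [3, 12.0, 5],
    [0, 2.0, 2],
])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Score an incoming deal.
new_deal = np.array([[1, 3.0, 2]])
print(model.predict_proba(new_deal)[0, 1])  # estimated P(liquidity event)
```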

Since I’m a big fan of anything data-driven (decisions, products, companies), the concept resonates strongly with me.  Predictive analytics have been successfully used in various industries, from retail to insurance to consumer finance.  Other asset classes are highly data-driven – fundamental and technical analysis drive billions of dollars in trades; hedge fund quants spend their lives building complex models to price and trade securities; high-frequency trading bypasses human decision-making altogether and invests gigantic amounts of money based solely on data.  In this world where everything gets quantified, why should venture capital be an exception?

However, as much as I like the idea, I believe venture capital doesn’t lend itself very well to a model-heavy, quasi “black box” approach.  The creation of a reliable, systematic predictive model is a particularly challenging task when you consider the following obstacles:

  • A relatively sparse data set: while by definition there’s not much data about early-stage startups, you could argue that the amount is constantly increasing, as everything is moving online, and everything online can be measured.  You could also argue that, if you could access all historical data from all VC firms in the country and efficiently normalize it, you would end up with a lot of data.  Still, that amount of data would pale in comparison to what’s available to public market investors – Bloomberg processes up to 45 billion “ticks” (changes in the price of a security)… daily.
  • Limited intermediary feedback points: before getting to a final outcome (game lost or won), baseball is full of small binary outcomes (a player hits the ball or he doesn’t).  Similarly, in market finance, the eventual success of a strategy can typically be broken down into many different points with binary outcomes (you make money or you don’t).  In venture capital, before getting to a final outcome (a startup has a liquidity event), you get few of those intermediary, measurable points from which to build models – perhaps the occasional signal (the startup’s next round is an “up round” or a “down/flat round”), but certainly nothing compared to the above examples.
  • Extended time horizon: in baseball, the rules of the game do not change from game to game, or season to season.  In venture capital, the “game” can last for years, because investments are highly illiquid.  During that time, pretty much anything can change – regulatory framework, unforeseen disruptive forces in the industry, etc.

In addition, it would be interesting to see how startups react in the long run to investors who are interested in them mostly because they scored well on a model, rather than because the investor spent extended time getting to know them.  Unlike public stock markets, venture capital fundraising is a two-way dance, and startups often pick their investors as much as their investors pick them.

However, while I have my doubts about using data models as valid predictors of the overall success of an early stage startup, my guess is that there are still plenty of interesting insights to be gleaned from the data, and that forward-thinking VC firms could gain a competitive advantage by actively crunching it  – my sense is that very few firms have done so at this stage.

Interestingly, there are some good data sources and emerging technologies out there that could be leveraged as a first step, without engaging in a massive data-gathering or technology-development effort:

  • Public (and/or free) sources: Crunchbase is a great source of data.  There are many directions you could go with mining it – as an example, see what Opani (an early-stage NYC big data company) came up with here, or the minimal sketch that follows this list.  I bumped into Semgel, a web app that has taken a stab at instantly gathering and analyzing Crunchbase data.  The Crunchbase data could be augmented with data from marketplaces such as Factual.  See also this intriguing article about how pre-money valuations of startups (typically not information that’s disclosed) could possibly be mined from publicly available Delaware certificates of incorporation and similar documents in other states.
  • Private databases: there are a few interesting databases that collect and organize more complex information flows around private companies, such as CB Insights (which also offers a data-driven tracking tool called Mosaic).
  • Technologies: in addition to the various open-source big data tools, there are some technologies/companies that could be leveraged to mine VC industry data, including for example Quid, co-founded by the talented Sean Gourley – “understanding co-investment relationships and deriving investment strategies” is one of the challenges they address.
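
As a concrete first step on the Crunchbase front (see the first bullet above), here is a minimal sketch of pulling a company’s funding history through the API. The endpoint and field names follow my recollection of the v1 REST API – treat them as assumptions and verify against the current documentation:

```python
# Minimal Crunchbase mining sketch: fetch a company's funding rounds.
# Endpoint and field names are assumed from the v1 API -- verify them.
import requests  # third-party HTTP library

API_KEY = "your_api_key_here"  # hypothetical placeholder

def funding_rounds(permalink):
    url = "http://api.crunchbase.com/v/1/company/%s.js" % permalink
    resp = requests.get(url, params={"api_key": API_KEY})
    resp.raise_for_status()
    return resp.json().get("funding_rounds", [])

# Print round type, year and amount for one company.
for r in funding_rounds("dropbox"):
    print(r.get("round_code"), r.get("funded_year"), r.get("raised_amount"))
```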

If anyone is aware of other efforts around crunching data relevant to VCs, or of other ways VCs have used a heavily data-driven approach, I’d love to hear about it in the comments.

13 thoughts on “Data-driven venture capital”

  1. As you rightly point out, quantitative investing is really quite hard, even when trying to analyse equities with decades of financial reports available. The world is complex and always changing, with no guarantee that historical conditions will prevail. Furthermore, the data that you have is generally extremely sparse relative to the dimensionality of the problem, requires lots of massaging, and is full of wrinkles that must be ironed out to normalize it for comparison (retrospective corrections, stock splits and other corporate actions, etc.).

    The natural (and only) solution is to approach the problem with really strong priors in the form of a set of principled, theory-driven models of the fundamentals, backed by good engineering and thorough data management.

    Because the problem domain itself is so complex, and the data so sparse, the models themselves have to be simple, which means that most of the opportunities for innovation are to be found in the search for new and previously underexploited data sources. This is particularly true when looking at early-stage startups, as financial history is either absent or not particularly predictive.

    Perhaps fortunately, the current state of the art in data exploitation is really quite poor, meaning that many opportunities exist to improve on it.

    So, what opportunities can we identify?

    Well, organizations are composed of people. Different organizations have different personalities, and different cultures; sometimes the people in those organizations gel together and turn into a great and highly productive team, and sometimes they do not.

    If you were able to develop a really good understanding of how people work together in teams, and how different personality types, personal circumstances, technical skills and work environments come together, you could build what could be a pretty strong factor based on staff surveys, psychometric profiles and whatever other behavioral data you can lay your hands on.

    (You could also use the same models to build a secondary business offering personal and organizational coaching… 🙂 )

    The cost of obtaining this data would be quite high, unfortunately, but there are other factors that could be attractive based simply on the ease with which large quantities of data may be collected.

    For example, a statistical analysis of source code repositories and check-in histories might well yield insights into the ability of the organisation to respond to changing conditions, and to rapidly innovate.
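
    To show how cheap that raw data is to collect, here is a quick sketch that counts weekly commits in a local checkout – a real analysis would of course also look at authors, churn, review latency and so on:

    ```python
    # Count commits per week from a local git repository's history.
    # This is just the data-collection step of the analysis sketched above.
    import subprocess
    from collections import Counter
    from datetime import datetime

    def weekly_commit_counts(repo_path):
        # %at = author timestamp (unix epoch), one line per commit
        out = subprocess.check_output(
            ["git", "log", "--pretty=format:%at"], cwd=repo_path)
        weeks = Counter()
        for line in out.decode().splitlines():
            ts = datetime.utcfromtimestamp(int(line))
            weeks[ts.strftime("%Y-%W")] += 1
        return weeks

    for week, count in sorted(weekly_commit_counts(".").items()):
        print(week, count)
    ```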

    1. Very interesting points. Regarding collecting people data, it is interesting that other industries that depend heavily on recruiting the very best people use testing extensively – think the standard case studies in a McKinsey or Bain interview, the Caliper personality test that many hedge funds require their candidates to take, or the various assessments developed by SHL (which was recently acquired for $660 million) for Wall Street firms. Yet I’m not aware of any VC asking founders to take this type of test (although I’m sure it has been done). I wonder whether that’s due to people not believing in the effectiveness of tests as a valid predictor of success, or deeply ingrained industry practices (that’s just not the way things are done), or the fact that a VC requiring this type of test as part of its process would place itself at a disadvantage when competing for the hottest deals, or some other factor.

  2. I don’t know if this method is used by anyone in practice, but I came across this paper a while ago. It suggests applying Bayes nets from decision analysis to VC: http://web.ku.edu/~pshenoy/Papers/AOM02.pdf . Generally, decision analysis seems like a solid framework to me; I see no reason why it wouldn’t work in VC investing. Additionally, there is less of an inferential gap for people who are not machine learning experts to use these kinds of models – they basically just formalize causal influences of investment factors and put some probabilistic discipline around them, plus they can be implemented using free BN construction software like SamIam or GeNIe.
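
    To make the idea concrete, here is a toy version of such a network – two causal factors feeding a success node. The structure and probabilities are invented for illustration and are not taken from the paper:

    ```python
    # Tiny Bayesian network: Team quality -> Success <- Market size.
    # Inference by direct enumeration; all numbers are invented.

    P_TEAM = {True: 0.3, False: 0.7}      # P(strong founding team)
    P_MARKET = {True: 0.4, False: 0.6}    # P(large market)
    P_SUCCESS = {                          # P(success | team, market)
        (True, True): 0.25,
        (True, False): 0.08,
        (False, True): 0.05,
        (False, False): 0.01,
    }

    def p_success(team=None, market=None):
        """P(success | optional evidence on team and market)."""
        num = den = 0.0
        for t in (True, False):
            if team is not None and t != team:
                continue
            for m in (True, False):
                if market is not None and m != market:
                    continue
                w = P_TEAM[t] * P_MARKET[m]
                num += w * P_SUCCESS[(t, m)]
                den += w
        return num / den

    print(p_success())                       # prior: ~0.063
    print(p_success(team=True))              # after diligence on the team
    print(p_success(team=True, market=True))
    ```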

  3. Matt, I was delighted to see that you had discovered our app, Semgel, while researching the idea of data-driven venture capital.

    While you correctly identify that the sparsity of data makes it difficult to develop comprehensive predictive models for VC investment, the fact is that, even with a limited amount of data, we can indeed gain a ton of valuable insights about the market & funding landscape.

    This was essentially the motivation behind Semgel’s value proposition – guiding entrepreneurs and investors to make more informed investment decisions with the (albeit sparse) data that is available in the public domain.

    —-
    In response to a tweet, you suggested that I post a comment about our attempt to categorize the consumer-web space. So, here we go!

    While trying to gain a better understanding of the tech landscape, we realized that the consumer-web landscape in particular can be cleanly organized into a 2D matrix of horizontals (products/services, retailers/marketplaces/aggregators, content/commerce) and verticals (accommodation, transport… 10 in all).

    This is an original categorization scheme that Semgel has developed to organize the consumer-web space. We have currently categorized only a few thousand companies in Crunchbase – all of them funded. We plan to rapidly expand this to cover all startups in CB. It’s very much a work in progress.

    What makes this interesting is that entrepreneurs and investors can learn lessons from one segment in this matrix and apply them to another. They can understand the stage a segment is at and, more generally, how segments evolve and the time it takes for them to do so.

    We believe this can be useful in guiding entrepreneurs and investors to assess a given segment and understand where it is heading. Do take a look by visiting one of the segments (e.g. http://semgel.com/data/segment/sg11100-product-search).

    —-
    In this context, you might also want to take a look at a presentation by @joshyang, specifically this slide (http://www.slideshare.net/joshyang/ecommerce-landscape-2012/20), which tries to describe how segments tend to evolve over time.

    Harish,
    Founder, semgel.com

  4. VC Experts (www.vcexperts.com) maintains an online database containing the specific terms & conditions of venture financings, prices of preferred & common stock issuances, valuations, and thousands of other data points. Data comes from official filings like the Certificates of Incorporation, EPEN’s, LOEN’s, Annual Returns, etc.

    Excellent resource for any startup or investor.

    https://vcexperts.com/vat/companies

    1. Thanks for bringing it to my attention, Mike. The Private Company Analysis Tool looks very cool. Other than the various uses you describe on your website, are you aware of any VC using your data to create sophisticated decision models?

  5. First of all, thanks for mentioning CB Insights and Mosaic.

    Re: VCs, data and Moneyball… here are my stream-of-consciousness thoughts on this…

    We do see many of our VC customers using data in very interesting and systematic ways – analyzing industry trends including financing flows & exits, improving their deal flow management practices, understanding investment patterns of peer investors, etc. Generally, we’ve found bigger firms (by fund size) are the ones keen to use data in more sophisticated ways. In many instances, they’re taking data feeds from us and others and vacuuming them into their dealflow management systems to aid in day-to-day operations.

    As for the Moneyball idea, you raise some good points about why a quantitative approach to venture capital may be difficult. There are some data limitations, of course, in terms of sheer quantity vs. something like public markets, but a lot of the reasons that data is less used are structural or behavioral in nature. While there are many, a couple of these reasons are detailed below. Please note that these are generalizations. As mentioned above, we see and know many VCs taking a very progressive attitude towards the use of data, but they’re the exception, not the rule, and generally they’re the more successful VCs (if you look at VC fundraising as a proxy for success, given that returns data is so opaque).

    First, VC is a long game – unlike public markets or baseball, where you have an immediate scorecard to tell you how you’re doing, VC doesn’t have that. As a result, the desire/need for data is less clear, especially when you won’t know for many years if your investment thesis was correct or crap.

    Next, despite investing in the next big thing and often being exposed to cutting-edge innovation via their portfolios, the process VCs use to find deals is hopelessly antiquated. This myth of proprietary dealflow has been so ingrained into the VC ethos that using data in the Moneyball way is almost antithetical to how “VC is supposed to be done.” The secret, of course, is that outside of ~20 firms, most VCs don’t have real proprietary deal flow. But instead of using data, most would prefer to throw bodies (analysts) at the problem, sending them out to research and find hidden gems. Or they attend demo days of accelerator programs and the like, which offer them no preferential, proprietary access to dealflow.

    Again, this is for the majority of firms. There are some who are quite progressive. And of course, there are new models a la 500 Startups in the micro-VC category which talk about/appear to have a data orientation.

    But given the returns in the industry have been subpar and the decreasing allocations of LP money to the venture capital asset class overall, a rethink of process might be forced upon VCs as doing more of the same with the expectation of a different result defines insanity.

    Best,
    Anand

  6. Thanks Anand, well said. I don’t think that even the top 20 firms have much proprietary deal flow at this stage; they do get to see some deals first, but unless the entrepreneur has a specific reason to be loyal to them (e.g., the VC backed her previous company, or she’s an EIR at that firm, etc.), that only gives the VC a short lead over other top-20 firms (or top 10, or top 5) that will almost inevitably catch on to the opportunity. Of course, in theory, a few days’ lead is sometimes all you need as a VC if you’re able to recognize a great opportunity immediately and can move quickly. But reality is often more complex.

  7. For all the reasons stated, I think it’s hard to use data to accurately predict which startups will yield great exits years later. However, I do think it’s probably possible to use data to figure out much more quickly which companies to analyze. For instance, I wonder how many firms receive alerts when a company’s site or app is starting to break out in terms of traffic. I assume many firms use such tactics as a useful tool to fill the top of the deal flow funnel, especially with the proliferation of consumer startups globally and the likelihood of not hearing about them through the Silicon Valley grapevine until it’s too late.
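
    As a rough sketch of what such an alert might look like – the traffic series here is an invented placeholder; a real feed would come from a panel provider, app-store ranks or similar:

    ```python
    # Flag companies whose latest week-over-week traffic growth clears a
    # threshold. Numbers are made up; plug in a real traffic feed instead.
    WEEKLY_VISITS = {
        "acme-photos": [1200, 1300, 1250, 2600, 5400],
        "dullcorp":    [900, 910, 905, 920, 915],
    }

    GROWTH_THRESHOLD = 0.5  # alert on >50% week-over-week growth

    def breakouts(series_by_company, threshold=GROWTH_THRESHOLD):
        for name, series in series_by_company.items():
            prev, last = series[-2], series[-1]
            if prev > 0:
                growth = (last - prev) / float(prev)
                if growth > threshold:
                    yield name, growth

    for name, growth in breakouts(WEEKLY_VISITS):
        print("ALERT: %s grew %.0f%% week-over-week" % (name, growth * 100))
    ```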

    Also, check out Dataminr as a tool that could be useful for identifying breakout companies using social media signals. The problem is that it’s not tailored to startup investors. Probably most tools will have to be hacked together by VC firms themselves, as the market for VC tools isn’t very large.

    1. I love the idea of the “alert” system. It’s been (or is being) done in market finance (my current employer, Bloomberg, is pretty much best in class in terms of breaking financial news, which it captures through a combination of advanced technology and lots of boots on the ground around the world). Dataminr is indeed interesting from that perspective as well. Other interesting startups like Next Big Sound are doing this for music (to figure out the next big act before it breaks).
