The field of bioinformatics is having its “big bang” moment. Of course, bioinformatics is not a new discipline: it has seen various waves of innovation since the 1970s and 1980s, with its fair share of both exciting moments and disappointments (particularly in terms of linking DNA analysis to clinical outcomes). But something special is happening in the industry right now, accelerated by several factors:
Thomson Reuters CTO James Powell runs a great series of podcasts where he interviews people in the technology world about topics of relevance to his organization. I was fortunate to be invited to speak with James about the Internet of Things and Big Data, and it was a lot of fun. Below is the podcast, uploaded on SoundCloud. Thanks to James Powell and Dan Cost for the opportunity.
Some updates on the event/community front:
1) A little while ago, I changed the name of the data event I’ve been organizing from “NYC Data Business Meetup” to “Data Driven NYC”. I originally started the event mostly as an experiment, and didn’t give much thought to branding (so yeah, that was a terrible name). The event has now grown quite a bit (over 3,700 members as I write this), so it was time for a better name; also at this stage, it feels more like a community than “just” a meetup, so I wanted a name that reflected this reality.
2) Back in June, I launched a new community called “Hardwired NYC”. It covers startups, technologies and products at the intersection of the physical and digital worlds, including topics like 3D printing, Internet of Things, wearable computing, etc. I developed a strong interest in those areas through my involvement in the Big Data world – the Internet of Things, in particular, is deeply intertwined with Big Data (the proliferation of sensors has been contributing to the Big Data “problem”; equally the Internet of Things will be highly dependent on Big Data technologies if it is to deliver on its promise).
3) As Hardwired NYC is taking off fast (more than 700 members after just two events), I figured that both events/communities should have their own website with full video libraries, so that people who don’t live in New York can also access the content. So, with the great help of my FirstMark colleague Dan Kozikowski, I’m launching this week www.datadrivennyc.com and www.hardwirednyc.com. Both sites have a “Watch” section where, from now on, I will post pictures and videos of events (as opposed to this blog).
A few weeks ago, I was invited to do a couple of guest lectures at NYU (as part of the excellent “Ready, Fire, Aim” entrepreneurship class that Lawrence Lenihan, now my partner at FirstMark, has been teaching there for a while) and at The New School (as part of a Big Data course organized by Debra Anderson and Greta Knutzen). Thought I’d share the slide deck I prepared for those classes. It’s very much a Big Data 101 for a college-level audience that had had little or no exposure to the key concepts before the class.
Our February NYC Data Business Meetup was focused on the intersection of data and finance (both market and consumer finance). Quantopian, Plaid and ZestFinance presented.
We also had a great panel presenting the customer perspective on Big Data (hype vs. reality), from a financial institution’s viewpoint, with the following speakers: Mike Simone (Global Head of CitiData Platform Engineering), Emile Werr (Head of Enterprise Data Architecture, NYSE Euronext) and Raj Patil (until recently Data Innovation CTO at UBS, now an entrepreneur). Unfortunately, due to standard policy at some of those institutions, we can’t publicly post the video of the panel.
Here are the videos, in order of appearance:
Bloomberg App Portal:
The December NYC Data Business Meetup was focused on big data infrastructure companies, with the co-founders of Sqrrl, Infochimps and MemSQL presenting to a full house. We started the evening with a presentation by prominent data scientist Joseph Turian.
Here are the videos:
Joseph Turian, “How to do AI in 2013”
Oren A. Falkowitz, Co-Founder & CEO, Sqrrl
Dhruv Bansal, Co-Founder & Chief Science Officer, Infochimps
Eric Frenkiel, Co-Founder & CEO, MemSQL
And here are a few pics (photo credit: Shivon Zilis):
The November NYC Data Business Meetup was focused on “vertical-specific” applications of big data – startups leveraging the big data stack to offer new solutions to specific industries, such as finance and government (Recorded Future), the legal industry (Lex Machina), energy (DataMarket, although it offers data sets for other industries as well) and sports (numberFire).
Here are the videos:
Christopher Ahlberg, CEO, Recorded Future:
Josh Becker, CEO, Lex Machina:
Hjálmar Gíslason, CEO, DataMarket:
Nik Bonaddio, CEO, numberFire:
Here are the videos from the NYC Data Business Meetup that was held on October 23, 2012, in order of appearance:
Jeff Carr, COO, Precog
Max Yankelevich, co-founder, CrowdComputing Systems
Roger Ehrenberg, Founder and Managing Partner, IA Ventures; Ping Li, General Partner, Accel Partners; Matt Ocko, Co-Founder and Partner, Data Collective (from left to right):
So here we are again. My colleague Shivon and I had made a first attempt at making sense of the rapidly evolving big data ecosystem back in June. Based on some very helpful feedback from readers of this blog and others, a number of additional meetings with interesting startups and more in-depth research, we’ve come up with this second version.
- It’s still a work in progress (and presumably always will be; that’s the nature of the beast)
- It’s even more crowded than the first time around, which reflects the incredible vitality of the big data space
- We’ve created some new subcategories such as NoSQL/NewSQL and analytics services (reflecting the reality that, for the time being, the last mile of data analysis is very much performed by humans)
- We have the occasional company that appears in different categories (Infochimps or Autonomy for example)
- We have learned more about companies that were already on the first version of the chart, and have positioned them differently. For example, Metamarkets now falls in the “Cross Infrastructure/Analytics” category as they offer a stack that includes a data store (Druid), predictive analytics and visualization. Another example is Collective[i] – they have built an entire proprietary big data stack from the ground up, that includes infrastructure, analytics and applications – making the company a rare example of an “Application Service Provider”.
Our goal is to continue updating this chart from time to time, and perhaps make it evolve visually, as we’ve probably reached the limits of what we can reasonably fit on one slide. It was suggested that we try to visually distinguish on-premise offerings vs. cloud-based solutions, which we may try to do.
To enlarge, click on the arrows at the bottom right of the chart.
Comments, thoughts, questions? Please add to the comments section.
Here are the videos and some pictures (scroll down) of the NYC Data Business Meetup held on September 25, 2012.
In order of appearance:
1) Rick Smolan told us about his fascinating new project, the “Human Face of Big Data” – see the NY Times coverage here: http://nyti.ms/TO5MDd.
2) Mortar (presenter: K Young, CEO). Mortar (www.mortardata.com) provides a platform-as-a-service for Hadoop. They take care of all of the necessary infrastructure (via AWS) and allow any software engineer to run jobs on Hadoop using Apache Pig and Python without special training.
3) Datadog (presenter: Alexis Le Quoc, co-founder). Datadog (www.datadoghq.com) is a service for IT, Operations and Development teams who write and run applications at scale, and want to turn the massive amounts of data produced by their apps, tools and services into actionable insight. Datadog helps software developers and web ops understand their IT Data by putting it all in context.
4) We finished with a fireside chat with Dwight Merriman, CEO and co-founder, 10Gen. 10Gen (www.10gen.com) develops MongoDB, and offers production support, training, and consulting for the open source database. Dwight is one of the original authors of MongoDB. In 1995, Dwight co-founded DoubleClick (acquired by Google for $3.1 billion) and served as its CTO for ten years. Dwight was the architect of the DoubleClick ad serving infrastructure, DART, which serves tens of billions of ads per day. Dwight is co-founder, Chairman, and the original architect of Panther Express (now part of CDNetworks), a content distribution network (CDN) technology that serves hundreds of thousands of objects per second. Dwight is also a co-founder and investor in BusinessInsider.com and Gilt Groupe.
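Several of the presentations above revolve around Hadoop-style processing (Mortar’s Pig-and-Python jobs in particular). For readers new to the model, here is a minimal local simulation of the map and reduce phases in plain Python – purely illustrative, and not Mortar’s actual API:

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, as a Hadoop mapper would for a word count
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Sum the counts per key, as a Hadoop reducer would
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big ideas", "data driven nyc"]
print(reduce_phase(map_phase(docs)))
```

The point of a platform like Mortar is that you write only logic of this kind (in Pig and Python), while the provisioning and orchestration of the actual Hadoop cluster is handled for you.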
Here are some videos, slides and pics from the most recent NYC Data Business Meetup. The videos are unfortunately not of the greatest quality, but are good enough to watch.
Also, note to self: make sure that our audience of 200+ sits closer to the stage, so that the room doesn’t look tragically empty on camera (rookie mistake)!
In order of appearance:
1) Todd Papaioannou, CEO, Continuuity, a stealth big data startup, based in Palo Alto, CA and backed by Andreessen Horowitz, Battery Ventures, Data Collective and a number of high-profile angels. Todd was previously Chief Cloud Architect for Yahoo.
2) Neil Capel, CEO, and Daniel Krasner, Chief Data Scientist, Sailthru, a New York based startup backed by RRE, AOL Ventures, Lerer Ventures, DFJ Gotham, Thrive Capital, Metamorphic, etc. Sailthru provides fully automated, 1:1 email and onsite recommendations using a unique behavioral targeting platform. Sailthru helps brands cut through the clutter and build trust with their customers by recognizing and acting upon their individual interests. Sailthru’s technology creates individual user profiles associated with each person’s email address and online behavior. Sailthru’s algorithms gauge each individual user’s intent and match appropriate content and frequency of email communications such that every email is tailored to the unique user. That means they send as many permutations of an email as there are recipients. All simultaneously, all automated and all in real time.
3) Dennis R. Mortensen, CEO, and Jeroen Janssens, Data Scientist, Visual Revenue, a New York based startup backed by Lerer Ventures, SV Angel, IA Ventures and Softbank. Visual Revenue increases front page performance for online media organizations. Their platform provides Editors with actionable, real-time recommendations on what content to place in what position right now and for how long. Visual Revenue’s predictive analytics technology allows media organizations to proactively manage the cost of exposing a piece of content on a front page, whilst maximizing the return they expect from promoting it.
4) Panel discussion and Q&A with the audience
I have been very intrigued by the recent emergence of “data driven” firms, aiming to use data to reinvent venture capital.
While they certainly review various data points and metrics before deciding to invest in a startup, as of today venture capital investors largely operate based on “pattern recognition” – the general idea being that, once you’ve heard thousands of pitches, sat on many boards and carefully studied industries for years, you become better than most at predicting who will make a strong founder/CEO, what business model will work and eventually, which startup will end up being a home run. The trouble is, the model doesn’t always work, far from it, and many VCs end up making the wrong bets, resulting in disappointing overall industry results. Could VCs be just like the baseball scouts described in Moneyball, who think they can spot future superstars because they’ve seen so many of them before, but end up being beaten by a cold, objective, statistics-based approach?
Enter several firms trying to do things differently:
- Google Ventures has created various data-driven algorithms that inform their investment decisions – see the team discussing the concept at last year’s Web 2.0 Summit here.
- Correlation Ventures raised $165M earlier this year for its first fund, which was reportedly oversubscribed (a rarity for a new fund). Correlation says it has built the “world’s largest, most comprehensive database of U.S. venture capital financings”, which covers “the vast majority of venture financings that took place over the past two decades, tracking everything from key financing terms, investors, boards of directors, management backgrounds, industry sector dynamics and outcomes”. Based on this data, Correlation has developed predictive analytics models which it uses to guide its investment decisions – as a result, it can make decisions very quickly (less than two weeks) and doesn’t require additional due diligence.
- Just earlier this week, E.ventures (which resulted from the relaunch of BV Capital) also emphasized its own data-driven approach to investment decisions.
Since I’m a big fan of anything data-driven (decisions, product, companies), the concept resonates strongly with me. Predictive analytics have been successfully used in various industries, from retail to insurance to consumer finance. Other asset classes are highly data driven – fundamental and technical analysis drive billions of dollars of trade; hedge fund quants spend their lives building complex models to price and trade securities; high-frequency trading bypasses human decision making altogether and invests gigantic amounts of money based solely on data. In this world where everything gets quantified, why should venture capital be an exception?
However, as much as I like the idea, I believe venture capital doesn’t lend itself very well to a model-heavy, quasi “black box” approach. The creation of a reliable, systematic predictive model is a particularly challenging task when you consider the following obstacles:
- A relatively sparse data set: while by definition there’s not much data about early stage startups, you could argue that that amount is constantly increasing, as everything is moving online, and everything online can be measured. You could also argue that, if you could have access to all historical data from all VC firms in the country, and efficiently normalize it, you would end up with a lot of data. But still that amount of data would pale in comparison to what’s available to public market investors – Bloomberg processes up to 45 billion “ticks” (changes in the price of a security)… daily.
- Limited intermediary feedback points: Before getting to a final outcome (game lost or won), baseball is full of small binary outcomes (a player hits the ball or he doesn’t). Similarly, in market finance, the eventual success of a strategy can typically be broken down into many different points with binary outcomes (you make money or you don’t). In venture capital, before getting to a final outcome (a startup has a liquidity event), it’s unclear how many of those intermediary, measurable points you get to build models from – perhaps a few (the startup’s next round is an “up round” or a “down/flat round”), but certainly nothing compared to the above examples.
- Extended time horizon: in baseball, the rules of the game do not change from game to game, or season to season. In venture capital, the “game” can last for years, because investments are highly illiquid. During that time, pretty much anything can change – regulatory framework, unforeseen disruptive forces in the industry, etc.
In addition, it would be interesting to see how startups react in the long run to investors who are interested in them mostly because they scored well on a model, as opposed to spending extended time getting to know them. Unlike public stock markets, venture capital fundraising is a two-way dance, and startups often pick their investors as much as their investors pick them.
However, while I have my doubts about using data models as valid predictors of the overall success of an early stage startup, my guess is that there are still plenty of interesting insights to be gleaned from the data, and that forward-thinking VC firms could gain a competitive advantage by actively crunching it – my sense is that very few firms have done so at this stage.
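To make the idea concrete, here is a toy sketch of the kind of scoring a predictive approach implies – a simple logistic regression trained by gradient descent. The data and features here are entirely synthetic and made up for illustration; this is not any firm’s actual model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.1, epochs=2000):
    # Plain-Python logistic regression fit via stochastic gradient descent
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Made-up features per startup: [founders' prior exits, years to first revenue]
X = [[0, 2.0], [1, 0.5], [2, 0.3], [0, 1.5], [1, 1.0], [2, 0.6]]
y = [0, 1, 1, 0, 1, 1]  # hypothetical "good outcome" labels
w, b = train(X, y)

def score(x):
    # Probability-like score between 0 and 1 for a new startup
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

The obstacles discussed above (sparse data, few intermediary feedback points, long horizons) all show up here: with so few labeled outcomes, a model like this overfits whatever features happen to correlate in the sample.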
Interestingly, there are some good data sources and emerging technologies out there that could be leveraged as a first step, without engaging in a massive data gathering or technology development effort:
- Public (and/or free) sources: Crunchbase is a great source of data. There are many directions you could go with mining it – as an example, see what Opani (an early stage NYC big data company) came up with here. I bumped into Semgel, a web app that has taken a stab at instantly gathering and analyzing Crunchbase data. The Crunchbase data could be augmented with data from marketplaces such as Factual. See also this intriguing article about how pre-money valuations of startups (typically not information that’s disclosed) could possibly be mined from publicly available Delaware certificates of incorporation and similar documents in other states.
- Private Databases: There are a few interesting databases that collect and organize more complex information flows around private companies, such as CB Insights (which also offers a data-driven tracking tool called Mosaic).
- Technologies: In addition to the various open-source big data tools, there are some technologies/companies that could be leveraged to mine VC industry data, including for example Quid, co-founded by the talented Sean Gourley – “understanding co-investment relationships and deriving investment strategies” is one of the challenges they address.
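As a sketch of the kind of first-pass crunching these sources enable, here is how one might summarize funding rounds from a Crunchbase-style export. The rows and field names below are hypothetical, not the actual Crunchbase schema:

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows, shaped like a CSV export of funding rounds
rounds = [
    {"company": "A", "round": "Series A", "amount_usd": 5000000, "year": 2011},
    {"company": "B", "round": "Series A", "amount_usd": 8000000, "year": 2012},
    {"company": "C", "round": "Seed", "amount_usd": 1000000, "year": 2012},
    {"company": "D", "round": "Seed", "amount_usd": 1500000, "year": 2012},
]

def median_by_round(rows):
    # Group round sizes by round type, then take the median of each group
    by_round = defaultdict(list)
    for r in rows:
        by_round[r["round"]].append(r["amount_usd"])
    return {k: median(v) for k, v in by_round.items()}

print(median_by_round(rounds))
```

Even this trivial aggregation (median round size by stage) is the sort of benchmark a firm could compute today from public data, well short of a full predictive model.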
If anyone is aware of other efforts around crunching data relevant to VCs, or of other ways VCs have used a heavily data-driven approach, I’d love to hear about it in the comments.
My colleague Shivon Zilis has been obsessed with the Terry Kawaja chart of the advertising ecosystem for a while, and a few weeks ago she came up with the great idea of creating a similar one for the big data ecosystem. Initially, we were going to do this as an internal exercise to make sure we understood every part of the ecosystem, but we figured it would be fun to “open source” the project and get people’s thoughts and input.
So here is our first attempt.
A few things became apparent very quickly:
1) Many companies don’t fall neatly into a specific category
2) There are only so many companies we can fit on the chart — subcategories such as NoSQL or advertising applications, for example, would almost deserve their own chart.
3) The ecosystem is evolving so quickly that we’re going to need to update the chart often – companies evolve (e.g., Infochimps), and large vendors make aggressive moves in the space (e.g., VMware with Serengeti and the Cetas acquisition)
What do you think? (click on the bottom right to expand)