“In Conversation” Series

In conversation with Richard Craib, Founder, Numerai

I’ve been interested in the intersection of AI and crypto for a while (see AI & Blockchain: An introduction), and Numerai is one of the most exciting companies I came across in that world. Numerai is a new kind of crowdsourced quant hedge fund, which provides data for free and enables any data scientist around the planet to contribute models they believe will beat the stock market. Numerai offers its own token, called Numeraire, to incentivize participants.

As it turns out, this model delivers exciting results, and Numerai announced a few months ago that it had outperformed market neutral hedge funds by 29%.

It was a real pleasure welcoming Richard Craib, founder of Numerai, to Data Driven NYC to talk about the very exciting work Numerai has been doing.

Below is the video and full transcript.

In Conversation with Mark Grover, CEO, Stemma

As the volume of data in the enterprise continues to explode, with ever larger amounts stored in data warehouses and data lakes, the problem of data discovery has become an increasingly painful one. How do data analysts, data scientists and business people find not just data, but the right data for the problem they need to solve? How do they know how it was produced, how recently it was updated and whether that’s the right dataset they need to use? In addition, from an organization’s perspective, there’s a question of data governance – how to manage access in a way that preserves data security and privacy, and ensures compliance with data protection regulations (GDPR, CCPA, etc.).

Data catalogs have been a powerful response to those problems, and that category has seen renewed activity in the last couple of years with a whole new group of startup entrants.

At our most recent Data Driven NYC, we got a chance to chat with Mark Grover, co-founder and CEO of Stemma and the co-creator of Amundsen, the leading open source data discovery and metadata engine. Mark built Amundsen while he was a product manager at Lyft and started Stemma to offer a fully managed Amundsen.

It was a fun conversation about the space. Below is the video and below that, the transcript.

In Conversation with Aaron Katz, Co-Founder & CEO, ClickHouse

Ask anyone who spends time in the data ecosystem, and the name “ClickHouse” is one that has come up countless times in conversations over the last few years.

ClickHouse is a real-time OLAP (meaning, analytical) database that is known for its performance and scalability, and has a wide footprint of users around the world.

ClickHouse started its life at Yandex, the Russian search giant. It was originally created as an internal web analytics tool called Metrica, which evolved around 2009 into “Clickstream Data Warehouse” or ClickHouse for short.

The product was open sourced in 2016 and became a very popular project, with adoption at impressive scale by a number of companies including Yandex (10s of trillions of rows), Uber, eBay, Cloudflare, Spotify, Deutsche Bank, and more.

ClickHouse was spun out in early 2021 into ClickHouse, Inc., a commercial company co-founded by Aaron Katz, Alexey Milovidov (ClickHouse’s creator), and Yury Izrailevsky (ex-Google VP Engineering), with a focus on bringing ClickHouse to all types of companies via a managed version.

ClickHouse Inc raised a $50M Series A announced in September, followed closely by a $250M Series B last month, in which my firm, FirstMark, participated.

It was a treat to welcome Aaron Katz, the Co-Founder and CEO of ClickHouse, Inc. to Data Driven NYC. Prior to co-founding ClickHouse, Aaron had extensive experience as a world-class sales leader, most recently as the Chief Revenue Officer at Elastic and the Senior Vice President of Enterprise Sales at Salesforce.

Below is the video and below that, the transcript.

The Data Mesh: In Conversation with Zhamak Dehghani

In the admittedly small world of people who obsess over data technologies, one of the hottest topics of the last year has been the “data mesh”.

Created by Zhamak Dehghani of ThoughtWorks, the concept struck a chord and made the rounds in countless conversations on Twitter and elsewhere.

As I highlighted in the 2021 MAD Landscape, the data mesh concept is both a technological and organizational idea. A standard approach to building data infrastructure and teams so far has been centralization: one big platform, managed by one data team, that serves the needs of business users. This has advantages, but can also create a number of issues (bottlenecks, etc.). The general concept of the data mesh is decentralization – create independent data teams that are responsible for their own domain and provide data “as a product” to others within the organization. Conceptually, this is not entirely different from the concept of micro-services that has become familiar in software engineering, but applied to the data domain.

It was a real treat to get to chat with Zhamak at our most recent Data Driven NYC.

Below is the video and below that, the transcript.

In Conversation with Elementl (Dagster), Meroxa and Superconductive (Great Expectations)

This last year has seen tremendous levels of activity for early stage startups in the data infrastructure ecosystem. At our most recent Data Driven NYC, we featured some of the rising stars:

Nick Schrock, Founder & CEO, Elementl (Dagster) | Elementl is building the next generation of open source data tools including Dagster, the open-source data orchestrator for machine learning, analytics, and ETL.
DeVaris Brown, Founder & CEO, Meroxa | Meroxa is a real-time data platform that gives data teams the tools they need to build real-time infrastructure in minutes.
Abe Gong, Founder & CEO, Superconductive (Great Expectations) | Superconductive is the team behind Great Expectations, the leading open source tool for defeating pipeline debt through data testing, documentation, and profiling. The company’s mission is to revolutionize the speed and integrity of data collaboration.

In Conversation with Ali Ghodsi, CEO, Databricks

Databricks is an enterprise software giant in the making. Most recently valued at $28B in a $1B fundraise announced in February 2021, the company has global ambitions in the data and AI space.

An unlikely story of a company started by seven co-founders, most of whom were academics, built around the Spark open source project, Databricks is heading towards a monster IPO that will accelerate its rivalry with its chief competitor, Snowflake.

I had a chance to interview co-founder and then-CEO Ion Stoica at Data Driven NYC back in 2015, when Databricks was a company very aggressively courted by VCs, but still very early in commercial traction.

It was a real treat to catch up with Ali Ghodsi, who took over as CEO in 2015.

Below is the video and below that, the transcript.

In Conversation with Victor Riparbelli (CEO) and Matthias Niessner (Co-Founder), Synthesia

One of the most exciting emerging areas for AI is content generation. Powered by anything from GANs to GPT-3, a new generation of tools and platforms enables the creation of highly customizable content at scale – whether text, images, audio or video – opening up a broad range of consumer and enterprise use cases.

At FirstMark, we recently announced that we had led the Series A in Synthesia, a startup providing impressive AI synthetic video generation capabilities to both creators and large enterprises.

As a follow up to our investment announcement, we had the pleasure of hosting two of Synthesia’s co-founders, Victor Riparbelli (CEO) and Matthias Niessner (co-founder and a Professor of Computer Vision at Technical University of Munich).

Some of the topics we covered:

The rise of Generative Adversarial Networks (GANs) in AI
Use cases for synthetic video in the enterprise
Synthetic videos vs deep fakes
What’s next in the space

Below is the video and below that, the transcript.

In conversation with Dev Ittycheria, CEO, MongoDB

MongoDB’s path from unlikely NYC enterprise tech startup to global category leader has been amazing to watch.

I’ve had the pleasure of hosting two of MongoDB’s co-founders over the years, first Dwight Merriman back in 2012 (here) and then CTO Eliot Horowitz in 2016 (here). So it was a real treat this time to get to chat with CEO Dev Ittycheria, who has been leading the company since 2014, and in particular has presided over the company’s remarkable ride in public markets since its 2017 IPO.

In addition to being a truly world-class CEO, Dev has had an outsized impact on the New York tech scene, as he’s been playing a central role both at MongoDB and also at Datadog, where he’s been a long time board member (after leading the company’s Series B back in 2014).

We had a wide-ranging conversation where we covered:

Dev’s journey as a CEO and investor
The evolution of enterprise tech in New York
MongoDB’s database as a service offering, Atlas
Newest products and product roadmap
Open source
GTM strategies, bottoms up vs top down
Lessons in scaling the team
Being a student of the game rather than a master of the game

In Conversation with Florian Douetteau, CEO, Dataiku

Dataiku (in which I’m a proud investor and board member) has had an impressive ride over the last few years. An early entrant in the enterprise Data Science and Machine Learning platform category, the company successfully expanded from its French/European roots to build a very strong presence in the US (where the company is now headquartered) and, increasingly, Asia.

Along the way, Dataiku:

became a unicorn, most recently raising a $100M Series D in 2020
was named a “Leader” in Gartner’s Magic Quadrant for Data Science and ML Platforms in both 2020 and 2021
collected many accolades, such as CB Insights’ “AI 100” and several Forbes lists: “Cloud 100”, “AI 50” and “America’s best startup employers in 2021”

It was really fun to host CEO Florian Douetteau at Data Driven NYC once again, after previous appearances in 2016 (here) and 2018 (here). We covered a bunch of different topics, including:

What enterprise AI is about: not flying cars, but optimizing hundreds of business processes
Why enterprises need to move past their fear of data and AI
The key principles behind the design of the Dataiku platform: handling the entire data lifecycle, and democratizing data/AI across teams
Dataiku’s partnership with Snowflake
The upcoming launch of their starter / SMB self-serve product, Dataiku Online

Below is the video and below that, the transcript.

In Conversation with Bindu Reddy, CEO, Abacus

At our most recent Data Driven NYC, we had the great pleasure of hosting Bindu Reddy, CEO and co-founder of Abacus AI, and formerly GM & creator of AI verticals at AWS, and an ex-Googler. Bindu also has a very witty and entertaining Twitter account (@bindureddy), where she talks about all things machine learning and AI.

This was a very educational and approachable conversation, where we covered:

Some key definitions: neural networks, weights and biases, supervised vs unsupervised learning, feature store
Applying neural networks to structured, tabular data
Abacus’ vision around “autonomous AI”
How companies wait too long to start experimenting in ML/AI

Below is the video and below that, the transcript.

In Conversation with Dave Burgess, Head of Data Engineering, Pinterest

Pinterest is near and dear to our hearts at FirstMark because we had the good fortune of being the first institutional investor back in 2009 when the company was just getting started (fun fact: the founders were in New York for a brief moment in time before moving to the Bay Area). Pinterest has had a remarkable ride ever since, and it’s a $49B market cap public company at the time of writing.

So it was a particular pleasure to welcome Dave Burgess, Head of Data Engineering, to come and talk to the Data Driven NYC audience about all things data at Pinterest.

We covered a bunch of interesting topics, including:

Pinterest’s newly open sourced project, QueryBook
The stack Pinterest uses to manage its 400 petabytes of data
The use cases for data analytics and machine learning at Pinterest

Below is the video and below that, the transcript.

In conversation with Arjun Narayan, CEO, Materialize

Real-time data streaming is an increasingly crucial part of the data ecosystem. While financial services (trading) initially represented the bulk of the demand for streaming, the emergence of more mature technology in the space has unlocked more use cases, which in turn created more demand for better technology.

At a recent Data Driven NYC, we had a very interesting conversation with Arjun Narayan, CEO of Materialize, “the only true SQL streaming database for building internal tools, interactive dashboards, and customer-facing experiences”. Materialize is headquartered in New York and has raised $40M in venture capital money (with a new round rumored to be announced soon, at the time of writing).

This was a very educational discussion, where we covered the following topics:

What is streaming? What is Kafka?
Why is there a need for a streaming database for analytics?
Why is SQL underrated?
What is Materialize?
Partnering with dbt to make streaming ubiquitous
Materialize’s roadmap

Below is the video and below that, the transcript.

In conversation with Chip Huyen, Writer and Computer Scientist

At our most recent Data Driven NYC, we had the great pleasure of hosting Chip Huyen, a writer and computer scientist who also teaches machine learning design at Stanford, for a fascinating and fun conversation.

We covered a range of topics, including:

What is machine learning design?
The MLOps landscape, and how it’s both over-developed and under-developed
What is online machine learning?
The divergence between East and West for machine learning and data infrastructure
A couple of book recommendations

Below is the video and below that, the transcript.

In Conversation with Jack Hanlon, VP Data, Reddit

While it’s been around for 15+ years, Reddit has been on a tear lately: a $367M Series E round announced a few weeks ago, rumors of an IPO, and plenty of Internet action with r/wallstreetbets in particular.

Interestingly, there was a major gap for many years between the central role Reddit has been playing on the Internet and its relatively small team size. While companies like Facebook are largely AI companies (see our conversation with Jerome Pesenti, Head of AI, Facebook), Reddit’s data team was tiny.

Enter Jack Hanlon, VP Data at Reddit and our guest at our most recent Data Driven NYC event. Jack has been tasked with leading the data team into rapid growth, and we had a really interesting conversation, in particular around the following points:

How is the data team at Reddit organized? (preview: data science, data platform, machine learning, search)
What’s the data stack? (preview: switch from AWS to GCP, Kafka, Airflow, Colab, Amundsen, Great Expectations, Druid/Imply…)
What are the key use cases for data science and machine learning at Reddit?
A book recommendation: “Invisible Women: Data Bias in a World Designed for Men”

Anecdotally, Jack is our second speaker in recent memory who was a regular attendee in the early years of Data Driven NYC, before ascending to leadership responsibilities in a major Internet company! (the other being Alok Gupta, who spoke about leading data at DoorDash).

Below is the video and below that, the transcript.

In conversation with Guy Podjarny, Founder & President, Snyk

In just a few years of hyper growth, Snyk has become a $2.7B unicorn, most recently raising $200M in September 2020. A developer-first security company, it has also helped usher in the “DevSecOps” category.

At our most recent Data Driven NYC, we had the pleasure of hosting its Founder & President, Guy Podjarny, zooming in late at night from Israel.

We covered many interesting topics, including:

What does DevSecOps mean?
How did Snyk initially get developers to care, and how did they expand horizontally from there?
What is infrastructure as code?
Thoughts on Snyk Code and Snyk’s vulnerability database
The nuances of combining a bottoms-up, freemium motion focused on developers, with an enterprise motion focused on economic buyers of Snyk’s products

Below is the video and below that, the transcript.