In the world of data infrastructure, dbt Labs has undoubtedly been one of the most exciting startups to watch. The company is the creator and maintainer of dbt, a data transformation tool that enables data analysts and engineers to transform, test and document data in the cloud data warehouse. Beyond this, the company is empowering a new generation of data analysts and enabling them to create and disseminate organizational knowledge.
dbt Labs’ CEO, Tristan Handy, is also one of the most thoughtful and interesting CEOs in the space, having played a pivotal role in the emergence of what’s often referred to as the “Modern Data Stack”, a suite of tools and processes that leverage the power of cloud data warehouses to bring data processing into the modern era.
We had the pleasure of hosting Tristan once during the pandemic in 2021 for a great online chat with Jeremiah Lowin, CEO of Prefect. It was a particular treat to welcome back Tristan, this time for our first in-person event since 2020!
Below is the video and full transcript. As always, please subscribe to our YouTube channel to be notified when new videos are released, and give your favorite videos a “like”! Also, if you’re in New York or come visit from time to time, please join the meetup group!
(Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen, Karissa Domondon and Diego Guttierez. Also, a major THANK YOU to ADP / Lifion for hosting us in their beautiful space in Chelsea in New York).
TRANSCRIPT [edited for clarity and brevity]:
[Matt Turck] (00:03) … The company was founded in 2016?
[Tristan Handy] (00:08) 2016, yep.
(00:11) You personally are based in Philadelphia, but I think dbt is a globally distributed company, remote first?
(00:18) We made the decision in 2018, I think, to distribute the company. My two co-founders and I had previously worked together at a company called RJMetrics, based in Philadelphia. We had had a lot of challenges growing the engineering team at the velocity that we needed to while purely based in Philadelphia. And so we were like, we’re not going to try to do that again. So really, we were a distributed-first company for two years before it was cool.
(00:52) Jumping in, because I think that’s a really interesting topic: now that you can be back in person, how do you manage that?
(01:04) I’ve become a disciple of the GitLab handbook. I’m sure that people will continue to use GitLab for a long time, but I think people will continue to use the GitLab handbook for decades and decades. And so we’ve copied off of them: first it was salary bands, and now it’s in-person hybrid strategy. Sid says, “Distributed work does not equal work from home,” and we take the same view. We have a stipend for everybody to work outside of their home, in co-working spaces. And then we also do a lot of in-person meetups, whether it’s team level or department level or company-wide. We do company-wide meetups once a year.
(02:01) And for the same reason we’re all here tonight: there’s stuff that Zoom doesn’t do for you. For anybody in the audience who’s used dbt, you’ve probably gotten the sense that it’s a little bit of a different product experience than most data products you’ve used before. And I think that there’s a lot of counterintuitive stuff that went into the beginning of it. Anytime you exist inside of a community but you decide to eschew the best practice, the conventional wisdom that it espouses, it’s actually easier to do that from the outside. And so I didn’t have any friends who were like, “Yeah, SQL is not cool at all,” back in 2016, because I didn’t have any friends talking about data.
(02:54) Sometimes, if you want to do something different, it’s nice to be on the outside of the community. I wrote a blog post back in 2016, back when I still wrote blog posts, because I couldn’t find enough consulting clients to fill my day. The title of the blog post was “How to Build a Modern SaaS-Based Analytic Stack” or something like that. And it was essentially: plug together Fivetran and Stitch, and at the time it was just Redshift, and a BI tool like Mode or Looker or something like that. And the modern part of that was that, one, this system could kind of do anything. It was a reaction to the prior generation of analytics products, which, if you cast your mind back to 2016, was like Google Analytics and Mixpanel.
(03:56) These were very vertical-specific tools, and you were very constrained in the set of things that you could know about the world in a given tool. And so this was a little bit the best of both worlds. You had consumer-grade experiences plugging these tools together, and yet you could ask arbitrarily complex questions. We started as a consulting business, we were called Fishtown Analytics, and the wonderful thing about it was that I was very confident that in any conversation with a client, when they asked, “Can you do this for me?” I could always answer, “Yes.” Every single conversation with a business stakeholder in a data context is like, “That’s great, but can you help me understand this other thing?” And the answer in the modern data stack was always “Yes,” and it doesn’t take 10 data engineers to do it. There’s nothing wrong with data engineers, but you like a certain amount of agility. You want to be able to turn around that answer quickly, as opposed to spinning up an agile project to work on it.
(05:09) Let’s talk about the Modern Data Stack, and what it means.
(05:16) The original modern data stack was four layers. It was data ingestion, how do you get your data from all of your different upstream systems. It was data storage or warehousing, and how do you actually store and compute data. It was transformation. And then it was analytics, whether you wanted to define that as BI or notebooks or whatever.
There have always been more data analysts than data engineers. There are, I don’t know, probably two orders of magnitude more data analysts than data engineers. And so now that you have Redshift, and you can do arbitrarily complex compute inside of this very simple infrastructure, where you just show up with a SQL terminal and you can do whatever you want, people like me are going to want to use that themselves, and not have all of the real fun work, the data transformation, done upstream by data engineers in Scala or Python or whatever.
(06:22) There was this infrastructural shift that the cloud data warehouse represented that really… You always have an infrastructural shift, and the very first thing that happens is you plug it into the existing paradigm. One of my favorite examples of this is how factories used to be laid out with this central line, because that’s how steam power used to get transmitted down the center of a factory. And it took 30 years for electrification to actually show up in productivity statistics for factories, because they actually had to lay out the factories differently.
So what happened with Redshift was that you got data engineers who still did ETL. Extract, transform, load. And they just loaded the data into Redshift. But they were still doing transformation and extraction in the same technologies that they were doing before.
(07:15) But the real paradigm shift for Redshift was not that you could do the final step differently and better. It was that you could do the whole thing differently and better. You could give the keys to the castle to the data analyst, to do the whole thing. And again, sometimes people get defensive, the data engineers in the audience. This is not a diatribe against data engineers. It’s just that there are actually two orders of magnitude more human beings on the planet that can write SQL than can write Spark or Scala or whatever. So we should want to empower these humans. So ELT is really allowing data analysts to go upstream: you extract the data from source data systems, you load it into, originally Redshift, but now Snowflake and BigQuery, et cetera. And then you transform it once it’s there, and you transform it in SQL.
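To make the ELT ordering described above concrete, here is a minimal sketch in Python, with the standard-library sqlite3 module standing in for a cloud warehouse such as Redshift. The table names (raw_orders, orders_clean) and the sample rows are hypothetical. The only point is that the raw data is loaded untouched, and the transformation happens afterward, in SQL, inside the database.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (hypothetical data).
raw_rows = [
    ("1001", "2021-03-01", "19.99", "completed"),
    ("1002", "2021-03-01", "5.00", "cancelled"),
    ("1003", "2021-03-02", "42.50", "completed"),
]

# sqlite3 is only a stand-in for the warehouse (Redshift, Snowflake, BigQuery...).
conn = sqlite3.connect(":memory:")

# Load: land the data unmodified, every column as text.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, ordered_at TEXT, amount TEXT, status TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: done after loading, in SQL, inside the warehouse.
conn.execute(
    """
    CREATE TABLE orders_clean AS
    SELECT
        CAST(order_id AS INTEGER) AS order_id,
        ordered_at AS order_date,
        CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE status = 'completed'
    """
)

daily_revenue = conn.execute(
    "SELECT order_date, SUM(amount) FROM orders_clean "
    "GROUP BY order_date ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

In the older ETL ordering, the casting and filtering would have happened in a separate program before anything reached the warehouse. Here, anyone who can write the SELECT statement can own that step.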
(08:17) What does data transformation actually mean?
(08:24) My favorite example of what data transformation is: we worked for a grocery delivery company. One of the most challenging problems that this company experienced was that they needed to calculate cost of goods sold for their orders, and cost of goods sold for every order was different. So the cost of goods sold needed to be able to go down to the individual product SKU level. So you needed to say, “What’s the COGS for one of those little bunches of green onions?” And it turns out that calculating the cost of goods sold for a bunch of green onions was tremendously complicated. You relied on all this input cost data and how big were the bunches and all this stuff. And so this group of three or four of us would have these long conversations about, what does it mean?
(09:25) What does that even mean, cost of goods sold for green onions? And you then eventually get to a place where you’ve kind of sorted that out. You’ve defined what that means. And you save all that knowledge into one table or a small number of tables. And then literally nobody else at the business ever has to think about that again. It’s this really tremendously annoying problem that thankfully a small group of people can solve, and then if you’ve documented it well, and you’ve done your modeling well, everybody else can just kind of consume it. So data transformation is really this process of taking this raw data and applying business context to it and creating these curated data sets that the rest of the organization can use as interfaces to the data or the knowledge that the organization… Without having to literally build up an understanding of how every single business process works from the ground up in order to be able to do literally any analysis at all.
(10:26) There’s been a lot of excitement around dbt from both the market and VC investors based on the perception that dbt Labs, the company, and dbt Core, the project, own this transformation layer. Do you want to explain what dbt is and what it does?
(10:44) dbt is the T in ELT. I was just talking about how this re-architecture… So dbt does not ingest data into your warehouse, it transforms it once it’s in your warehouse. The funny thing about that is that if the data’s already in the warehouse, then the only thing that you need to do to transform that data is write SQL. And you can do that in a couple different ways. You can create a view that abstracts some business logic, or you can create a table that stores the results of a query, or you can incrementally update the data in a certain table.
(11:24) dbt allows data analysts, analytics engineers, data engineers, to write these small bits of logic, modular business concepts, and slowly build up a directed acyclic graph, a DAG, of these concepts. And you go from left to right, and you start at the source data, and you slowly build up all of these concepts, and you eventually get to a place where you’re dealing with business concepts that can be productively analyzed. And dbt is the framework that allows you to both express all of that in code, but then also to run it against your database and materialize all that stuff.
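The idea of small SQL models that build on each other and get materialized in dependency order can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how dbt itself is implemented: the MODELS dictionary, the run_models helper, and every table name are hypothetical, dependencies are listed by hand, sqlite3 stands in for the warehouse, and only the view and table materializations are shown (incremental updates are left out).

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical models: name -> (materialization, upstream models, SELECT statement).
MODELS = {
    "stg_orders": (
        "view",
        [],
        "SELECT order_id, customer_id, amount FROM raw_orders WHERE status = 'completed'",
    ),
    "customer_revenue": (
        "table",
        ["stg_orders"],
        "SELECT customer_id, SUM(amount) AS revenue FROM stg_orders GROUP BY customer_id",
    ),
    "top_customers": (
        "view",
        ["customer_revenue"],
        "SELECT customer_id FROM customer_revenue WHERE revenue >= 50",
    ),
}


def run_models(conn, models):
    """Materialize each model after its upstream models; return the run order."""
    graph = {name: set(deps) for name, (_, deps, _) in models.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        materialization, _, select_sql = models[name]
        keyword = "VIEW" if materialization == "view" else "TABLE"
        conn.execute(f"CREATE {keyword} {name} AS {select_sql}")
    return order


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer_id TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "a", 30.0, "completed"),
        (2, "a", 25.0, "completed"),
        (3, "b", 10.0, "completed"),
        (4, "b", 99.0, "cancelled"),
    ],
)

run_order = run_models(conn, MODELS)
top = conn.execute("SELECT customer_id FROM top_customers").fetchall()
print(run_order, top)
```

Each model only needs to know about the concepts directly upstream of it. The graph, not the author, decides the order in which things run.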
(12:11) I read or heard somewhere that one way to think about it is as an abstraction layer comparable to Rails, where instead of writing a bunch of things, you can just write one or two lines and, through dbt, you end up very richly expressing what you meant.
(12:26) Yeah. Many of our careers, mine included, do not go back to the nineties, when people still wrote every web application in raw HTML. But that’s where internet programming started out: you wrote every single line by hand. And then you got web frameworks. And once you got web frameworks, you were never going to go back. It’s not like you were ever going to throw away the framework, because it cut down the number of lines of code you had to write by, I don’t know, 75% or more. It’s a tremendous increase in abstraction and increase in productivity. And so I really think that, you look at the launch of Airflow in 2015, I think it was 2015, and it was this great kind of container. It was just like, here’s a way to run a bunch of code on a schedule. And like, well, what code?
(13:29) And the answer was, well, any code. And so people just started writing the equivalent of raw HTML, and that is fine, but it’s very low leverage. And so dbt is an attempt to start moving us up this abstraction level. As a profession, data is probably two decades behind software engineering in terms of the productivity of practitioners and the level of abstraction and everything. I wrote this blog post in 2016 on how to build a mature analytics workflow. It was essentially saying that all of these practices that have matured over decades in software engineering, we just need to replicate them over into data: deployment processes and testing, and all of these different things. And the whole concept that you should work on documentation that is natively built into the code so that it doesn’t get out of date, and all of this stuff.
(14:36) This was a novel concept. Back in 2016, data practitioners were sending each other SQL files as attachments to emails, and that was the way that we worked together. Early-stage VCs that I spoke to back in 2016 told me that dbt was kind of a non-starter, because it wasn’t at all clear that data practitioners actually wanted to learn Git. Fortunately, there has been this explosion of data tooling companies, especially over the past two years, that do more and more of this stuff. Honestly, at the beginning, it felt like we were going to have to do all of it, which is why you see us do documentation and testing and deployment and everything. Originally, it was a little bit threatening because, oh my gosh, how are we going to fit into this new, ever more crowded ecosystem? But it’s actually been wonderful to have new folks join this party and realize that it’s going to require an entire ecosystem of vendors to recreate this kind of software engineering mindset.
(15:53) dbt for a long time was, and still is, a very popular open source project that you built. I think you started it after RJMetrics, while you were consulting. Fishtown Analytics, which morphed into dbt Labs, was a consulting company. You’re now a super well funded startup, and there is now a product called dbt Cloud, which is the commercialization effort around dbt. What does that do? And how do you think about it versus the open source project?
(16:34) The original thing that dbt Core did was it provided a language to express data transformations, and it provided a command line interface to actually execute them. We were out in the world actually doing consulting projects, so I was… The backstory with me and venture funding was that, prior to starting Fishtown Analytics, I had worked for seven years in three different VC-backed companies. I don’t know if any of you work at VC-backed companies, but it can be a reasonably high burnout environment. So I was a little bit burned out, and I was like: no external capital, no external expectations. I’m going to fund this on revenue. And so we did that for three and a half years. We paid the bills via consulting. At the time, the only thing that existed was dbt Core. And we clearly needed a way to operationalize this. We’re working with clients, we’ve got all these great jobs described, but you need to actually update data on a schedule, whether it’s every four hours or once an hour. It’s not twice every second.
(17:50) And so that was, we originally called it Sinter. We didn’t even anticipate that it was going to be an associated commercial product, but it got more and more users over time. And what we’ve realized is that dbt Core presents this wonderfully concise surface area for an open source project. It allows you to describe what should be true about your data. It’s stateless, you write code in it. It kind of functions as a compiler. And then dbt Cloud is how you actually make that stuff true in reality. It includes a scheduler. It includes a metadata API to actually ask what is true about your production systems today. It includes an IDE to actually help you author this stuff. But this divide between “describe your data pipelines in code” and “actually help me manifest them in reality” is the Core/Cloud split.
(18:56) The classic problem is, any organization of sufficient size has multiple different ways to analyze data. You’ll never get rid of spreadsheets. You will always have some kind of BI tool or multiple BI tools. You’ll probably have a notebook experience. You’ll always have multiple of these ways of analyzing data. Some of them have no governance layer at all. Some of them have a governance layer that’s bespoke to that particular tool. And so there’s this real need to take the governance… We were talking about the cost of goods sold with green onions. There’s also: what is revenue? What are orders? What are all of these business concepts? And so there’s this desire to push that upstream to dbt. And it turns out that, just the way that I was talking about before, how data transformation in a data warehouse context is just writing SQL, defining metrics is just writing SQL.
(19:57) And so what dbt is doing is taking all of this ability to write SQL really effectively, with leverage, and exposing that in an interactive context. So we’ve always been good at this batch-based context. Now we’re building an interactive context where a user in a BI tool, or in a notebook, or wherever, can say, “Hey, I want revenue. And I don’t actually know how to write the SQL to get revenue. I’m just going to ask you for revenue.” What dbt’s going to do is it’s going to actually rewrite that query. It’s going to get the canonical definition of revenue. It’s going to execute that against the warehouse, and then bring the result set back. Then that layer is going to sit in between the BI tool and the data warehouse for all those different BI tools, so that you can present a consistent view of those metrics to every user.
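The “just ask for revenue” flow described above can be sketched as a small lookup-and-rewrite step. This is a hypothetical illustration, not dbt’s metrics API: the METRICS registry, the compile_metric_query function, and the orders table are all invented for the example, and sqlite3 again stands in for the warehouse.

```python
import sqlite3

# Hypothetical registry: one canonical SQL definition per metric.
METRICS = {
    "revenue": {
        "table": "orders",
        "expression": "SUM(amount)",
        "filter": "status = 'completed'",
    },
    "order_count": {
        "table": "orders",
        "expression": "COUNT(*)",
        "filter": "status = 'completed'",
    },
}


def compile_metric_query(metric, group_by):
    """Rewrite a request such as ("revenue", "region") into SQL.

    A real layer would validate `group_by` against known columns; this
    sketch interpolates it directly and is not safe for untrusted input.
    """
    definition = METRICS[metric]  # KeyError if nobody has defined the metric
    return (
        f"SELECT {group_by}, {definition['expression']} AS {metric} "
        f"FROM {definition['table']} WHERE {definition['filter']} "
        f"GROUP BY {group_by} ORDER BY {group_by}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("east", 10.0, "completed"),
        ("east", 5.0, "cancelled"),
        ("west", 7.5, "completed"),
        ("west", 2.5, "completed"),
    ],
)

# The caller never writes the revenue logic; it only names the metric.
revenue_by_region = conn.execute(compile_metric_query("revenue", "region")).fetchall()
orders_by_region = conn.execute(compile_metric_query("order_count", "region")).fetchall()
print(revenue_by_region, orders_by_region)
```

Because every tool goes through the same definition, a change to what counts as revenue (for example, excluding cancelled orders) happens in one place.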
(20:51) Where do your ambitions start and stop in terms of roadmap for the next couple of years?
(20:57) The thing that is neat about the position that we’re in right now is that we get to ask the question, “How should all this stuff work?” Not what is the one piece that we can build, but, oh gosh, we actually have a lot of people using this thing. And that gives us an opportunity to say, “Let’s build something that maybe no one’s actually been able to build before.” One of the nice things about dbt is that it allows you to create this map that spans the entire graph of computation inside of an organization, from the data landing in the warehouse, all the way through to people using the data on the other side. But dbt actually understands, “Hey, this is a data source. This data’s coming from Fivetran.”
(21:45) And it knows, “This is a data transformation, it’s executing on Snowflake.” Or, “This is a Python based data transformation, it’s executing on Databricks.” And then, “Here is a Looker dashboard that’s querying this table,” et cetera. So anybody in the data ecosystem that’s building a product or in-house tooling, can query this API and say, “Hey, tell me the state of my data.” You can ask questions like, “Is this data source outdated?” Or, “Does this transformation power a downstream dashboard?” So one of the things that most of the practitioner space in the dbt community doesn’t actually understand is that the dbt Cloud API is now powering dozens and dozens of partner applications, because it turns out this knowledge is really, really critical.
(22:41) As we move forward, we’re not looking to own cataloging or own whatever, these different categories. We’re looking to be the infrastructure that powers this ecosystem, because it turns out that you don’t actually want to connect to four different competing metadata APIs. You just want to plug into where all that knowledge sits. There’s no way in the world that Apple was going to build every experience on the iPhone, but they had to build some of the foundational ones, and the APIs, such that this innovation ecosystem could bloom. If you didn’t have the App Store, then all of the downstream innovation wouldn’t have happened, because you actually need to get people to a place where the amount of work that needs to be done to create an app is constrained enough that it can be economically done by enough vendors. So our goal is actually to continue to make it easier and easier to innovate and solve these problems. And we’re helping to build APIs to make that happen.
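The kinds of questions mentioned in this answer (“Is this data source outdated?”, “Does this transformation power a downstream dashboard?”) are graph queries over metadata. Below is a minimal sketch of how they could be answered. It does not use or mirror the dbt Cloud API: the node names, the CHILDREN and LAST_LOADED mappings, and the helper functions are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical metadata: node -> type, and node -> direct downstream nodes.
NODE_TYPES = {
    "fivetran.orders": "source",
    "stg_orders": "model",
    "customer_revenue": "model",
    "revenue_dashboard": "dashboard",
    "scratch_model": "model",
}
CHILDREN = {
    "fivetran.orders": ["stg_orders", "scratch_model"],
    "stg_orders": ["customer_revenue"],
    "customer_revenue": ["revenue_dashboard"],
}
# Hypothetical record of when each source last received data.
LAST_LOADED = {"fivetran.orders": datetime(2022, 6, 1, 8, 0)}


def downstream(node):
    """Return every node reachable from `node` by following edges downstream."""
    seen, stack = set(), [node]
    while stack:
        for child in CHILDREN.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def powers_dashboard(node):
    """Does anything downstream of `node` have type 'dashboard'?"""
    return any(NODE_TYPES[n] == "dashboard" for n in downstream(node))


def is_outdated(source, now, max_age):
    """Has `source` gone longer than `max_age` without a load?"""
    return now - LAST_LOADED[source] > max_age


now = datetime(2022, 6, 1, 20, 0)
print(powers_dashboard("stg_orders"))
print(powers_dashboard("scratch_model"))
print(is_outdated("fivetran.orders", now, timedelta(hours=6)))
```

A partner tool would not need to rebuild this graph itself. It would query whichever system already holds it.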
(23:49) Audience question: (23:58) I get the impression that dbt’s pushing the idea of SQL first when you think about how you write your data transformations, which feels at odds with trying to build abstraction layers on top of SQL, because with dbt, you compile your SQL and you hope it’s valid code that runs against your warehouse.
(24:15) We’ve become very well identified with SQL maximalism, and that’s not actually the point of view. The point of view is, one, the persona that we care so much about primarily speaks SQL. And two, we really believe in bringing the code to the data, and not the data to the code. And the data environment that we started in was the data warehouse. And so that was an environment that spoke SQL. Data warehouses are now moving towards supporting multiple languages. We really do think that the future of data processing is polyglot, and I think that if you look in five years, you will find more robust abstractions on top of data, even in the dbt ecosystem, than SQL. That’s not me making product roadmap statements, but I think that’s the direction that things are moving in.
Audience question (25:17) What workflows should people not use the modern data stack for?
(25:20) Right now, what is typically known as the modern data stack, you would be correct in saying that is not that well identified with the machine learning data science part of the world. And I think that that’s for a bunch of historical reasons that don’t necessarily have to be true in the future. But I think legitimately, if you look at the main processing platforms of today, inside of the modern data stack, they have their roots in data warehousing and not in ML. And so it will take some work to plug these things together. Again, if you look in five years, I think that this distinction will have been sanded over and will not be salient anymore. But I think that today that’s still roughly true.