As we close an incredibly active year in the world of data infrastructure, it was a particular treat to host at Data Driven NYC two of the most thoughtful founders in the space, for an in-depth conversation about key trends.
Tristan Handy is the Founder & CEO of Fishtown Analytics, makers of DBT. DBT is one of the most popular, open-source, command-line tools that enable data analysts and engineers to transform data in their warehouse more effectively. Based in Philadelphia, the company raised both a $12.9M Series A and a $29.5M Series B, back to back in 2020. Tristan also does a great weekly newsletter, The Data Science Roundup.
Jeremiah Lowin is the Founder & CEO of Prefect. Prefect is the new standard in dataflow automation, trusted to build, run, and monitor millions of data workflows and pipelines. As another leader in the open-source world, Prefect powers data management for some of the most influential companies in the world.
We had a wide-ranging conversation, covering lots of topics: the modern data stack, data lake vs data warehouse, empowering data analysts, workflow automation, etc.
Video and full transcript below!
As always, Data Driven NYC is a team effort – many thanks to Jack Cohen for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!
FULL TRANSCRIPT (lightly edited for brevity and clarity)
[Matt Turck] I’d like to start this conversation at a pretty high level with this concept of the modern data stack, which has been around for a little while now but has really been going mainstream over the last couple of years. Do you want to take a crack at defining what that is?
[Tristan Handy] I’ll talk about it from our perspective, but I’m actually interested, Jeremiah, to hear how much your answer differs from mine. When we talk about the modern data stack, we think about what’s really four layers. There’s data ingestion, there’s the data warehouse, there’s data transformation or taking all that raw data and turning it into something valuable, and then there’s data analysis, which could be BI, or Notebooks, or whatever.
That set of technologies has really been completely rebuilt, I think, over the past seven years. Really, it was the introduction of Amazon Redshift in 2013 that sparked a rewrite of all of the products in that space. In that timeframe, you had Looker get founded and be sold to Google. You have Fivetran. You have a whole set of products at every layer of that stack in there, primarily, but not exclusively, used for descriptive statistics like classic business intelligence, like what is going on in my company today? And I think the thing that I’m really curious about, Jeremiah, is do you use that word, the “modern data stack,” in the same way that I’m describing it? Or do you think about yourselves as a part of a different version of that?
[Jeremiah Lowin/JL] That’s a great question. To be honest with you, selfishly, I think one of the things I’m very excited about this conversation is you and I are swimming in adjacent lanes of the same pool, but we’re not obviously doing the same stroke with the same objective. There are many things that you just said that absolutely come to mind for me, but I actually do think about it slightly differently.
I think what you just described, I completely agree, is the stack piece of it. And what I’m very focused on is probably the modern piece of it and what that’s come to mean as a differentiation. The two are both critical. I’m not trying to say one over the other. I fully agree there’s this layered thing, this data stack that’s emerged, but to me when I think about what makes it modern versus what it used to be is a couple of competing tensions or frictions.
One of them is a set of standardizations, mostly in our world, in the PyData world, and in the open-source world, data science, and machine learning. But in addition, we see this vast experimentation, and so we see this proliferation of edge cases and exploratory analyses. I think that one of the things that links this all into a cohesive modern data stack is the tooling that allows someone who simply wants to take a very specific data extract and run some cutting-edge, or just-invented-on-the-spot, algorithm on it to do that in a way that integrates natively with their entire company’s data backbone all the way back to the ingestion.
And so it’s the idea that the layers of this stack that Tristan just described are no longer siloed but have actually embraced a set of interfaces or interoperable standards that have allowed people to participate without only existing in, just for example, the BI layer. If we think back to BI tools, say, 5, certainly 10 years ago, you would beg for a CSV. And God help you if the insight you were looking for was not in that CSV; you were screwed. But now, thanks to great tooling, and insight, and standardization, and education, and communication, that person is now empowered to actually potentially go directly into the data warehouse if desirable or interface with the folks who can provide that. So to me, I echo the layers, and I think, also, about the way that they actually very smoothly work together.
[MT] To double-click on this, when you mentioned parallel universes, is one way of thinking about it that basically, in that data world or system, you have two families, I guess, or two categories? One is the world of analytics, meaning BI, where you have this stack, where indeed you have the Fivetrans of the world (we had George speak at this event a couple of events ago), and then the world of data warehouses. And then after the data warehouse, you have the Lookers of the world, so that’s one world. Then another world is more of operational, transactional data, in particular machine learning and AI-driven applications, that start with a data lake and have a different type of workflow. Is that fair? Is that what you guys see?
[JL] I’m not sure if I disagree with the landscape you just outlined, but I’ll say it just how we think about it. We found it instructive to think about data engineering-centric workflows versus data science-centric workflows, and immediately that’s a controversial broad brush to paint with.
What we mean by that is we mean, for lack of a more general – as pipelines are built, or as tooling is built, or as workflows are built, is the principal item of interest the status of the job? Did it succeed or fail? Or is the principal item of interest data being transformed through the pipeline?
We tend to think of the data transformations as data science-focused, and we tend to think of the state things as data engineering-focused. The reason that’s a bit controversial is that ETL, in many forms, is therefore, a data science-centric activity in that philosophy of the world. But nonetheless, we’ve found it’s a really useful way to categorize software tooling, and who consumes it, and for what purpose. So I won’t swear that it’s universally true, but that is how we separate.
So when we think about BI tooling, typically, we’re dealing with data transformations, and we put that in data science, or perhaps you could say the analytic world. When we think about orchestration or we think about automation, often we’re dealing with the status of something and the state of that thing, and we put that in the data engineering world.
[Tristan Handy] The line that we draw there is, and I don’t know that we spend a ton of time thinking about this, but I think it’s implicit in a lot of what we do, is about the real-time requirements of what you’re doing. There’s a lot of data science that’s done internally, where you’re doing predictive analyses where the data latency can be a day, or last quarter’s data is fine. And those data science workloads are often run on the exact same stack that I was talking about before, but the minute that you’re serving recommendations in a public-facing application and you need 100-millisecond or less response time, that is a very different thing.
That’s how we slice. Today, the DBT-focused stack can do that former thing, and it can do it extremely well. We don’t even attempt to think about that latter problem.
[MT] Okay, interesting. Different ways of clustering the data landscape.
[JL] I think not only is that so interesting, but Matt, you’re probably the person most responsible for making people have to draw these lines because you’ve got this map of thousands of companies on it and trying to categorize them. I mean, I think if you ask 10 people, you’ll get 10 different splits of that landscape. I really do.
[MT] To continue making this approachable for a bunch of folks that look at different levels of depth in that space, could one of you take a crack at just a high level on data warehouses? Because it’s been the year of Snowflake, and a lot of the excitement around the modern data stack is really around data warehouses. Could one of you do two minutes on why it’s a big deal? Why is Snowflake a big deal, why is BigQuery a big deal, and how that has changed things?
[Tristan Handy] I just gave a presentation on this. This is a topic I love so much. For the folks watching this, I am not a data engineer. I am not a data scientist. My career history is a data analyst, so SQL is my second language. Up until the release of Amazon Redshift in 2013, I didn’t work at a large company that had a Teradata license or a Netezza box in the server cabinet somewhere. There certainly were a small number of companies who paid very high license fees for those products, and they had great experiences, but most of us were using transactional databases to do analytic workloads. And that does not work very well.
There might be a large number of people who have tried to write an analysis on MySQL and found that’s not a wonderful experience. Amazon Redshift, in 2013, for the first time, released an OLAP database, a database designed for large scale query processing that you could purchase for $160.00 a month. So everyone could use it for the first time, and that literally fundamentally changed how you thought about doing analytics at every company that did not previously have a Vertica license.
I know that’s a technology-centric answer to that question, but it’s the same way that EC2 or S3 has changed the way that we think about not just building applications but then building startups around those applications.
[MT] And maybe one more word on elasticity. What’s fundamentally different other than the price? What is the big advantage?
[Tristan Handy] Yeah, so it addresses the low end of the market in terms of price, but then you can just start there, and you can just stay there forever. There’s no ‘too big’.
[MT] Do you want to explain the difference between a data warehouse and a data lake?
[Tristan Handy] A data lake is a bunch of files in some particular file format. It can be any one of a large number of formats, including CSV, that is just shoved into an object store. And then you can take a compute engine and run data processing on top of that object store. A data warehouse, ultimately, is the same. If you want to describe what Snowflake is, Snowflake is also that, but Snowflake is a very tightly coupled way of doing the file storage, of doing the indexing, of doing the processing, and so it is very, very good at certain types of data processing use cases.
I would say that probably where we are today, the data lake can do anything, but it also probably takes more work to do anything. Whereas the data warehouse has a more constrained set of use cases, but it is much easier to get up and running for those constrained set of use cases.
[JL] I think there’s an analogy to be made, although it’s a dicey one because we’re already in database land, but I think there’s an analogy to be made with some of the trade-offs of the NoSQL database versus the SQL database, where folks who’ve done it before would say, “You’re going to have a schema. It’s just a question of whether you define it upfront, or you figure it out later.”
And I think the data lake emerges as a way to defer, as a consequence of what Tristan just said, because most classic data warehouse providers require you to specify or at least know how you’re going to access your data. Thankfully, not quite as terribly as maybe a transactional database, but you still have to know how you’re going to access it, how you’re going to store it.
The data lake is a way to say, “Oh, we’ll chuck it over there. We’ll figure it out later.” But as we all know, at some point, you have to standardize your access to the data, and it’s really a question of when, I think, not if you do that.
[MT] Great. Alright, well, that’s a great level set. I’d love to spend a bit more time separately on what you guys are doing. Tristan, starting with you, talk about DBT and the evolution of ETL to ELT and where that fits in a data warehouse-centric world.
[Tristan Handy] The history, and I didn’t personally participate in a lot of the pre-version of this, but the history of data processing in data warehouses is that you tried to do as much of the data processing before the data got to the warehouses as possible. Data warehouses were constrained, expensive. If you wanted another Netezza box, you order it a month ahead, and it gets shipped to your environment. It’s not elastic.
Once you have infinitely scalable compute and storage inside the data warehouse, that constraint goes away. And this seems small, but it is actually, I think, a very important point: you also need SQL that can actually describe all of the things that you might want to do, which only actually happened between 2008 and 2012. Specifically, the window function standard as a part of the SQL 2008 revision made it such that you could now express all of the stuff you wanted to do in SQL.
And so what’s happening now or over the past five or so years, the transformation step now happens inside the data warehouse, and it happens in SQL. And because of that, it is now accessible to a dramatically larger number of people. You no longer have to have a PhD in parallel computing to participate in this process, which is ultimately like the distillation of knowledge at your company.
You take all this raw data and you transform it into something that actually has business meaning, and if you lock that up behind only people that have degrees in computer science, it’s a fairly small group of humans.
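As a rough illustration of the kind of expressiveness window functions add, here is a minimal sketch that runs a per-customer running total in SQL through Python's built-in `sqlite3` module. The `orders` table, its columns, and the sample rows are all hypothetical, and the sketch assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). It is not how DBT itself works, only a small example of a transformation expressed entirely in SQL.

```python
import sqlite3

# In-memory database standing in for a warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("acme", "2020-01-05", 100.0),
        ("acme", "2020-02-05", 50.0),
        ("globex", "2020-01-10", 70.0),
        ("acme", "2020-03-05", 25.0),
    ],
)

# The window function computes a running total per customer without
# collapsing the rows, which a plain GROUP BY cannot do.
rows = conn.execute(
    """
    SELECT customer,
           order_date,
           amount,
           SUM(amount) OVER (
               PARTITION BY customer
               ORDER BY order_date
           ) AS running_total
    FROM orders
    ORDER BY customer, order_date
    """
).fetchall()

for row in rows:
    print(row)
```

Each output row keeps its original grain (one row per order) while also carrying the cumulative amount for that customer up to that date.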
[MT] In layman’s terms, what is actually a transformation? What are some examples of operations that one does?
[Tristan Handy] The one that I always like to talk about is we have a self-service SaaS product. We use Stripe for our payment processing. Stripe gives us invoices. The invoices have start dates and end dates. But if you just add up all of the invoice totals for a given month, you don’t actually get monthly recurring revenue.
In order to get monthly recurring revenue, you have to take your invoices and “amortize” them over the period that they are for. So if you have an annual subscriber, you need to recognize that revenue over the course of 12 months. And so a data transformation, and it’s actually a reasonably complicated one, is to take your invoices and amortize them into monthly revenue such that you can just have this table that every month you just add up the numbers, and that gives you your monthly recurring revenue.
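To make the amortization concrete, here is a minimal sketch in Python. It is not the actual model Fishtown runs against Stripe data: the invoice fields (`total`, `start`, `end`), the sample amounts, and the assumption that every invoice covers a whole number of calendar months are all simplifications made for illustration.

```python
from collections import defaultdict
from datetime import date


def months_between(start, end):
    """Whole months covered by [start, end), assuming both dates
    fall on the same day of the month (a simplifying assumption)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def amortize(invoices):
    """Spread each invoice total evenly over the months it covers and
    return a dict mapping (year, month) to recognized revenue."""
    mrr = defaultdict(float)
    for inv in invoices:
        n = months_between(inv["start"], inv["end"])
        if n <= 0:
            raise ValueError("invoice must cover at least one month")
        monthly = inv["total"] / n
        for i in range(n):
            month_index = inv["start"].month - 1 + i
            year = inv["start"].year + month_index // 12
            month = month_index % 12 + 1
            mrr[(year, month)] += monthly
    return dict(mrr)


# Hypothetical invoices: one annual subscriber and one monthly subscriber.
invoices = [
    {"total": 1200.0, "start": date(2020, 1, 1), "end": date(2021, 1, 1)},
    {"total": 50.0, "start": date(2020, 3, 1), "end": date(2020, 4, 1)},
]

mrr = amortize(invoices)
for (year, month), amount in sorted(mrr.items()):
    print(f"{year}-{month:02d}: {amount:.2f}")
```

Summing raw invoice totals would report 1,200 for January 2020 and nothing for the following months. After amortization, the annual invoice contributes 100 to each of its 12 months, so adding up a month's rows gives that month's recurring revenue.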
[MT] A big theme here is the rise of the data analyst, as you mentioned. What is a data analyst compared to a data scientist or a data engineer? What should they be able to do, and how technical are they?
[Tristan Handy] That is the question, isn’t it?
[JL] That’s a complicated question.
[Tristan Handy] Here’s my personal opinion. Don’t throw tomatoes because I know that everybody has their own thoughts on this topic. I think that a data analyst is somebody who answers business questions with data, and they frequently will have a business or econ degree. They are interested in solving business problems, but they’re also not afraid of technology. And they often have learned all the technologies that are required to do a good job of answering those questions, whether that’s sometimes Excel, sometimes SQL, sometimes Python or R. But they don’t self-identify as technologists. They learn technology in the service of answering these business questions.
[JL] I like that definition.
[MT] It is a great definition. So they need to know SQL, they need to know a little bit of Python?
[Tristan Handy] If you want to take me as a representative, I know SQL very well. I know Python well enough that when I have a predictive problem that I run into, I can hack my way there, but I could not build a production application in Python. And I think that’s reasonably representative.
[MT] Is the underlying philosophy of DBT to help data analysts think like software engineers?
[Tristan Handy] Yes, that is completely true. I think that what we are really trying to do is widen the circle of people at an organization who can participate in this process, the process of creating new knowledge. We’re trying to empower the data analysts to be first-class owners of that process, as opposed to the previous experience, which was: I’m doing some analysis and then, oh, crap, I realize that I need some new dataset. And I need to file a ticket with somebody in data engineering who will eventually put it on their roadmap and get back to me when they can. And that is the death of the data analyst as a productive member of your team. They will get frustrated. They will leave. They don’t have great career paths, all of these negative outcomes.
[MT] And just a word on DBT. You guys started in 2016?
[Tristan Handy] Fishtown Analytics, my company, which maintains DBT, started as a consulting business, and DBT was the tool that I wanted to be able to do this work. And I honestly didn’t know how many data analysts wanted to work like software engineers. And it turns out that over time, we’ve built this community of over 8,000 people today who have bought into this as the way that they want their careers to look. That spawned a software offering. Originally, I thought that it was just going to be me and a couple of other folks who were using this thing.
[MT] Yeah. And by the way, fast forward to today. Congratulations on announcing your Series B that closed three months after the Series A, which is a testament to the overall excitement about what you guys have been building.
[Tristan Handy] It’s been a very interesting year for us. There’s a lot happening.
[JL] That’s awesome. Congratulations.
[Tristan Handy] Thank you.
[MT] Jeremiah, that’s a little bit your story as well, right, that you started building something for yourself, and then, at some point, decided that actually other folks may want to use that, then it could be an open-source project. Is that the story?
[JL] Yeah, it’s a version of it. And I was just thinking just when you were speaking. One of the things I believe so strongly, forgetting the data side a moment, but when we think about just startups, or companies, or whatever it is, is solving a real problem and not an invented problem. And if I’m advising someone who’s trying to start a company or something, so often you get pitched a solution, but if you asked people if they have that problem, I’m not sure they would recognize it.
And I think one of the things that we feel, certainly that we feel at Prefect, and Tristan, it sounds very much like you did with DBT at Fishtown, is we were solving a problem that you had, that I had. My background is as a data scientist and as a machine learning researcher, and more specifically, in risk management in the investing world. And I had a series of problems that were primarily related to a breadth of work, so across multi-asset portfolios, across many stakeholders.
Many of them were analytic, some of them were not. Some of them were boring, and rote, and straightforward, and some of them were incredibly complicated, and unproven, and experimental. Prefect was an attempt just for myself to solve those problems in a cohesive and unified way. And most importantly, because I mentioned some of the PyData ecosystem earlier, and I’m very much a believer in that. Most importantly, it was a tool that would allow me to continue to use the best tool for the job. In other words, I didn’t want to replace the tools that I used as a data scientist.
I’ve been using … I’m a contributor to Theano, actually, it was in the news recently because I think the PyMC3 folks are going to revitalize it, which I think is awesome. But I’ve been doing this for a long time with amazing tools, and I didn’t want to replace those. I didn’t want to rebuild them. I wanted to make sure they were all doing what they were supposed to do at the right time, and that’s where Prefect came from. It wasn’t until a bunch of other folks I worked with were like, “Hey, we would use that. We would pay for that.” The light bulb went off, and we went off to find the right way to deliver it. And the key answer to that is open-source as a way to deliver, distribute, and collaborate on software, but there’s definitely an echo of that. It was a selfish problem, to be completely honest. Now, I am not the person. I have an amazing team now at Prefect, who has advanced this so far beyond what it was back then. That’s not me back then, that’s just the initial impetus that, “Hey, there’s a thing. It has a name, we can describe it. We can solve it. Let’s move forward with that.”
[MT] Great. Where do you sit in that overall ecosystem that we were talking about? I guess maybe double-click on the problem you’re solving. What is workflow automation? What does a scheduler do at a high level?
[JL] Yeah, I think the principal problem that we solve, there’s two aspects to this. There’s a problem we solve, but before we can get there, let’s just talk about what it looks like, to understand where this problem comes from. People write code. They put that code into the world to do something, to take some action, to have some side effect, or produce some result. And you build enormously complex and beautiful applications as a consequence of putting all of that together.
But making sure that happens at the right time, in the right order, with the right dependencies, in the right environment, in the right cluster, I mean, on and on, and on, and on, is a complicated problem. And often, we refer to that as an orchestration problem. And just to hammer it home, I think it’s a very on the nose metaphor for it, but you think of an orchestra. Right? The orchestra is a collection of individuals who are playing music. There’s a composer, there’s an arranger, there’s the musicians, there’s a conductor. And again, we have this idea of everything needs to happen at the same time, the right time, the right place, the right note, everything like that.
That is actually not the problem that we think we solve, though, to be honest with you, because the idea of just running code at the right time and the right place, I mean, we can go back probably 40 or 50 years and find pieces of software that claim to do that with varying degrees of efficacy. We do do that with Prefect. We do automate workloads. We do automate code as a consequence of what we are trying to do.
What we’re actually trying to do is we’re worried about the case where the auditorium is burning down, or when the first violinist is sick, or when it’s too hot in the room, or when there’s a pandemic and the audience can’t attend. And what I’m describing is my … again, I mentioned I’m a risk manager for however many, one, two decades, or whatever it’s been.
The problem that I always had with my code was not getting it to run. That was easy enough. I would write a script. I would use a great tool. I could use something like DBT to make sure that I’m doing something with best-practice or getting out there. My problem was actually when I would wake up in the morning and not know if an error had taken place. And I would spend hours trying to figure that out.
Today, we work with clients who have saved hours, and hours, and hours of time, not by writing a single line of code using Prefect, but because they log in, and they look at a dashboard, and they see red and green lights. And they just know, “Oh, something needs to be restarted. We have to go figure something out. Something broke.”
That idea, that’s why we frequently describe our product as an insurance product because we deliver value, mainly when things go wrong. And this will sound crazy, maybe to say, but if everything goes the way one of our users expected it to go, they really don’t need our product. It’s a very weird product positioning for us to have that belief. If things work the way our users expect, they don’t need our product.
And again, the insurance metaphor is very useful there, and so we’ve named this problem the negative engineering problem because it’s not about what you’re trying to achieve, it’s about all the defensive work you have to do to make sure that it actually, in fact, took place.
[MT] How do you compare and contrast this versus Airflow and other products in the space? You had a very well-written and successful post on the topic, curious about how you position.
[Tristan Handy] Jeremiah, how does it feel to be in a category with Airflow, where everybody always wants to know how you compare versus Airflow?
[JL] We get this question a lot, actually. It’s a funny question to get because I am a PMC member of Airflow, and I gave many years of my life to building Airflow. And if I were able to do the things that Prefect does in Airflow, I would’ve. The funniest part about getting that question … So obviously, Tristan, you’re 100% right. The funniest part about getting the question so often is that Prefect was actually developed very literally to do things that I was unable to achieve with Airflow. Right? It was supposed to go off in a different way.
And now it happens that our functionality is a superset of Airflow’s functionality, and so there are many things, and we’re actually very pleasantly surprised to discover a lot of people are like, “Oh, Airflow is too hard. We’re going to use Prefect instead,” or however they show up at Prefect. But the primary set of use cases that we originally started with were to borrow those data engineering metaphors that are expressed in Airflow and actually deliver them to data scientists like myself, who otherwise lived in a world where there was no concept of a retry. There was no concept of state-based dependency, or anything like that.
I mean, we’ve written this blog post. It’s called: “Why Not Airflow?” Why would you, as a data scientist or someone using a modern data stack, not choose to use this well-known tool that I gave a few years of my life to build? And I can refer folks there for the answers, but in all honesty, it’s a very pleasant surprise to have to answer that question because our raison d’être is all about doing a whole different class of things around data science, machine learning, analytical workloads.
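For readers unfamiliar with retries and state-based dependencies, the two ideas can be sketched in a few lines of plain Python. This is deliberately not Prefect's or Airflow's API: the `State` enum, `run_with_retries`, `run_pipeline`, and the three toy tasks are hypothetical names invented for illustration, and a real orchestrator adds scheduling, logging, persistence, and much more.

```python
import enum


class State(enum.Enum):
    SUCCESS = "success"
    FAILED = "failed"
    SKIPPED = "skipped"


def run_with_retries(fn, max_retries):
    """Call fn, retrying up to max_retries times after a failure.
    Returns (state, result_or_last_error)."""
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            return State.SUCCESS, fn()
        except Exception as exc:  # a sketch: real code would be narrower
            last_error = exc
    return State.FAILED, last_error


def run_pipeline(tasks, max_retries=2):
    """tasks is an ordered list of (name, fn, upstream_names).
    A task runs only if every upstream task finished in SUCCESS."""
    states = {}
    for name, fn, upstream in tasks:
        if any(states[u] is not State.SUCCESS for u in upstream):
            states[name] = State.SKIPPED
            continue
        states[name], _ = run_with_retries(fn, max_retries)
    return states


calls = {"extract": 0}


def flaky_extract():
    # Fails twice with a transient error, then succeeds.
    calls["extract"] += 1
    if calls["extract"] < 3:
        raise ConnectionError("transient failure")
    return [1, 2, 3]


def always_fails():
    raise ValueError("bad data")


def report():
    return "done"


states = run_pipeline(
    [
        ("extract", flaky_extract, []),
        ("transform", always_fails, ["extract"]),
        ("report", report, ["transform"]),
    ]
)

for name, state in states.items():
    print(name, state.value)
```

In this toy run, `extract` recovers on its third attempt, `transform` exhausts its retries and fails, and `report` is skipped because its upstream did not succeed. The resulting map of task names to states is the kind of information behind the "red and green lights" described earlier.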
[MT] Great. Alright, as we start getting questions from the audience, I want to highlight some, and then I’ll come back. Jack, do you want to read that question?
[Jack Cohen] Yeah, absolutely. Hi, everyone. I’m back. We have a great question from Eric, who was actually a speaker at our Code Driven event several months ago. His question is: What do you think is the biggest thing data engineers, scientists, analysts waste time on, and where is there an opportunity to build tools to address that?
[JL] Is that directed to anybody? I mentioned this earlier, but I’m more focused on the case where the theatre is burning down. And the reason we’re focused there is if you imagine saving somebody one hour, and Eric, I think Eric, in particular, will know this well, so maybe my answer is directed at others who may have the same question more than at him in particular.
If you imagine saving someone one hour orchestrating something, just getting that code to run versus saving a team one hour in a production crisis incident, the leverage of that hour, it’s the same hour, but the leverage, the impact, the emotional burden, the fact that revenue is at risk versus just the fact that you’ve got stuff to go is so wildly different. One of the reasons we focus there is that even if we only make a small impact, even if we only save that one hour, the compounded effect through an organization of bringing, we call it time to error, but perhaps that’s not meaningful outside of Prefect.
But by reducing that time to error, that time to discovery … I mean, so much of our product is around that one moment. And if you log in, and you see a bunch of green lights, see you later. You shouldn’t be using our product right now. We have nothing interesting for you, potentially. My team probably is angry that I said that, but it’s part of how we design it.
If you come in and something is wrong, and it’s something that’s important to you, our job is to deliver that information as quickly and concisely as possible because the number one frustration that I have certainly felt and many others is, “Oh, God. Something failed, but the cluster tore down the node and the logs weren’t properly archived. And I actually can’t find out what it was, and I can’t recreate it. And I don’t know when it happened.”
And so it’s really cutting that down at that moment of maximum anxiety. Delivering the confidence and clarity that we can is where we think we have the greatest leverage as a product, even though it may not be the single greatest waste of time of any individual. It’s an interesting distinction, and again, from our insurance mindset.
[Tristan Handy] The data analyst is that highly cross-functional role. They have essentially no organizational power. They exist to dive deeply into questions. And even that, they need to do in a cross-functional way and get support from technical folks to get datasets and all this stuff.
The biggest problem in their workflow is getting blocked. It’s like, “I have a thing on my to-do list, and I can’t make forward progress on it.” And there’s a million ways to get blocked as a data analyst. The one that we’re trying to really focus on is the technical one, being able to own the technology part from beginning to end. Even once you wholly eliminate that, you have classes of problems like, “Well, I need to add instrumentation, or I need to actually make business process changes as a result of the thing that I already told you last month. Until we do that thing, then none of my other work really matters.” If you take a team at a very good, data-forward organization, a team of 20 data analysts, my guess is that they spend 50% of their time blocked.
[JL] I can attest to that, certainly.
[MT] Cool, alright. Let me switch to another topic, and then we’ll take questions. We have over 200 folks in attendance. I want to leave plenty of time for questions. One topic I wanted to be sure we touch upon is building an open-source company. Both of you are building your companies with a very strong open-source component and a commercial product on top. I guess how does that work, and how do you think about what needs to be in the open-source and what needs to be in the commercial product?
[Tristan Handy] Can I answer that first because my answer is going to be a non-answer? We’re still early, and I don’t know. It’s a very, very hard question, and I think that it is something that … The entire B2B software industry has embraced open source as this wonderful formula. But the last generation of open-source companies used a particular model. In a lot of ways, I think the industry has said, “No, we’re looking for new models.”
We certainly have things that we are doing and that we’re testing out, but we are still very early as a company. The thing, Jeremiah, that I’m interested in hearing from you. I heard you on a podcast, Invest Like the Best. You talked about this insurance concept, and I love it. It’s been very instructive in my own thinking. I’m curious how uncertain… when you presented that it felt like you were just like, “I know all the answers. We’ve answered this monetization question.” My guess is that’s probably not as true as that.
[JL] Yeah, I think without saying the opposite of what you just said, what I think that you are hearing is we’ve spent a lot of time thinking about exactly this question that Matt just asked. I’ll give you what I think our answer is. One of the reasons we spend a lot of time thinking about it is coming from a non-traditional place and … I mean, look, there’s a lot of overlap between the story Tristan just told and the story that I’m about to tell, although we have gone in different directions. I’m not trying to say that we’re mirror images of each other here. But coming from a non-traditional place outside of Silicon Valley, there’s all these things where I think we’ve thought about things in a little bit of a different way.
One of the reasons, as I mentioned, that we started a company and even introduced an open-source project is we just had this knocking on the door of product-market fit. And so we spent a lot of the time when we were a seed-stage company building the company that we wanted to have in a few years, and investing in processes and business operations, and leadership training, and management frameworks, and these things that sound ridiculous. And I’ve had plenty of people tell me that we were ridiculous to pursue them at that time. But I think what you were hearing from me, Tristan, in that moment, is I have a decision-making framework that I really like, which allows me to have great confidence in things under limited-information sets. This is, again, this is all a skill set that I developed principally as a risk manager, where my job was to make decisions about extremely unlikely and uncertain information sets. One of the things that I’ve tried to do with Prefect is take some of that philosophy and those ideas into how we steer our company. And so what you were hearing, I think, in the podcast is, “No, I don’t claim to have all the answers by any means.” What I have is … What’s the phrase? Strong beliefs, weakly held, I think, is one version of it.
We codified that into how we can move forward. One of the principles we have in our set of standards is saying “I don’t know.” And if someone didn’t say “I don’t know” frequently enough, that would be very disturbing to us at Prefect. That would mean that we hadn’t tried enough things with some degree of conviction to learn that we were wrong and failed completely. And so I think that’s a little bit of what you were probably hearing from me there.
As far as the split, though, commercial and open-source, we didn’t require our open-source to be successful for our company to be successful. In fact, in that podcast, I think Patrick asked me at the end, “It sounds great. Shouldn’t everyone be open-source?” And I go like, “God, no.” It’s really hard. I mean, if you think about the number of successful companies that have actually iterated on this open-source, the number is very, very, very small, and it’s especially small relative to the denominator, which is much larger because the cost of entering that market is so low.
I think that the book is yet to be written on the successful commercialization of open-source. One of the things that we have tried to do at Prefect, and we admire extraordinarily, what’s happening in the DBT world here, is we’ve tried to use the fact that we seem to have built something people truly enjoy, and I think that’s because it really solves a real problem that they really feel, to create a flywheel that benefits all of our stakeholders so that our open-source users can benefit from work we do for our commercial customers. And in turn, our commercial customers, we know that they tend to come from the open-source. That’s their value discovery.
There’s obviously more dimensions than that, but we end up with this symbiotic relationship, two separate businesses, but nonetheless, intertwined. And we have to make sure that both of them are healthy. And so while we don’t require … Well, at this point, to be honest with you, the Prefect open-source project has taken off to such a degree that we do require it to be successful now. It is a huge part of our company and what we do, but we didn’t require that from day one because our goal was to build a company and solve a problem.
[MT] Is the idea that the commercial product should do something different as opposed to being, for example, a managed version of the open-source product?
[JL] I’ve been referring back to this podcast. This podcast was only about this topic if anybody is curious to hear my opinion there. As you can hear me saying there, I think that is a bad business model, and I’m on the record on that. And for a very simple reason, which is that a company that is solving a problem needs to find the correct way to express its knowledge and its solution to that problem, and running software literally cannot be … unless you are a company that sells CPUs, and there are public clouds for that, that is not the expression of that drive.
And so I think if you look at successful commercial open-source companies, you will see that their products are not just managed versions of an open-source product. They’re truly adding some layer of something. The best ones are actually completely different products from their open-source, such that … we like to think in our world of an engine in a car. We have an amazing engine, but we can spend all of our time building a Ferrari and if somebody just needs that engine and they want to drop it into whatever car they built, that’s amazing. That’s great. And if somebody wants to come over and get our seats and our steering wheel, we have that as well. But if all we were doing was releasing a car and then having a slightly nicer car, man, we better be damn sure that people want the chrome taken off the windows or whatever crazy little option they have.
I think that if you do not recognize the actual problem that you can solve commercially versus the actual problem you can solve in open-source, if they’re not sufficiently different, you end up in a place where you have a perverse incentive to take advantage of your open-source community or monetize them in a brutal way. And frankly, people aren’t dumb. They realize this. The name of the game is to communicate transparently, openly, and honestly what value you’re able to deliver as a commercial entity in addition to or next to what value you can deliver in an open-source form.
[MT] Great. Alright, let’s take one audience question, and then I’ll have more.
[JC] Alright, great. Let’s see. We have a bunch of great questions in here. Let’s do this one. Tristan, this is from Ned. Do you have advice about using DBT with tools like LookML and Cube.js that also do in-warehouse transformations but are geared towards the last mile of modelling for analytics consumers? Are there good ways to think about the boundaries between the two?
[Tristan Handy] I’ll answer for LookML. I’ve not actually heard of Cube.js before, and I am saving that in my browser right now. LookML does two things. Batch-based transformation happens before the analysis, and then there’s analyze time transformation. DBT and Looker are used together all the time to very great effect. We’re Looker partners. We ourselves, internally, use DBT plus Looker. The part of LookML that does batch-based transformation is not the primary focus of their product. It’s not the core innovation that they have. And so we use DBT to do all of our batch-based transformation, and then we use LookML to do all of the analyze time work.
[MT] There have been some good questions in the chat, so maybe let’s try some more. For example, from Bill, how do you feel about tools such as GraphQL to simplify and broaden the applicability of the data service that you publish?
[JL] We expose a GraphQL API as the Prefect API, and it can be great. It can also open the door to problems because, in some ways, you are exposing a great degree of flexibility. We work with a company called Hasura, who we’re big fans of, to build that, but you can put people in a position. On the one hand, you can offer people, “Look, you want to sort, you want to filter, you want to query, you can do whatever you want. It’s your data.”
But on the other hand, you’re inviting people to query a dataset that you may not have optimized for the access pattern that they have in mind, and that can create some complexity and user frustration. And so what we’ve found is that it’s fabulous for exploratory work, or ironically, well-defined queries, which is the opposite of the purpose of GraphQL.
But increasingly, we’re now moving toward … We’re still on GraphQL because we like the syntax, but increasingly, we’re moving toward more tightly scoped views of the data in order to make sure that we can deliver. As we start to have people that are creating millions and millions of items, we want to make sure that we’re delivering the proper performance. Sometimes we have to sacrifice flexibility to get there.
[MT] Great. And another question I like, and I was actually going to ask a version thereof, which is really about the flip side of empowering data analysts and, I guess, the rise of automation, which is both things you guys do. The question from Niall is “As data analysts are empowered to own more and more of the data journey, in what ways do you see the role of the data engineer changing? What will the engineering-aligned members of a data team be doing?” And my question: will they be automated away, or what happens?
[Tristan Handy] I think if you look at the history of self-service, you never find technical folks who are just like, “Oh, no one needs me anymore. My skillset is useless.” I think that there are so many problems in this space that data engineers shouldn’t want to be spending their days expressing business logic like how do you amortize revenue across multiple periods.
They are technologists. They should want to be thinking about platforms, and scalability, and all of these hard technical problems that, honestly, are underinvested in right now. I think that if there’s a single piece of writing on this topic that I still think is gospel, there’s a guy at Stitch Fix who wrote a blog post called, Engineers Shouldn’t Write ETL, back in 2016. And I still think that is the best point of view on this topic today.
[MT] Before we take another audience question, I’d love to chat a little bit about, I guess, emerging trends in the space. The setup being what we’ve discussed so far, which is, alright, the modern data stack, and there’s the machine learning stack. A lot of those things started in 2012, 2013, 2014, and have already accelerated over the last couple of years, and they’ve become mainstream.
I guess, what are the cool kids like you guys doing? What are they thinking about? Are there themes around … We talked about real-time a little bit and streaming. It’s something that people have been talking about for a while. Is that happening now – governance, self-serve BI, all these things? What are your favorite emerging topics?
[JL] Okay, sure. First of all, no one has ever referred to us as cool kids before, so thank you for that. That’s great. That’s nice to hear. In one of the earlier times we announced Prefect, it was like, “Well, this is the hot new workflow manager.” I was like, “That’s a phrase that’s never been said before.” It’s great that we can get people to feel that way.
What do we see happening? We’re actually really thinking about this. We’ve been thinking about this all year, actually, because we thought that we were putting something into the world that was the next step towards the machine learning, data science world that I come from that I feel is underserved from a production and engineering standpoint, but has this amazing ecosystem of tooling. So we got all these happy users and happy customers, and then they brought us this new set of use cases, where they’re clearly heading from a data analytics standpoint.
It made us realize we’ve pushed this batch, DAG, static workflow probably as far as it can go, from YAML and then into Python, and now with flexible tools like Prefect, for instance. And now we have this class of use cases that involve serverless, and runtime discovery, and streaming, which is a very loaded word. And I echo what Tristan said earlier. I’m not necessarily talking about millisecond latency streaming. I’m talking about purely event-driven. Runtime discovery, needing to take actions, and dynamic edges.
We introduced a feature about a year ago. It’s mapping, and to anyone in the analytics world, mapping is no big deal. But for folks in the orchestration world, this is actually a mind-blowing feature, and it’s by far the most popular feature of Prefect. And we can see that in our data, because it solves a real problem that I experience as a data scientist, which is that I have a bunch of things I need to work with, but I don’t know about them until runtime. And so, for about a year and a half, this has been the most popular feature we’ve ever built.
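To make the mapping idea concrete, here is a minimal sketch in plain Python. It is not Prefect's API; the function names (`discover_inputs`, `process`, `map_task`) and the file names are hypothetical, chosen only to show a task being fanned out over a list that is not known until runtime.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def discover_inputs() -> List[str]:
    """Hypothetical upstream task: the list of items is only known at runtime."""
    return ["orders.csv", "users.csv", "events.csv"]


def process(name: str) -> int:
    """Hypothetical per-item task; here it simply returns the length of the name."""
    return len(name)


def map_task(task: Callable[[T], R], items: Iterable[T]) -> List[R]:
    """Fan out one task run per item and collect results in input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(task, items))


# The workflow author never states how many items there are;
# the fan-out width is decided by whatever discover_inputs() returns.
results = map_task(process, discover_inputs())
```

The point of the sketch is that the shape of the graph (three parallel runs here) comes from data discovered during execution rather than from a static definition.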
It turns out, as soon as you introduce this form of dynamism to people, they’re like, “Oh. Well, we could do all kinds of other cool things, recursive things.” We can introduce cycles into our graph, which, for DAG-based workflows, is forbidden. We can have sidecars, which is where something is happening while the graph continues to process. Again, in our DAG-based world, that’s forbidden.
What enables this is bringing an API into play. And so, one of the things that we’re super interested in is the degree to which the API has become a way to move that business logic of the workflow into a more central place and allow us to say, “You know what? My workflow failed. I need to turn off its schedule.”
Sounds really simple. Like most of the stuff in our world, it’s really, really hard. We monitor our own uptime with a Prefect flow, so if we … Cloudflare was having all kinds of problems a couple of weeks ago, right? So there’s a situation where we couldn’t reach ourselves, and we started posting an incident, but the problem is, the schedule kept running. This is on a segregated instance, obviously, so it was able to run even though the internet wasn’t on. And we needed to turn that off, but we had no way to do that because this concept of reaching through time and turning off this meta orchestration layer didn’t exist.
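The "my workflow failed, turn off its schedule" rule can be illustrated with a small sketch. This is plain Python under stated assumptions, not how Prefect implements it: `Orchestrator`, `FlowRecord`, and `run_scheduled` are hypothetical names, and the central state is just an in-memory dictionary standing in for an API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class FlowRecord:
    """Central state for one flow (hypothetical structure)."""
    name: str
    schedule_active: bool = True
    failures: int = 0


class Orchestrator:
    """Keeps the 'what happens after a failure' rule in one central place."""

    def __init__(self) -> None:
        self.flows: Dict[str, FlowRecord] = {}

    def register(self, name: str) -> FlowRecord:
        record = FlowRecord(name)
        self.flows[name] = record
        return record

    def run_scheduled(self, name: str, fn: Callable[[], None]) -> bool:
        """Run fn if the schedule is active; return True only if it ran and succeeded."""
        record = self.flows[name]
        if not record.schedule_active:
            return False  # schedule was turned off, so the run is skipped
        try:
            fn()
        except Exception:
            record.failures += 1
            record.schedule_active = False  # the meta-level rule: failure disables the schedule
            return False
        return True


def failing_check() -> None:
    """Hypothetical uptime check that cannot reach its target."""
    raise RuntimeError("endpoint unreachable")


orchestrator = Orchestrator()
uptime = orchestrator.register("uptime-check")
first = orchestrator.run_scheduled("uptime-check", failing_check)
second = orchestrator.run_scheduled("uptime-check", failing_check)
```

In this toy version the second scheduled run is skipped because the first failure switched the schedule off, which is the behavior the speaker says was missing at the time.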
Where we’re working with our users now is a class of use cases that have to do with, “Yeah, we did the thing we wanted to do, but now there’s this new world of meta dependencies that we want to expose.” From our point of view, and I am talking our book a little bit here, it’s not so much the technologies they’re bringing to bear. We’ve always worked with data scientists and machine learning people who are doing crazy cool stuff. For us, it’s about seeing how they want those tools to work together in a way that, frankly, we didn’t anticipate. Starting really next quarter, we’ll start to introduce the features to enable that.
[Tristan Handy] I have a really fast answer there, something that I’m really excited about. This stack that we’ve been talking about this entire time tends to be used for analytics. It tends to be used to inspect data to make some human decision, and then use humans to go out and impact that decision in the world.
But increasingly, there’s a class of tools that takes the data that is in your data warehouse and pushes it back to operational systems, and I think that you will … As soon as that starts happening, you can automate the entire process. And all of these technologies become … The size of the opportunity, I don’t know if it doubles or it 10X’es, but you’re no longer just using it to make decisions. You’re using it to actually form the nervous system of your entire operating business.
[JL] Yeah, if this modern data stack gels, then the compounded effect it can have by having this fluidity of data can be extraordinary. I agree with that. You said something much better than what I said in a lot shorter time, Tristan, so I’m going to echo that.
[MT] Alright, so we’re at the hour, but we still have 157 people on this, so actually, people are enjoying this conversation. Why don’t we take one or two more questions from the chat, and then give it 10 minutes, and then we’ll wrap?
[JC] Okay. Awesome. There’s one from Josh here. It says: “We’re seeing a lot of recent focus on being able to manipulate data, masking PII as a simple example, in-flight before it lands at the data warehouse or data lake, even for schema-on-read products. Do you see this as a trend that will continue to accelerate? Are the pressures to do so purely regulatory, or are there other factors?”
[Tristan Handy] One thing that I will say with authority is that if you land data in your warehouse, it is very hard to ever have it go away completely. Those landing zones then get ingested and taken all over the place, and then it becomes a real challenge. And so I see the same thing that this question is asking in real-time and believe that is going to be one of the most compelling things that you could possibly do before the data lands in the warehouse in the first place. I think that exactly how one goes about doing that is not clear yet. I think that certainly there are very engineering-heavy ways to do this, but the easy solution to this, I don’t think, by and large, exists yet.
[JL] I can add a little bit, just from the regulatory point of view. We deal with an enormous number of customers in the healthcare and financial services spaces, and so that data is regulated either for privacy or other concerns. One of the ways that we approach this problem just as a business is we created a separation of where the data transformations take place and the fact that we operate only on metadata.
Once a business is HIPAA compliant or compliant with some regulatory things, you can either play entirely in that space, and that becomes a very expensive on-prem deploy if you’re a service provider, or you can’t work with them at all. And so that’s why we have this enormous effort to operate at a distance, exclusively on anonymized metadata, so we encounter this a lot. It’s a very complicated problem.
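As a rough illustration of the in-flight masking that Josh's question describes, the sketch below replaces PII fields with a keyed hash before records are handed to a loader. It is an assumption-laden example rather than a description of either company's product: the field names in `PII_FIELDS`, the `SECRET_KEY` value, and the helper names are all hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Iterable, Iterator

# Hypothetical configuration: which fields count as PII, and the masking key.
PII_FIELDS = {"email", "phone"}
SECRET_KEY = b"replace-with-a-managed-secret"


def mask_value(value: str) -> str:
    """Return a deterministic keyed hash so joins still work but the raw value never lands."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_records(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Mask PII fields record by record, before anything is written to the warehouse."""
    for record in records:
        yield {
            key: mask_value(value) if key in PII_FIELDS else value
            for key, value in record.items()
        }


incoming = [{"id": "1", "email": "ada@example.com", "plan": "pro"}]
landed = list(mask_records(incoming))  # what the loader would actually receive
```

Because the hash is deterministic for a given key, the same email always maps to the same token, so downstream joins and counts remain possible without the raw value ever reaching the landing zone.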
[MT] I’ll take the next one, so many good questions. One question from Jill. Can you speak to the balance between the value of data democratization and the importance of data reliability and governance, particularly for large enterprises? Philosophically, what parts of the data stack do you think should be more democratized versus more controlled? What features and processes can you build into tools that democratize data to maintain data reliability and governance?
[Tristan Handy] I love this question because the assumption that we always run into is that democratization and data quality or data reliability are going to be in tension with one another. And I think that is a natural assumption to make, but if you look at the population of software engineers that exist today and push code to production applications, that has, I don’t know what, more than 10X’ed in the past 10 years.
The way that has happened is not that you have extremely tight controls on who can push to production. What you do is you have mature CI/CD processes. You have DevOps workflows. You build these guardrails that create high-quality processes around code releases to production where you have your cake and eat it too. You have democratization. You also have governance.
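One way to picture such a guardrail for data work is a set of automated checks that must pass before a change is released. The sketch below is a generic Python illustration, not DBT's testing feature; the check functions, the `order_id` column, and the sample rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def check_not_null(rows: List[Row], column: str) -> List[str]:
    """Report every row where the column is missing or None."""
    return [
        f"row {i}: {column} is null"
        for i, row in enumerate(rows)
        if row.get(column) is None
    ]


def check_unique(rows: List[Row], column: str) -> List[str]:
    """Report every row whose value for the column was already seen."""
    seen = set()
    failures = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            failures.append(f"row {i}: duplicate {column}={value!r}")
        seen.add(value)
    return failures


def run_checks(rows: List[Row]) -> List[str]:
    """The guardrail: a CI job would block the release if this list is non-empty."""
    return check_not_null(rows, "order_id") + check_unique(rows, "order_id")


good_rows = [{"order_id": 1}, {"order_id": 2}]
bad_rows = [{"order_id": 1}, {"order_id": 1}, {"order_id": None}]
```

Anyone can propose a change under this arrangement; the checks, rather than a gatekeeper, decide whether it reaches production.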
[MT] Alright. One last one. It’s sort of selfish. I’m enjoying this conversation so much. I want it to keep going, but I guess we’re going to be at like 6:10. Jack, do you want to pick the last one, and then we’ll call it a night?
[JC] Yeah, sure. A lot of pressure on the last one here. Let’s make sure it’s a good one. Let’s see. Oh, this is a good one. This is from Kaya L. to close this out. “On the topic of automating business operations from the modern data stack, what do you think are the biggest challenges to making that shift from the current analytical focus?”
[Tristan Handy] Gosh, I don’t know. That’s beyond the event horizon. Seriously, I think that there’s going to be a singularity type event when people start doing this thing more, and there are going to be emails sent to large groups of people accidentally because somebody wrote a SQL query wrong. There’s going to be all kinds of bad things that happen, but we have to screw up to learn and improve. I don’t think that we can know, but it feels very obvious to me that that must happen. It’s not like the cycle of history goes towards more manual processes. That’s not how the world works.
[JL] I mean, I certainly agree. When you think about businesses transitioning, I think that’s the interesting, thought-provoking part of this. I think we very frequently see young companies just adopting it because it makes sense. I think we see greenfield exercises moving into it. I think it’s funny to think about it like a business transitioning because it’s hard to … Migrations are really hard, especially when we’re talking about such a key piece of the infrastructure, and so that’s the part that I’m paused on right now.
I don’t know how one would pitch that successfully. I think you would really … I think it would not be an easy thing to demonstrate in a large enterprise that you should just switch. I think you would need to see the gradual adoption for new business, demonstrate an ROI. Forget automated business processes. You have to use business processes to demonstrate the value of this, which to those of us who have the luxury to pick and choose today, this is a no-brainer. Why in the world would you not choose the right tools to achieve your purposes here? But I think that’s a challenging transition. I think that’s the keyword there.
[MT] Alright, folks, everyone, we should probably call it a night as tempted as I am to keep going. Look, this was absolutely fantastic, really appreciate both of you spending time with us and with this community. Congratulations on all the success. Both of you guys are doing incredibly well and are some of the clearly emerging companies in this space. Tristan, congrats again on the raise announcement today.
[Tristan Handy] Thank you.
[JL] I love it.
[MT] Excited for what comes in the roadmap and all the announcements in the future weeks and months. That’s it. That’s a wrap. Thank you very much. Thanks, Jack. And thanks to everyone who attended this event. We’ll have a video soon, in the next few days or week. And everyone, have a good night, especially the international folks that attended from Europe and New Zealand. I don’t even know what time it is there.
[JC] Good morning to India as well. Yeah, thanks to everyone who joined.
[MT] Really appreciate it.