One of the biggest trends in the data world recently has been the rapid emergence of the “modern data stack”.
This stack is largely centered around the cloud data warehouse, with its massive scalability and elasticity capabilities. Snowflake’s blockbuster IPO this week, and the underlying performance of the company, demonstrate the level of excitement from both customers and investors about the data warehouse.
But the modern data stack is more than just the data warehouse: there’s a whole pipeline involving other technologies, where data gets collected, stored, and analyzed. Downstream from the data warehouse, you find business intelligence solutions, as well as some machine learning platforms, to analyze the data. Upstream from it, you find solutions that focus on extracting data from various sources and loading it into the data warehouse (ETL/ELT).
This is where Fivetran comes in. A fast-growing company with unicorn status, it automates data integration from source to destination through a large library of connectors.
It was very fun to host Fivetran’s CEO, George Fraser, at our most recent Data Driven NYC event. We had a great conversation, both very approachable for a non-technical audience but also interesting for more technical folks.
The video is below, as well as a full transcript.
As always, Data Driven NYC is a team effort – many thanks to Jack Cohen for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!
By way of quick introduction to get some of the facts out of the way: Fivetran is a leading provider of automated data integration, and we’re going to talk about what that means. The company is based in San Francisco (which has very interesting orange skies today!). You guys have raised $163 million from Andreessen Horowitz — by the way, thank you to our friend Martin Casado, who is a previous speaker at this event, for making the introduction in the first place to George, so that George could join us today — as well as General Catalyst, Matrix, and for folks who care about these things, you have a unicorn valuation of $1.2 billion, last reported valuation. So congratulations on all of that. I’d love to start at a pretty high level with a bunch of definitions to make this approachable and interesting to a broad group of folks, and then we can dive into more technical details as need be.
That sounds great. I just want to make one minor correction to that wonderful introduction, which is Fivetran is based in Oakland, not in San Francisco.
Based in Oakland? Okay. Very important. As we were discussing, it’s an interesting place to recruit these days, you were saying?
East Bay is a great place to be.
Okay. Great place to be. In terms of definitions, there is this concept of the modern data stack, or the modern data pipeline. A lot of people talk about it, and I’ve seen it on your website. I’d love to start at the high level, talking about what that means, perhaps starting with the concept of the data warehouse: what is it, and what does it do?
Yeah. Well, the concept of the modern data stack is something we talk about a lot at Fivetran. The problem of data management and data analysis is ubiquitous. Companies have been doing it for decades. Even questions as simple as, “How much money did I make last month?” can be very complex in a large organization with a lot of systems. So there are a lot of tools that people use to manage those systems, and the big picture of the modern data stack is that the tools that you use to manage data and analyze data have actually gotten simpler over the last 10 years. You don’t need as many different things, because a few tools, most importantly the data warehouse, have gotten so much better over the last 10 years that you can use them as a Swiss army knife. They can do everything well enough that you don’t need to buy these things called cubes and all these other tools that have existed for decades. A lot of them are just going by the wayside because data warehouses, in particular, have gotten so good.
To answer the second part of your question, what is a data warehouse? A data warehouse fundamentally is just a special database inside your company that you have designated to hold all of the data about everything that has happened at your company and to support data analysis. So the purpose of your data warehouse is not to show you how many unread messages you have when you open your app. That’s your production database that’s going to do that. The purpose of the data warehouse is to tell you how many unread messages everyone had yesterday at 9:00 AM. Those kinds of questions.
Maybe to put that in context further, so just some of the names of famous data warehouses, and a lot of folks are going to know this on the Zoom, but what are some examples of the main data warehouses?
Yeah. The most popular data warehouses today would be Snowflake, who everyone’s talking about right now because they’re about to IPO; Redshift, which is a data warehouse that you can buy from Amazon Web Services. Redshift was incredibly important because it came out in 2013. It was in the AWS console. It wasn’t the first really good, fast data warehouse, but it was the first one that was cheap. So a lot of people bought Redshift who previously would not have been able to buy one of the enterprise data warehouses that existed before that. Then Google BigQuery is another important data warehouse today that a lot of companies use.
Great. What happens after the data warehouse in terms of analysis? I mean, there’s, I guess, the BI world and there’s the machine learning world. What do people do after the data warehouse in the pipeline?
So data warehouses are database management systems. So fundamentally, anything you can do with data, you can do with a data warehouse. In practice, the most common use of data warehouses is to support business intelligence dashboards. So these are dashboards… you’ve probably seen them if you’ve ever worked at a big company. They have bar charts and line charts and things like that. They tell you what’s going on. How many support tickets were filed this week? How many bookings has the sales team done? What is the average response time of the website? Or if you’re in the automobile industry, you might care about what is the average value of our total inventory from our suppliers, which is something we’re trying to minimize. It’s always very business-specific, what your key metrics are, but those are generated from data in the data warehouse, and then they’re presented in a dashboard of a BI tool like Tableau or Looker or Microsoft Power BI, or you name it.
So that’s definitely the most common use case of data warehouses, but then you can really do anything with them. We have customers who run billing out of their data warehouse. There’s all kinds of use cases that can happen with data warehouses, because at the bottom of it, they’re just databases that are designed to support large queries, that scan lots of data.
Great. All right. So now let’s switch to the world where Fivetran operates mostly, which is the world of what happens before the data warehouse. How do you get data into the warehouse, and maybe talk about ETL and the evolution of ETL into ELT and maybe explain what that means.
Yeah, so a data warehouse doesn’t do anything without data in it. So the first step to doing anything useful with the data warehouse is you have to move the data from all the places that it lives. For example, Drift, who we just heard from. You have a bunch of data in Drift. If you want to incorporate that data into centralized reporting, you’re going to need to make a copy of that data in your data warehouse and then continuously keep it up to date. The same goes for your Salesforce, your production database, your marketing tools, your finance systems, your inventory systems, every system in your business. To achieve that goal of making the data warehouse the place that has the definitive copy of everything that has ever happened, you need to move all that data into the data warehouse.
That’s where Fivetran comes in. So at its core, Fivetran is an automated data replication system. We replicate all the data from all those systems where it lives into the data warehouse. The key thing that makes us different from what came before is that we have a quasi-religious focus on automation. The underpinnings of doing that are very complicated. It is a mammoth task to be able to do, essentially, change data capture out of every single application, every database, Oracle, MySQL, PostgreSQL, you name it, in any configuration that the user can make, into any data warehouse. It’s this gigantic, extremely wide software engineering problem.
But the key thing that we do is we hide it all from you. So it’s very hard for us. There’s a lot of very hardworking software engineers at Fivetran who focus on this and effectively reverse engineer every business tool on the internet. But you don’t have to see any of that. From your perspective you just connect the source, connect the destination, and you see a complete replica of the data show up in your data warehouse.
Now, this all sounds great, but software engineering is all about trade-offs. There is a trade-off of doing it the way we do it. In a traditional ETL pipeline, in a pre-Fivetran world, you wouldn’t just replicate the data from business systems to your data warehouse. You would actually transform it. That’s where this acronym ETL comes from. Extract, transform, load. So maybe you wouldn’t replicate all of the data. You would only replicate the things that were relevant to the analysis that you know you’re going to do. You might transform the data by doing some joins or some aggregations to put it in more of a curated format that’s going to be more efficient to query.
Basically, you would do a bunch of optimization of the data on the way from the source to the data warehouse. With Fivetran, because we’re so big on automation, we not only don’t do that, we really can’t do that, because it’s different for every customer. There just is no way to automate that process. It would destroy our vision of just this power cord that you plug into the source and you plug into the destination and there’s nothing else to talk about. So that means when you use Fivetran, or if you use another similar tool, or if you build your own data pipeline that follows Fivetran-like principles, it’s not going to be as optimized, right? You’re going to replicate everything, even data that is not actually relevant to the analysis you’re doing in your data warehouse. The data is going to arrive in a normalized schema. So it’s going to be well organized and clean, but it’s not going to be particularly optimized for the queries you want to do in your data analysis. So that means the data warehouse is going to have to do more work in order to get the data from the format we deliver it in into the lovely bar chart that tells you what you need to do tomorrow when you get up at your business.
It’s a trade-off, but fortunately the data warehouses now are so fast and so cheap that it’s a very easy trade-off to accept. The additional cost of compute and storage to replicate the extra data and to do those transformation steps inside the warehouse is so small now. It’s less than what it’s going to cost you to pay your data engineer for a week to build you that custom data pipeline. So it’s a very sensible trade-off to make in the world that we live in today.
We often term this process to contrast it with ETL, we and others will often refer to it as ELT, the idea being that you extract the data from the source, then you load it into the destination in a normalized schema, and then you do your own specific transformations after it arrives in the data warehouse.
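The extract-load-transform sequence George describes can be sketched in a few lines. This is a minimal illustration, not Fivetran's implementation: SQLite stands in for a cloud warehouse, and the `orders` table and its columns are hypothetical.

```python
import sqlite3

# Stand-in "warehouse"; in practice this would be Snowflake, BigQuery, or Redshift.
warehouse = sqlite3.connect(":memory:")

# Extract: rows pulled from a source system (hypothetical data).
extracted_rows = [
    ("o-1", "acme", 120.0, "2020-09-01"),
    ("o-2", "acme", 80.0, "2020-09-02"),
    ("o-3", "globex", 200.0, "2020-09-02"),
]

# Load: land the data as-is, in the source's normalized shape. No transformation yet.
warehouse.execute(
    "CREATE TABLE orders (id TEXT, customer TEXT, amount REAL, created_date TEXT)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", extracted_rows)

# Transform: run the business-specific aggregation inside the warehouse, after loading.
daily_revenue = warehouse.execute(
    "SELECT created_date, SUM(amount) FROM orders"
    " GROUP BY created_date ORDER BY created_date"
).fetchall()
print(daily_revenue)  # [('2020-09-01', 120.0), ('2020-09-02', 280.0)]
```

The point of the ordering is that the raw, normalized copy lands first; every customer-specific transformation happens afterward, inside the warehouse, where it can be changed without touching the pipeline.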
So you’ve built an entire library of connectors. How many do you have?
So we have about 150 connectors today, depending on exactly how you count. Some of them overlap in what they do, but we have about 150 connectors. They fall into a few major categories. There are database systems like MySQL, PostgreSQL, Oracle, MongoDB, you name it. There are apps like Salesforce or Drift or Marketo or NetSuite. Then we support some file data sources, so things like Dropbox or S3. And we support events, so you can send events directly to us, like webhooks, and we’ll just load those into your data warehouse. Schematize them and load them.
How does the automation part work? Do you ping the sources at regular intervals? How does that work?
It’s different for every source. That’s what makes it such a mammoth task to build connectivity to all these different data sources. For every single source we support, we have to go in and we have to understand how the API works. We have to understand how the underlying schema of that system works. Then we have to figure out a sync strategy for every source. We have to figure out a change data capture strategy where we’ll call up this API endpoint and say, “Hey, tell me what changed,” and then it’ll give us a bunch of data. Then often there’s weird rules about how this column represents the time it was modified, except when this happens, and then it’s something else. Sometimes I joke that the real value of Fivetran is all the millions of if statements inside of our code that map out all of these scenarios.
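The “tell me what changed” loop can be sketched as a cursor-based incremental sync. This is a toy illustration of the general pattern, not Fivetran’s code: `fetch_changes` is a hypothetical stand-in for a source API endpoint, and the records are made up.

```python
# Hypothetical source system: each record carries a modified-at timestamp.
SOURCE = [
    {"id": 1, "name": "alice", "updated_at": 100},
    {"id": 2, "name": "bob", "updated_at": 205},
    {"id": 3, "name": "carol", "updated_at": 310},
]

def fetch_changes(since):
    """Stand-in for an API endpoint that answers 'what changed since this cursor?'"""
    return [r for r in SOURCE if r["updated_at"] > since]

def sync(destination, cursor):
    """One round of change data capture: pull changed rows, upsert, advance the cursor."""
    for row in fetch_changes(cursor):
        destination[row["id"]] = row             # upsert by primary key
        cursor = max(cursor, row["updated_at"])  # remember how far we've synced
    return cursor

warehouse = {}
cursor = sync(warehouse, 0)  # initial sync pulls everything
# The source changes; only the new or updated rows come over on the next sync.
SOURCE.append({"id": 2, "name": "bob jr", "updated_at": 400})
cursor = sync(warehouse, cursor)
print(cursor, warehouse[2]["name"])  # 400 bob jr
```

Real sources complicate every line of this: the cursor column differs per system, it sometimes lies (the “weird rules” George mentions), and deletes often don’t show up in the changed set at all, which is where the per-source special-casing comes in.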
So a big part of it is just elbow grease, working away on all these different data sources. Databases in particular are crazy. You could make a whole company just out of syncing one database, and we sync, I think, seven different major databases at this point. They have these very complex change log formats. But then there is also a shared platform that we have internally. So over the years, we’ve discovered a lot of fundamental principles of what makes connectors reliable, what makes it easy to build and maintain these connectors.
So if you’re a Fivetran software engineer, when you build a new connector, it’s not like you just sit down and start a new Java project, mvn new. No, you’re writing against this internal API that solves a lot of the shared problems that all data sources have. That internal API is always changing. There’s always another iteration of it coming where we discover new things. We derive a lot of leverage over time from that sharing of code and of discovery of these fundamental principles of what you need to do to make good connectors.
You mentioned the word reverse engineering. Presumably you have to partner with a lot of those, right? I mean, you have to work hand in hand with the source to figure it out, or is that something you can do without?
We do that more now. So in the beginning, we were on our own. We were small. Nobody wanted to talk to us. Sometimes it was a battle just to get them to even give us access to the API. So in the beginning, we were just on our own. We would go and read the API docs. We would set up a test instance and do experiments, play around, try to break it. I would always joke that we were just reverse engineering the company’s API back to the normalized schema and the database underneath.
Then we would put customers on it and they would break things and they would call us up and be like, “Hey, this row in my data warehouse doesn’t match the source.” The really critical thing we did that laid the foundation for our eventual success was that we said that it was our responsibility to make the data match, which is actually unusual in the field of data integration.
Most data integration tools, they see themselves as a platform, right? So they give you all of this toolkit, all these Legos, and they say, “Yeah, but it’s ultimately up to you to achieve correctness. You need to assemble this thing together. We’re going to call the API endpoint and we’re going to load the row, but if the data doesn’t match, that’s your problem.” Whereas we said, “The data will match. The schema will exist in your data warehouse and whatever you do in the source, somehow we will find out about it and port it over.”
That meant every time for years there was a mismatch, that was a bug report and we would go figure out what the heck happened and fix it. The great thing was that those fixes were cumulative. So our connectors have gotten better and better and better over the years. Then the other thing that’s happened is the organization has gotten bigger. There’s more engineers. We’ve learned from this experience. So we’ve gotten better just at reading the API docs and figuring out what’s going on beforehand. We’ve also gotten better at getting a set of alpha customers to be the first ones to try it, to help us make sure we didn’t miss any corner cases.
Then lastly, what you mentioned a moment ago, we do often actually just work with the source directly. We just get on the phone with one of their software engineers and basically say, “Hey, tell me how your database is set up. If I call this API endpoint and I give you this query parameter, is that going to get all the new data? If not, how do I get all the new data?” That’s always our question.
Then sometimes they actually changed the APIs for us. We’re working on publishing more about what you need to do to make an API that’s friendly to replication for data warehousing, whether it’s by us or one of our competitors, or just customers building their own data pipelines. We’re trying to put this out into the world and say these are the characteristics that make a successful API. There’s some material about that on our blog now under the header of the Fivetran protocol. So we’ve written one blog post, which is a technical guide to exactly how this works, and then there’ll be more coming that are more like, “Why is this valuable? Why should people do this?”
Presumably that needs to work in all environments, right? Whether the source is cloud based, on-prem, hybrid. Does that work everywhere? Are there unique challenges for the different situations?
The API data sources we support are all cloud-based apps. There are a few things like JIRA that you can in principle deploy on-prem. I don’t know whether any of our customers are actually running JIRA on-prem. It is less common than maybe it once was. There are a few other data sources where, in principle, they might actually be running on-prem. Well, there’s no way for us to tell. It’s just a URL. Then we support databases. We do actually sync a lot of on-prem databases. So we have a lot of customers where their production systems are still on-prem and the databases are running there, but their data warehouse is in the cloud.
The data warehouses we support are all in the cloud. They’re always in the cloud. The reason is the cloud based data warehouses are just so much better. There is a reason why Snowflake is about to IPO for a bajillion dollars, and it’s because how data warehouses-
Exactly. I heard it was two bajillion. It’s because cloud data warehouses are so much better at just even… I mean, they’re a particularly good one, but then the category is just so much better than the previous generation of data warehouses, because if you think about it, what is data warehousing all about? It’s all about storing lots of data. Like I said a second ago, you want to store everything that ever happened in your business. Maybe some of it you’re not even querying right now, but you still want to store it because you might want to query it someday.
Then the other thing you want to do is you want to run analytical queries where you’re going to scan tons of data. You’re going to scan the history of everything that ever happened in order to answer certain kinds of questions. So that means you’re going to need tons of storage and you’re going to need tons of compute, but only sometimes. Now, what is a good place to get a lot of storage that you can get just whenever you need it and to be able to grab a lot of compute whenever you need it? The cloud. So for that reason, cloud data warehouses are just way better than the previous generation of on-prem data warehouses.
We talked about transformation a minute ago. I see you have a product now, right? You have Fivetran Transformations. Is that correct? How do you go about it?
Yeah, that’s right. So when Fivetran delivers the data, it’s going to be in a normalized schema that is a sensible schema. The data’s clean. There’s not duplication of the same information across tables. Everything’s up to date. But that schema is not going to be customized to your particular needs, right? It’s going to be the native schema of whatever the data source is. For you to do anything useful with that data, you’re going to need to transform it, typically into a dimensional schema, is what most companies will do. They’ll turn it into a dimensional schema, which, if you’ve never heard of it, it’s basically a simplified view of the data where you make some simplifying assumptions knowing what kinds of analysis you want to be able to support later. So everyone’s going to have a different dimensional schema.
Somehow you need to orchestrate this transformation, right? As a practical matter, an analyst is going to write a bunch of SQL queries that transform from one schema into another. But somehow you need to store those SQL queries somewhere, you need to keep track of them and review changes to them, and then you need to actually run them. So how to do this has been somewhat of an open question for the last few years. We’re not the only ones who have been pitching this ELT, modern data stack, but it was a little bit the Wild West of how do you actually organize all this SQL that you’re going to write to transform and curate your data? Over the last couple of years, this great tool called dbt has really emerged as the way to orchestrate transformations in a data warehouse. It’s an open-source project with a community around it; it’s also a product, and there’s a company called Fishtown Analytics that’s the primary sponsor of dbt. So we’ve really thrown our support behind dbt as an approach. It doesn’t mean we’re not also supportive of other ones, but we’ve placed our bet that this is the way most of the kinds of companies that buy Fivetran are going to solve this problem. So you’re seeing, for example, that we’re now writing packages in dbt format, which are basically pre-written transformations to help get you started on writing your own transformation of the data that we deliver.
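A toy version of that in-warehouse transformation step: a normalized pair of tables, as a replication tool would deliver them, turned into a simplified summary table with a single join-and-aggregate query. This is only a sketch with a hypothetical schema, using SQLite as a stand-in for the warehouse; in a dbt project, the SELECT would live in a model file and dbt would handle materializing and re-running it.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Normalized tables, in the shape an automated replication would land them
# (hypothetical schema, clean and deduplicated but not analysis-ready).
warehouse.executescript("""
CREATE TABLE customers (id TEXT PRIMARY KEY, region TEXT);
CREATE TABLE orders (id TEXT, customer_id TEXT, amount REAL);
INSERT INTO customers VALUES ('c1', 'east'), ('c2', 'west');
INSERT INTO orders VALUES ('o1', 'c1', 50.0), ('o2', 'c1', 25.0), ('o3', 'c2', 40.0);
""")

# The transformation: a join plus an aggregation, run inside the warehouse.
# In dbt, this SELECT would be a model, materialized as a table or view.
warehouse.executescript("""
CREATE TABLE revenue_by_region AS
SELECT c.region, SUM(o.amount) AS revenue
FROM orders o JOIN customers c ON o.customer_id = c.id
GROUP BY c.region;
""")

rows = warehouse.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(rows)  # [('east', 75.0), ('west', 40.0)]
```

The curated `revenue_by_region` table is what the BI dashboard queries; because it is derived entirely inside the warehouse, the analyst can change the logic without touching the replication pipeline.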
Great. What’s a Powered By Fivetran? That looks like the most recent product announcement, at least I saw. What does that do?
Powered by Fivetran is a way for you to embed Fivetran into your own application. So I’ve talked a lot about data warehousing and business intelligence. One way to solve this problem is to build your own data warehouse, hire a bunch of analysts who write SQL queries, and then use a tool like Tableau or Looker to build dashboards that you present internally. Lots of companies do this, and lots of companies are going to continue to do this. However, it’s a huge amount of work. It’s incredibly expensive, not so much because of the tools, but because of all the people.
And so, you’re going to run out of steam. There’s only so much you can do, especially, at a smaller or a mid sized company. And the other way to solve this problem is through vertical integration. So there have been for years companies that focus on data analysis in some domain, there’s a lot of them in marketing, but there’s also companies that focus on data analysis for consumer packaged goods. There’s a Fivetran customer who focuses on data analysis for dentists. Okay, because you know what? There’s like tens of thousands of dentists, and they’re not going to hire analysts to write SQL queries for them. But there’s still a lot of useful insights that they can derive out of their data.
So what Powered by Fivetran is, is a way to embed Fivetran into your own application so that you can offer all of Fivetran’s connectors. And then the idea is that you also embed a data warehouse, so you could choose some database to power whatever you’re building. That might be Snowflake, it might be BigQuery, it might be Redshift. It might be something else; we support a lot of different relational databases as targets. And then you build your own software on top of that, that does some kind of domain-specific, vertically integrated, set-it-and-forget-it data analysis. And you can accomplish really great things by doing this, because if you’re a product company building a data analysis product for a specific domain, you can put way more work into it than any company ever would just hiring their own analysts to analyze that one data set.
We have lots of customers who are doing this already. There’s a lot of cool products out there. Some of them you won’t even know that they’re Powered by Fivetran under the hood, and I think there’s going to be a huge wave of this over the next few years. I think this is a giant trend in the data technology space, is vertical integration. Both because there’s just like a centralization of effort, but also because it allows you to accomplish things that are just impossible in a traditional data warehouse, BI context, especially if you start to talk about machine learning type use cases. That’s just not realistic for most companies to actually build and deploy that themselves.
But if someone comes along that can offer a prepackaged solution that does all the pieces from soup to nuts, you can actually bring some of that stuff that companies like Netflix do to a wider audience by packaging it like that. And we want to be part of that. We’re not going to go build that, but Powered by Fivetran is our way to provide this component, to say: look, we will solve the access-to-data problem for you. You can launch on day one with connectors to everything that Fivetran supports. It’s up to you to figure out how to turn that into awesome insights on the other side.
Very interesting. I mean, I may have a customer or two for you.
Yeah, it’s funny, venture capitalists are our most effective source of referrals actually for the Powered by Fivetran program, because they have a lot of portfolio companies that are building businesses that meet that description in one way or the other. We’re always happy to help.
How about that? VCs actually helping with something, interesting.
VCs are great. I don’t know why they get so much guff on Twitter. It’s unfair.
That’s a whole different conversation. Speaking of which, maybe let’s talk about go-to-market a little bit, and let me make sure I cover this before we switch to the questions. So for something like this, who’s the buyer, who are the users, where do the budgets come from, and how do you sell to them?
Well, it depends on the scenario. If it’s a Powered by Fivetran scenario.
In the regular case.
The product team or the founders of a company. If it’s a traditional data warehousing BI customer, then there are a couple of scenarios. Sometimes it’s a line-of-business user, usually in marketing. So marketing data is often the initial use case for Fivetran at a large company. And the reason is that with marketing data, there are so many data sources, and they’re all constantly changing because your team is always adopting new tools and changing the configuration of these systems. And so, for many companies, they just have never successfully gotten all of their data into their data warehouse. So it’s still kind of a greenfield in that sense. And even if they have some of it, there are other data sources that are missing. So that’s often the starting point for us within a company: the VP of Marketing is trying to figure out their customer acquisition cost across all their channels, and they need those other six data sources in the data warehouse in order to do that, and Fivetran can come in and solve that problem for them in an hour. Which is amazing and a great place to get started. Then we go from there into other departments. Maybe that means the production database. Maybe that means the event stream of what your users are doing on your website or what your users are doing with your product out in the physical world. You name it. I mean, just any data use case within the company.
Interesting. You get the motion started from the business side. And obviously the trend of the last few years has been that marketing folks are becoming data folks, or the new CMO is not a brand person, it’s a data person. But are you finding that you need to do a lot of education, not about the need that they perceive, but about what needs to happen next? Do those folks understand what a data warehouse is, what a connector is, and all those things? How does that work?
Usually by the time they’re talking to us, they know what a data warehouse is. They know what a BI tool is. And Fivetran is arriving to solve an access to data problem that they need to solve in order to have their dream dashboard complete. Sometimes I joke that we’re sort of like the plumbers building the house. Like they go to the general contractor first and that would be either like the data warehouse or the BI tool. And then like, they kind of work their way through the project and they’re like, “Okay, well now the toilets need to work. Now we’re going to hire Fivetran.” Not that I’m diminishing the gloriousness of what we worked on.
Your parents must be so proud.
We have a utilitarian ethos at Fivetran. I mean, you’ll find this, if you talk to people who work at Fivetran. We understand that Fivetran is fundamentally a utility. And the thing that you care about most is that we have connectivity to all your data sources, and that it’s working. Like we don’t sit around and try to get you to spend lots of time logged into Fivetran.com. That’s not the goal. Like the goal is that the pipes work.
Great. Just maybe one last category of questions before we switch to Q&A. I found some things online about the history of the company and some of the early years that I thought were really interesting. There’s a bunch of folks that attend this Data Driven NYC event that are entrepreneurs or are thinking of building companies, and I thought it was interesting to hear the stories. So first of all, I read or heard somewhere that you and your co-founder are childhood friends, or your families have known each other for generations. What’s the story?
Yeah we’ve known each other a long time. So we’re not related, but we knew each other growing up and we’d see each other in the summers. Our families had cabins on the same lake. And in fact that’s been going on with our ancestors for just about 100 years now. So we go way back, all of our relatives know each other. And there were two really important consequences for the company of that long standing relationship.
One of them was, as I mentioned it took us a while… Well as I think you mentioned, it took us a while at the beginning, we’ve been around for a while. We started at the end of 2012. It was a couple years before we got our first customer, there were many iterations before we got to our first customer. So it took a lot of persistence to get there. And it took a lot of trust. When you’re going through a multiyear journey to product market fit, you need to hang on to that trust that the other co-founder is still in it like you are. Really easy for that to erode when you go years without having success. And there was that really deep well of trust for us because we had known each other for so long.
The other really powerful thing that we got from that long standing relationship was some serious fear of failure. Fear of failure is an underrated motivator. Let me tell you. But when all of your relatives know that you’ve started this company and you see all these people every summer at the lake, you realize that you had better make it succeed, or you are going to be hearing about it for the rest of your life. So I highly recommend that to people when you start a company just tell everyone, just get yourself as far out on a limb as you possibly can, because then you’ll just be terrified that you have got to make this thing succeed, or you’ll never hear the end of it.
Wow. And at some point I read about that journey to product-market fit as well. To put it in context for people, the company started in 2012, and VC financing is no indication, except that VCs tend to invest when things are starting to work out. So I think the first round was in 2018, and then you did multiple rounds back to back, Series A, B, and C, on compressed timelines. So there was a period of about five years where you were two years building and then several years iterating to get to that stage. Is that the kind of timeline we're talking about?
Yeah, that's right. We raised a modest amount of money from angel investors way back at the beginning, a few hundred thousand dollars. That's what sustained us for those first couple of years: we did not pay ourselves a lot, we didn't spend a ton of money on AWS, and we only had one other person who joined the company in that first phase. The first significant round was a seed round in 2017, from a family office called CEAS. And then there were the Series A, B, and C in fairly short succession, because we started to grow so fast. One of the funny things I learned about fundraising from that is that if the company is growing really quickly, this funny thing happens: even if you don't spend the money, that same pile of money looks smaller and smaller compared to the size of the company and the amount of money that just goes whooshing through every month.
So no matter how capital-efficient you are, the faster you grow, the sooner you need to raise again, unless you want to just sit there with one month's payroll in the bank account, which I don't think you want when you have a lot of employees. So it's this funny paradox of fundraising: the timing between rounds kind of doesn't matter. Whether it's a big Series A or a small Series A, if you grow fast you're going to do the next round before long, which was not obvious to me in advance. And yeah, as you said, there was a long journey at the beginning. We originally were working on a more vertically integrated tool, a lot like the kinds of things I talk about with Powered by Fivetran. We were trying to do something like that way back in 2013, 2014 and didn't succeed. None of them were really any good.
But we did build our own connectors as part of that, and we came to understand how poorly solved the data connectivity problem was. All the tools that existed at the time just didn't really solve the problem. They were toolkits that you would use to build your own connector, and that's not what you really want. You want the data from here to there. You don't want a bunch of pipes and wrenches that you can use to build your own plumbing; that's just not that helpful. If you're going to give me that, I'll just write a bunch of code, I don't really want your product. So that's how we discovered this problem, and we met our first customers as part of that. And then that turned into product-market fit one fateful week in late February 2015.
Wonderful. All right, Jack, do we have… Let’s switch to some questions?
Yeah, we've gotten plenty of questions. So thanks, guys, for a great conversation. Let's start off with Tony's question: "I presume that Fivetran's tools are multi-cloud. What does that mean for your customers? Does that mean that they can work across multiple clouds? That they have the freedom of choice regarding the cloud data warehouse they use?"
Yeah, the customers can choose whatever cloud data warehouse they want, and it can be in any region of any cloud. And the sources are often someplace else, either in a different cloud or on-prem, and a lot of older SaaS services are just hosted in their own data centers; they're not in any of the public clouds. So the data is often crossing from cloud to cloud. Fortunately, because all of our data pipelines are based on change data capture, we're only actually moving the data that's changed each time we run, so the amount of data is not actually that large. It's quite surprising: the number of terabytes that goes through Fivetran is much smaller than you would expect, just because everything is like a change stream. I always tease our data warehouse partners that if they were to look at our AWS account, they would be like, "That's for one customer, right?" And we'd be like, "Nope, that's everyone."
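[Editor's note: a minimal sketch of the incremental-sync idea behind change data capture, using hypothetical names and an in-memory destination; Fivetran's actual per-source CDC implementations are far more involved.]

```python
def incremental_sync(source_rows, destination, cursor):
    """Move only rows changed since the last run, instead of re-copying
    the whole table. `cursor` is the high-water mark from the previous
    sync (here, a simple integer timestamp); returns the new cursor."""
    new_cursor = cursor
    for row in source_rows:
        if row["updated_at"] > cursor:
            destination[row["id"]] = row  # upsert into the warehouse replica
            new_cursor = max(new_cursor, row["updated_at"])
    return new_cursor
```

This is why the volume of data in flight stays small: each run moves the change stream since the last cursor, not the full dataset.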
“What’s the business / competitive relationship between Fivetran and the data warehouses? And to what extent can they, or do they offer their own ETL capabilities?”
They all offer something, but it's all more in the vein of a toolkit that you use to build your own pipelines, which is valuable and important. There are scenarios where it's worthwhile to build your own data pipeline: typically, the data source is really large, it's something that's just yours, so Fivetran's never going to build a connector to it, and it's relatively stable, not constantly changing configuration. Then you can build your own data pipeline using the tools provided by Snowflake or by Google or by AWS, and be very successful. Where Fivetran really shines is when you're connecting to a data source that changes really frequently, or a set of data sources that you're always changing, and they're public data sources that lots of people have, either databases or apps. That's where Fivetran really shines.
So this one is regarding compliance. “So when using ELT regarding compliance, things like GDPR, PCI, et cetera, does Fivetran help with this? When data is moved from the source database to the warehouse?”
Yeah. There's a little bit of a wrinkle in that story I just told you, about how you want to replicate everything just so you have it all there in case you ever need it: what if there's personally identifiable information? There are a couple of things we do to help solve that problem. Number one, you can actually suppress data at the table and column level. So if you just don't want to sync something, you can just uncheck it in Fivetran's UI. That is our one configuration thing. I said we're all about automation, but we do have this one thing: there's this checkbox tree where you can go in and select what you do and don't want synced.
And then the other thing we can do is hash the data. So if you want to sync it, but you want to scramble it, we use a consistent hash, so you still have an identifier there that you can join on, but it doesn't contain any personally identifiable information. And the last tool in the toolbox is on the data warehouse side: it's just delete. You delete things if you find you need to. If you accidentally sync something you shouldn't have, or if you sync something and then decide, you know what, I don't actually want to have this column or this row in my data warehouse, you can delete it.
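[Editor's note: a sketch of the consistent-hashing idea George describes. The function name and salt are illustrative, not Fivetran's actual mechanism; the point is that the same input always produces the same output, so the hashed column still works as a join key while the original value is not stored in the warehouse.]

```python
import hashlib

def hash_pii(value, salt="per-customer-salt"):
    """Replace a PII value with a deterministic hash: joins across tables
    still line up, but the raw value never lands in the warehouse."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
```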
“Is Fivetran able to apply MDM, RDM governance concepts in the load process, or is that assumed to occur in the transformation steps within the data warehouse?”
So MDM stands for master data management. For those not familiar with the term, data warehousing is very much a discipline that's been around for a long time. There are a lot of big ideas in data warehousing, all this terminology like surrogate keys and temporal tables and bitemporal tables and master data management and systems of record. And these are great ideas. I don't know if there is a degree in this, but you really could go to school for data warehousing and learn all these great principles that have been developed over the years. Master data management is a little bit of an umbrella term that can mean a bunch of different things. I think the most important thing I've heard it mean is basically deduplicating between systems. The classic example is there's a record in Salesforce with a person's name, and there's a record in Marketo with that person's name: is this the same person? And there are tools that exist to help you resolve these kinds of conflicts.
So there's usually some mixture of automation and then manual intervention. Sometimes you need a human to come in and say, "Oh, that's really the same person," or "That's not." Fivetran does not get involved in that. From our point of view, that's fundamentally something that should be done in the data warehouse after the data is delivered. So we deliver a replica of what lives in all the systems, and then if you want to do that kind of deduplication, you do it on the data that exists in the data warehouse. This is different from how it's historically been done, but the big advantage of doing MDM and other similar things this way is that it's non-destructive. You still have the original data. If you make a mistake and realize later, "Oh, I said these were the same person, but they're actually not, they're two different people," and maybe that data has since been deleted from Salesforce so it's not even there anymore, it's still possible to recover the original data, because you're doing it in two steps: Fivetran copies, and then you modify and write out to a different set of tables.
“So traditional ETL scripts are pretty brittle. Can you describe how Fivetran keeps up with things like schema change in the data source or API object deprecation. Does this cause things to break? Do you have to go back and recreate those pipelines?”
Yeah. Schema changes were kind of our original marquee feature: we actually automated the process of handling schema changes, and it is truly automated. It means that our connectors are a lot more complicated than the ones you would build yourself. When someone writes an ETL script for their particular company, typically they'll write it to support the schema that exists on the day they write it. And if you change the schema, or if you change the configuration of Salesforce or Marketo or you name it, then you're going to have to change the ETL scripts to keep up with that.
The way Fivetran works is different. For every data source we support that has a dynamic schema (so a database, or a system like JIRA where you have custom fields), what we always do is first connect to the metadata endpoint. We find out what's there, and then we construct our queries dynamically based on what we learned from the metadata endpoint. So this means our connectors are like 10 times more complicated than the ones you'd build yourself.
Because we have to go the long way around and make sure we support every scenario. We also have to think about things like: what if the schema changes right in the middle of the sync? Because once you have enough customers, that happens every hour. So we have to contemplate all the cases and corner cases and scenarios. The advantage is that because Fivetran is a product company, we only have to do this once. Once we get 10 customers on a particular connector, we're net ahead in terms of the amount of work done, right? So we have this great advantage that we amortize the cost of all this extra work across many customers. And there's this cool thing that happens where the different customers show you all the different corner cases of the system. We sync so many different Salesforces. I shouldn't say this, it's jinxy to say this because there really are so many things, but we have seen every weird thing that you can do with Salesforce. And there's a lot of them, but it is finite.
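[Editor's note: a toy sketch of the "query the metadata first, then build the extraction query" pattern George describes. The function and column names are hypothetical; real connectors do far more, but the core idea is that the column list comes from the source's current schema rather than being hard-coded, so a new custom field is picked up on the next sync.]

```python
def build_select(table, metadata_columns, excluded=()):
    """Build the extraction query from the columns the source's metadata
    endpoint reports *right now*, minus any the user has unchecked."""
    cols = [c for c in metadata_columns if c not in excluded]
    return "SELECT " + ", ".join(cols) + " FROM " + table
```

A hand-written ETL script would instead bake the column list in at authoring time, which is exactly what breaks when the schema changes.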
And then we’ll close with one softball, hopefully. “So why Fivetran and not a TenTran or ThreeTran?”
Fivetran actually is a pun on Fortran, which the software engineers in the audience will appreciate. Fortran was one of the first big programming languages, a long time ago; it still exists, but it's not used much anymore. Fivetran is just a pun on that, it doesn't really mean anything in particular. And I will say that I am very happy with it as a name all these years later. It's easy enough to spell, it's relatively easy to remember, and it doesn't really mean anything. And nobody used that term before us, except, oddly enough, a band in England in the 2000s named Fivetran. So if you go on Google Trends and look at Fivetran, it is basically just a perfect leading indicator of our growth. And that's a really cool thing to be able to do if you're an enterprise software company: to just ask Google Trends how many people know about the existence of your company this week. So I highly recommend to anyone naming a new company: choose a name that, if you go into Google Trends, is basically zero. Because then you'll know when you've broken through into public consciousness.
That's great advice. Actually, there's a way underrated level of complexity around finding the right name for a company. I agree, it makes sense; it sounds like it can be a verb, and all those things. Very cool. On that note, thank you so much. This was really, really interesting and hopefully educational for the folks who are newer to this world, but also detailed enough for the folks who understand this world well. So thank you very much. I personally really enjoyed the conversation, and I'm sure the group did too.
Thanks for having me. Thanks everyone for their questions.