Pinterest is near and dear to our hearts at FirstMark because we had the good fortune of being the first institutional investor back in 2009 when the company was just getting started (fun fact: the founders were in New York for a brief moment in time before moving to the Bay Area). Pinterest has had a remarkable ride ever since, and it’s a $49B market cap public company at the time of writing.
So it was a particular pleasure to welcome Dave Burgess, Head of Data Engineering, to come and talk to the Data Driven NYC audience about all things data at Pinterest.
We covered a bunch of interesting topics, including:
- Pinterest’s newly open sourced project, QueryBook
- The stack Pinterest uses to manage its 400 petabytes of data
- The use cases for data analytics and machine learning at Pinterest
Below is the video and below that, the transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)
TRANSCRIPT (lightly edited for clarity):
[Matt Turck] I’d love to start with your background and your path to becoming Head of Data Engineering at Pinterest.
[Dave Burgess] Going all the way back, I grew up in a wee village in Scotland, just 500 people, and I studied maths and physics, computer science, and numeric analysis.
When I was doing my masters, I was asked if I’d come and help teach a course at Stanford, and I had never heard of Stanford at that point. I asked my advisor, ‘Where is Stanford?’ and he said, in Connecticut. [Laughter] I looked it up in the old atlases, we got the maps out, and sure enough Stanford [editor note: Stamford] was in Connecticut. But two weeks later he said it was in California. I came across for the summer and enjoyed it so much that I kept coming back every summer while I did my doctorate at Oxford, and then I did a postdoc at Stanford.
Then after that, I wanted to get into the tech world, so I went into computer telephony integration and data integration. I started at IBM, back in the day, in New Zealand, and then came back across here and worked for a B2B startup doing business-to-business integrations called E2open. I was the fourth engineer there, and that company eventually went IPO after 12 years. Then I joined Yahoo! in 2006, when it was in a really fast-growth phase; data engineering was just 50 people when I joined and it grew to a 600-person organization. At Yahoo! I was actually the Chief Architect of the entire global display advertising business, which was a billion-dollar business. I came all the way up through the ranks, from engineer to senior engineer to principal engineer to architect.
As a quick aside, it’s amazing actually how important Yahoo! has been to this whole Big Data ecosystem. So many fantastic people have come out of Yahoo!
That’s right. Yahoo! created Hadoop, for those that don’t know. Hadoop was born at Yahoo! and then was spun out as a separate company. Within that data organization, things like Kafka eventually came out of LinkedIn, based on work we had been doing at Yahoo!, and many other things besides. We were doing machine learning and behavioral targeting in advertising more than 15 years ago. We were the first in a lot of things, and we also did a lot of data privacy work at that stage, which has only really come to the forefront in the last couple of years. One of the great things about Yahoo! is that there are so many people spread around the Bay Area, and probably New York as well, that have been at Yahoo! It’s a really good network of folks, and it was a real pleasure to work there. I was there for six years.
Great. And then you went to Twilio, right?
Well, just before Twilio, I was CTO for a couple of startups. The first one was in Dynamic Creative Optimization, for advertising and web page optimization. Then I went to a startup doing a social network for marketers. I spent about five years trying my luck there. It was a great experience to go into a startup, but then I got attracted to joining Twilio. I joined Twilio to be the Head of Data Engineering there, and then came across to Pinterest just over two years ago, just before the IPO, so I could join the party.
Just today, Pinterest and your team announced that you open sourced a project that has been developed internally: Querybook.
Yeah. We’re really excited that we’ve open sourced Querybook. It’s used extensively within Pinterest. It’s basically a human-friendly IDE, a notebook, for creating data queries, doing analysis, and sharing your queries. It’s actually a very simple but very powerful tool.
It’s a collaboration layer, so that multiple people can query the data and share their queries?
That’s right. We’ve had this philosophy at Pinterest, going back five years, where a lot of the data was shared with everyone. Now we’ve locked it down a little bit more, but you’re able to go in and query many, many hundreds of tables. Starting with other people’s queries and getting some documentation around those queries is really important. That’s one of the things Querybook does: you can take someone else’s query, or you can just explore the tables that are available and create queries really easily.
It passes on the query to an underlying search engine or query engine?
That’s right, through a SQL-compliant API. We query using Presto, Hive, Spark SQL, and MySQL, but you can use other engines too.
Out of curiosity, how are you going to promote this new open source project? How are you going to create a community around it?
We’ve already got a couple of companies that are working with us on it. We’re looking for more engineers to join us in the open source efforts of that.
If anyone is interested, please check out querybook.org. You can download it and get it running within a couple of minutes. If there are any issues, just ping me and let me know. We really are keen to share this with everyone.
How does an organization like Pinterest think about open source? What are the drivers, in this case, for you guys to decide to open source it now?
Overall, we’re a strong advocate of open source. We use open source extensively within Pinterest. We like to give back as well. We’re often contributing patches back to other open source code. When we can, we open source things that are not directly competitive with Pinterest that are good for the community.
All right, that’s the big news today, the open sourcing of Querybook. I’d love to take a step back from this specific tool. Could you give us a broad picture of the data engineering and analytics stack at Pinterest? What do you use, across the various tools and systems?
In data engineering we have many, many tools, and we can maybe cover that a bit later. But for analytics itself, we focus on Hadoop and Spark. We’re moving more and more to Spark now because it’s faster, but Hadoop still scales really, really well.
We have a query platform which is Presto and Spark SQL. We used to use Hive, but we’re migrating off Hive to Spark SQL. We just have the two main engines.
Then we have a workflow system on top of that that’s built with Airflow which is also open source.
The way we get all this data in is using Kafka. Kafka has clients in all of our serving systems; it gathers the data from those clients and pipes it to the back end.
All of Pinterest runs on AWS. We put all of this data into S3, and it’s a massive amount of data, more than 400 petabytes. We spend quite a bit of time putting it into different buckets and making sure the buckets are partitioned correctly so that these engines will perform at scale. It’s also important to put it in the right format. As much as possible, we want to put it into the Parquet columnar format so these query engines can be fast.
Maybe for the broad audience, define what Parquet is?
Yeah. It’s just a columnar data format. A row format would be comma-separated values, but with a columnar format you can scan through columns really, really quickly.
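[Editor note: here is a toy Python sketch of the row-versus-column distinction Dave describes. The data is invented, and this is not actual Parquet encoding; it just shows why a columnar layout lets an engine touch only the columns a query needs.]

```python
# Toy illustration of row vs. columnar layout (not real Parquet encoding).
rows = [
    {"pin_id": 1, "board": "recipes", "saves": 120},
    {"pin_id": 2, "board": "travel",  "saves": 45},
    {"pin_id": 3, "board": "recipes", "saves": 300},
]

# Row format (like CSV): every field of a record is stored together,
# so aggregating one column still touches every field of every row.
total_row_scan = sum(r["saves"] for r in rows)

# Columnar format (like Parquet): each column is stored contiguously,
# so a query engine can read just the "saves" column and skip the rest.
columns = {key: [r[key] for r in rows] for key in rows[0]}
total_col_scan = sum(columns["saves"])

assert total_row_scan == total_col_scan == 465
```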
Do you have a data warehouse or does everything run on this big data lake around S3 and Hadoop?
We do have a few data warehouses. We have one data warehouse that is specifically for our external metrics. These are metrics that we give to Wall Street that have to be really well vetted and validated and curated.
Is that a Redshift since you’re on the AWS stack?
We actually use Oracle for that, since it’s a small amount of data. For the vast amount of other data, we separate what’s sensitive, data that includes personally identifiable information, from data that is not sensitive. We also have access controls on all of this data. But all this data is available in S3, and it is available through all these tools.
It actually brings an interesting question. How do you identify the PII? Have you built special systems to see what’s sensitive information?
It’s really identified right at the source, where the data is created. We have a schema for each piece of data that comes into our system, and along with the schema we annotate that certain fields are personally identifiable information. It can also be that a combination of fields becomes personally identifiable. There’s quite a bit of work that goes into determining whether something is PII or not.
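[Editor note: here is a minimal sketch of what annotating PII at the schema level could look like. The schema shape, field names, and helper functions are invented for illustration; they are not Pinterest’s actual system.]

```python
# Hypothetical schema with per-field PII annotations, in the spirit of
# what Dave describes. Field names and structure are invented here.
EVENT_SCHEMA = {
    "name": "pin_view_event",
    "fields": [
        {"name": "pin_id",     "type": "long",   "pii": False},
        {"name": "user_email", "type": "string", "pii": True},
        {"name": "ip_address", "type": "string", "pii": True},
        {"name": "timestamp",  "type": "long",   "pii": False},
    ],
}

def pii_fields(schema):
    """Return the names of fields annotated as personally identifiable."""
    return [f["name"] for f in schema["fields"] if f["pii"]]

def redact(event, schema):
    """Drop PII fields before writing an event to the non-sensitive store."""
    sensitive = set(pii_fields(schema))
    return {k: v for k, v in event.items() if k not in sensitive}

event = {"pin_id": 1, "user_email": "a@example.com", "timestamp": 1613001600}
assert redact(event, EVENT_SCHEMA) == {"pin_id": 1, "timestamp": 1613001600}
```

Annotating at the source like this means every downstream system can decide programmatically which store a field belongs in.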
What are some of the use cases?
We have many, many use cases. Analytics is a small part. Just about everyone in Pinterest is using it for analytics every day, but we also have a massive number of experiments.
We’ve got our own experimentation platform where we’re doing A/B experiments all the time to try and improve Pinterest and to improve our ads and our ad relevance. We have about a thousand experiments running in parallel at any point in time. We’re constantly iterating. So that’s another use case.
Then we use it a lot for machine learning. We have about 80 different use cases of machine learning and that’s mostly from this data that comes into S3 and AWS.
What are some of those examples of machine learning? I know that visual search at some point was a really important breakthrough.
Yeah, it is. If you do a comparison of visual search, take a photo or get a picture, upload it, and see what the results are between Pinterest and Google and Bing and other search engines, you’ll find that Pinterest is usually the one that comes out on top. We’re really proud of that. We use a lot of the data that we’ve got in order to do that. Our Pinterest users, which we call Pinners, create boards of pictures and videos. Usually the boards have similar images within them, or at least a similar topic. We can use this underlying data that’s been organized by all our users to come up with really, really high-relevance results.
We use it for things like classifying pins: what is within the pin, which is the picture? We can identify objects within an image or within a video, and we can also identify whether content is safe to show our users. We want to have safe content, and we remove content that doesn’t meet that bar. We also have recommended pins. One of the things people love about Pinterest is being inspired, finding something they enjoy, then finding something else they enjoy, and continuing. Being able to find these recommended pins, and showing them so people can have a really wonderful, inspirational experience, is something we work on as well.
We also have shopping pins, and finding related pins. Like I was mentioning before, we’re trying to keep our ads and our shopping ads as relevant as the organic content itself. It’s a seamless experience, not a jarring experience like with other products. We do a lot of work on that side as well.
Since we’re talking about machine learning, what are some of the tools that are being used by the machine learning folks and frameworks?
In machine learning, two or three years ago, it was a Pandora’s box. Everyone was trying out everything, and with all these different use cases it started to become a maintenance burden.
What we wanted to do and what we’ve done is created a machine learning platform that glues together the best of breed open-source products out there. We use TensorFlow from Google, we use PyTorch, scikit-learn, we use MLflow which is from Databricks.
It’s a registry of machine learning models, and we also have our own format for storing features within Pinterest, both for batch training of models and for serving them.
What we can do, within a few hours, is create a model, train it, and then deploy it to our production fleet of thousands and thousands of servers by using this MLflow repository.
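[Editor note: here is a toy sketch of that train-then-register-then-serve loop. The registry below is just a dict and the “model” is trivial; in the platform Dave describes, the registry role is played by MLflow’s model registry, and training happens in TensorFlow, PyTorch, or scikit-learn.]

```python
# Toy train -> register -> serve flow. All names here are invented.
registry = {}

def train(labels):
    """'Train' a trivial model that predicts the mean of its labels."""
    mean = sum(labels) / len(labels)
    return lambda: mean

def register(name, version, model):
    """Record a trained model under a (name, version) key."""
    registry[(name, version)] = model

def serve(name, version):
    """Look up the model a serving fleet would load."""
    return registry[(name, version)]

model = train([2.0, 4.0, 6.0])
register("related-pins", "v1", model)
assert serve("related-pins", "v1")() == 4.0
```

The point of the pattern is that serving systems only ever reference a named, versioned model, so a new version can be trained and rolled out without redeploying the servers themselves.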
That framework is not open source yet, right?
That’s not open source yet. We are considering it. If there’s interest from the community here, then we’ll consider even harder whether to release it.
You mentioned Kafka. Talk about batch versus streaming, what are some of the use cases, and when do you use streaming.
Streaming for us has taken off in a big way over the last 18 months. First of all, we bring the data through Kafka, and then we use Flink for stream processing. From there it can go to a variety of different use cases. For example, one of the things we do now is, when you create a pin, you can have some analytics around that pin to see how many people are viewing it and what the uplift is. We have these fast creator analytics. We go all the way from Kafka to Flink and then into Druid; we use Druid for the real-time analytics. But in order to show a pin, we also want to make sure it’s safe and that we’re showing it in a relevant way, so we have to do all the categorization of that pin and then filter out the things that we really shouldn’t, or really don’t want to, show. That whole pipeline has about 50 different machine-learning models in itself.
On the advertising side, we also want to do things like budget allocation. When you’re spending on an ad campaign, you want to make sure you’re showing the number of ads that you’ve been asked to show, but not too many more, otherwise you’re eroding your own budget. We use streaming to do ad counting as well, tracking how many ads have been shown for each campaign. We also have a real-time ads campaign manager where you can look at the performance of your campaign and start, stop, or change it, and a lot of that comes through streaming, going through Flink and then into Druid as well.
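[Editor note: here is a minimal sketch of the kind of windowed aggregation a Flink job might do for the creator analytics mentioned above: counting views per pin in one-minute tumbling windows. This is a plain-Python stand-in, and the event shape is invented.]

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs=60):
    """events: iterable of (timestamp_secs, pin_id) pairs.
    Returns {(window_start, pin_id): view_count}."""
    counts = defaultdict(int)
    for ts, pin_id in events:
        # Assign each event to the window containing its timestamp.
        window_start = ts - (ts % window_secs)
        counts[(window_start, pin_id)] += 1
    return dict(counts)

events = [(0, "pin-a"), (10, "pin-a"), (65, "pin-a"), (70, "pin-b")]
counts = tumbling_window_counts(events)
assert counts == {(0, "pin-a"): 2, (60, "pin-a"): 1, (60, "pin-b"): 1}
```

A real Flink job adds what this sketch omits: event-time watermarks for late data, keyed state, and a sink such as Druid for serving the results.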
By the way, you mentioned a number of open source frameworks: Druid, Kafka, Flink. Does an organization like Pinterest work directly with the open source, or are the vendors involved as well? Kafka has Confluent, Druid has Imply, and Flink is part of Alibaba now, I think. Because you have the engineering resources, do you take the open source and work with it directly?
Yeah, we do. We’re fortunate to have the number of engineers to do that. One of the reasons why we work directly either with the open source community or the companies that are also working on that is that we want Pinterest to be up all the time. If there’s a problem, we want our engineers to be able to go and look at the code and be able to fix it and deploy it into production and have very, very little down time. If we were dependent on a third-party for that, then maybe the downtime would be a little bit longer.
We also want to develop our own features and capabilities that Pinterest needs. Building on top of open source and then contributing back to the community is great for the engineers, in terms of their careers and what they enjoy doing, great for the community, and also great for Pinterest, because we stay up to date with the latest open source versions and get the benefits of that as well.
How are the data team and the machine learning team organized? Are they separate? Do they work together? Are they centralized or distributed?
Data engineering at Pinterest comprises the serving part, all the online systems. That includes the online databases like MySQL, key-value stores like RocksDB and HBase, and the Druid platform. Beyond our online systems, we also have all the batch systems and the analytics and experimentation platforms we discussed previously. Then there’s the machine-learning platform piece as well: we provide the glue and the underlying infrastructure, with TensorFlow and PyTorch and all the other components, to our internal customers.
As for machine learning itself, it’s actually distributed to everyone. We have many, many machine-learning engineers and many, many use cases, and those engineers are embedded within each organization. There are quite a few within shopping, quite a few within our advertising business, and also in different parts of our product and trust and safety. Then we have a separate product analytics and data science team as well that uses all the capabilities of data engineering.
What new data trends or products, in the overall ecosystem, are you most excited or even just curious about?
Well, I’d say there are three; I don’t know how to choose one. I’m really excited about the machine-learning capabilities being developed. This has advanced very, very quickly, and it’s great to see lots of companies providing offerings that improve what we do with ML, to get to a point where data scientists who don’t even code can build models and deploy them to production. That’s really what we want to get to within Pinterest as well. This area has been hot for several years now, continues to be hot, and is changing a lot.
The next area I would say is data privacy. I think data privacy is really, really important to everyone. We all want to have our data protected, know that it’s being used in the right way, and have more controls around our own data. There are quite a few products coming out now for data privacy compliance, data lineage, and data leakage detection that help you with GDPR. I’m excited about that area because there’s been a movement, certainly over the last year within the U.S., earlier in Europe and in California, around what data privacy really means, and this is a big question in politics today.
Then the third one is streaming. Being able to do things in real time is very cool. It benefits users in a great way: if they take an action and you have a model that predicts what they’ll enjoy seeing next, that’s really welcome.
We have a bunch of questions from the audience, let’s start with this one: “How did the migration from HBase to Druid work out? Also curious if you looked into additional data stores such as ClickHouse or Redshift?” That’s a question from Rohit.
Yes, we did a migration from HBase. We were using HBase for analytics and for many other things as well. We started out using Druid for one use case within Pinterest and proving that it worked really well. Being able to slice and dice with millisecond or second latency was really good. It worked really well for that first use case, so we decided to bring it into data engineering as part of the platform, and then we had a bunch of other use cases.
We were also using it for the experimentation platform. We had so much data within our HBase cluster that it was getting really slow and a very difficult overhead for our engineers to manage. We decided to migrate that to Druid. The maintenance cost has gone way down, the latency has gone way down, and the actual cost of running the infrastructure is way down. We’re really happy about it.
A question from Pierre. How do you observe and monitor the data flowing across these different systems and tools?
That’s a great question. There are different kinds of monitoring. There’s systems-based monitoring, where you want to see that each system is performing well. There’s also monitoring of data quality going from one system to another, to make sure that you’re not losing any data along the way. And there are other kinds of monitoring. We use different tools for the different things. For data monitoring you’d want something like Splunk or another tool like that. We built our own; it’s called Goku, and we’re actually contemplating open sourcing it. Goku is a time series database in which you can do time series analysis. It’s great for production monitoring systems. If people are interested in that, then let me know and we might consider open sourcing it. It was based off Facebook’s Gorilla compression algorithm; we built a time series database around it.
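[Editor note: a core trick in Gorilla-style timestamp compression is delta-of-delta encoding: because monitoring points usually arrive at a regular interval, the deltas between timestamps are nearly constant, and the deltas of those deltas are mostly zero. Here is a toy sketch of that idea in Python; real Gorilla packs these values at the bit level.]

```python
def delta_of_delta_encode(timestamps):
    """Encode timestamps as a (first, first_delta) header plus
    deltas-of-deltas for the rest."""
    if len(timestamps) < 2:
        return timestamps[:], []
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dod = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]], dod

def delta_of_delta_decode(header, dod):
    """Invert the encoding above."""
    if len(header) < 2:
        return header[:]
    first, delta = header
    out = [first, first + delta]
    for d in dod:
        delta += d
        out.append(out[-1] + delta)
    return out

# Regularly spaced timestamps become runs of zeros, which a bit-level
# encoder can store in very few bits each.
ts = [1000, 1060, 1120, 1180, 1240]
header, dod = delta_of_delta_encode(ts)
assert dod == [0, 0, 0]
assert delta_of_delta_decode(header, dod) == ts
```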
A question from Gadalia: can you please speak more to your use of Kafka and Airflow? I guess Airflow is a scheduler; do you use it as such, or did you build something on top?
Well, we’ve done both. We had an existing workflow system that was built in house about seven years ago, and then over the last year we migrated to Airflow. We’re still using the APIs of the existing system so that we could migrate. We have over 2,000 workflows running in production, and migrating all of that to Airflow was difficult. We’ve also built some UIs on top of Airflow. There are some internal names here, but there’s one that basically allows you to compose a workflow really easily through a drag-and-drop interface. Then we have another one for taking many, many workflows and optimizing which jobs to run and when.
When you’ve got many thousands of workflows, you can start to decide: I’m not going to re-run a job that multiple workflows need; I’m going to run it just once, and its result can be used by both downstream. Those are a couple of tools we’re also considering open sourcing.
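[Editor note: here is a minimal sketch of that cross-workflow optimization: when several workflows depend on the same upstream job, schedule it once and share the result. The workflow and job names below are invented.]

```python
def plan_runs(workflows):
    """workflows: {workflow_name: [job, ...]} with jobs listed in
    dependency order. Returns the de-duplicated list of jobs to
    execute, preserving first-seen order."""
    seen, plan = set(), []
    for jobs in workflows.values():
        for job in jobs:
            if job not in seen:
                seen.add(job)
                plan.append(job)
    return plan

workflows = {
    "daily_metrics": ["ingest_events", "sessionize", "metrics_rollup"],
    "ads_reporting": ["ingest_events", "sessionize", "ads_rollup"],
}
plan = plan_runs(workflows)
# "ingest_events" and "sessionize" run once even though both workflows need them.
assert plan == ["ingest_events", "sessionize", "metrics_rollup", "ads_rollup"]
```

At thousands of workflows, collapsing shared upstream jobs like this saves both compute and wall-clock time, since downstream steps in every workflow read the single shared output.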
Big open source year coming up.
Yeah, yeah, yeah. Well, it all depends on … I think it really depends on interest from the community. We want to make sure that there’s interest there, because it’s actually quite a bit of work for us to open source something and to manage it over time. But if there’s benefits and others want to chip in and chime in, too, then that’s great.
Kafka basically brings all the data into S3, and then we run batch workflows using Airflow. The batch jobs take the data from S3, usually Spark, Hive, or Presto jobs, or Spark SQL, process the data at every step, and write it back to S3 along the way.
Just one last question from the audience: “What is the pathway that a beginner should go through to become a data engineer?“
There are many different kinds of data engineers. First of all, you probably need to learn Java and Python, or learn one of those first. From there you want to decide: do I want to become a systems engineer who builds the underlying platform, do I want to be someone who integrates all of these systems together, or do I want to be more on the data analysis side and be a machine-learning engineer or data scientist? Deciding what your passion is, I think, is really important. You should go with what you’re really passionate about, because if you love something, you’re going to learn it really well, and you’re going to enjoy working in that area for a long time.
Based on those answers, you decide what to learn. That’s either looking at open-source software in an area you really enjoy, or one you’re already working on within your company: do a deeper dive, get into open source and start contributing, or start building applications on top of the data for users and see if you can get traction, both internally and externally.
Thank you so much for spending your time and super interesting. Really appreciate it.
It’s a real pleasure to be here. Thank you.