While it’s been around for 15+ years, Reddit has been on a tear lately: a $367M Series E round announced a few weeks ago, rumors of an IPO, and plenty of Internet action with r/wallstreetbets in particular.
Interestingly, there was a major gap for many years between the central role Reddit has been playing on the Internet and its relatively small team size. While companies like Facebook are largely AI companies (see our conversation with Jerome Pesenti, Head of AI, Facebook), Reddit’s data team was tiny.
Enter Jack Hanlon, VP Data at Reddit and our guest at our most recent Data Driven NYC event. Jack has been tasked with leading the data team into rapid growth, and we had a really interesting conversation, in particular around the following points:
- How is the data team at Reddit organized? (preview: data science, data platform, machine learning, search)
- What’s the data stack? (preview: switch from AWS to GCP, Kafka, Airflow, Colab, Amundsen, Great Expectations, Druid/Imply…)
- What are the key use cases for data science and machine learning at Reddit?
- A book recommendation: “Invisible Women: Data Bias in a World Designed for Men”
Anecdotally, Jack is our second speaker in recent memory who was a regular attendee in the early years of Data Driven NYC, before ascending to leadership responsibilities in a major Internet company! (the other being Alok Gupta, who spoke about leading data at DoorDash).
Below is the video and below that, the transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)
[Matt Turck] Welcome Jack. Reddit needs no introduction – the world’s most popular internet message board. As of February 2021, the seventh most visited website in the U.S. and the 18th most visited in the world, and very much a part of internet history. The company was co-founded by Steve Huffman and our friend Alexis Ohanian back in 2005. And that was actually the first YC class ever, so very much a part of internet lore. A thriving and very influential company, as we know and as we’ve seen recently in the news. Awesome to have you. How does one become VP of Data at Reddit? What was your path to it?
[Jack Hanlon] That’s a great question. I took a fairly nonlinear path career wise. I’ll spare you some of the adventures there, but I was actually a music major in undergrad, passed through a number of different positions, all along the way, sort of thinking I might… I did a stint in sales, I did a stint in marketing, I’ve been in product positions and I couldn’t quite find the place that was right for me. There were no data positions at the time when I was getting out of school, it wasn’t the same deal, even software engineering. When I graduated college, I was sort of thinking of software engineering in some ways as like guys in khakis, working at Oracle. Even when I was in undergrad and getting out, I was like, yeah, I’m not going to do that. And then two years later, I was working at Google. The world changed just a lot.
So I went through a bunch of different positions trying to figure out what I wanted to do. Eventually I became an entrepreneur. I started a company and totally failed at it. And I had no idea what I was doing. And I started another company and that one worked a little better. We scaled it out. It was an ad tech company called Kinetic, and we scaled that out to four offices in North America. Over the course of time on that one, I figured out sort of that intersection point where data creates products, where products need data, and where having a sense of how a business functions is critical in the context of those conversations. It was sort of a sweet spot for me. As I was figuring that out, these more senior data roles were coming into being. We exited that adventure and then I joined jet.com before we had launched to build out data at Jet. We scaled that out. We sold that to Walmart, and then I became an executive at Walmart on the Jet team. Left and consulted for a variety of firms. I was going to take time off and travel with my wife. She said I was the worst unemployed person in America because instantly I was consulting for four or five places, because more and more places want these data leaders, and there just aren’t that many people out there who have actually operated at large scale. I’ve been extremely fortunate in that regard. And Reddit was one of the ones I was consulting for. And then I woke up one day and I worked there.
Great. What do you cover as of now?
Sure. There are four organizations within data at Reddit. Those organizations are:
– Data science – so that is sort of everything from analytics over to what I would think of as decision science: all of the decisions we need to make that are very internally oriented. There’s potentially machine learning there on things like lifetime value or mapping subreddits to geos. That’s across all of consumer, ads, and people data science, so it’s unified across Reddit.
– Data platform – that is data infrastructure, data engineering, the experiment platform, BI tools, instrumentation, and privacy and governance. Those six teams.
– Machine learning, which is separated into two groups. There’s foundational machine learning – that is sort of ML research, content understanding, user understanding, personalization tools. Effectively, what’s our user feature store of embeddings that we want ads, consumer, and all downstream products to be able to leverage to create models, to do personalization, and to do ML in a scalable way. Then on top of that there’s federated machine learning, which is all the people we hire to work on those downstream teams to actually materialize these models.
– Feeds and search – the fourth organization. That’s the infrastructure, relevance, and front-end experience of the Reddit feeds and Reddit search products. So never a dull moment.
How are you guys organized as a group? Are the data science team and the machine learning team centralized or are they spread out in different parts of the organization?
Data science is hub and spoke. Effectively that means all of those folks report into the Director of Data Science, who’s one of my team members. But when we’re in an office setting, they sit with the downstream teams, they go to those downstream teams’ sprint reviews, and they work much more directly with those teams. It’s quite a nested model there.
The same is true of federated machine learning. The same is true of part of data engineering. Again, nested on these sort of pillar teams.
Then meanwhile, for something like feeds and search, we are more a product pillar. And so we have people nested with us in that case from other groups. We would have designers from the design team there, et cetera.
A pretty matrixed organization overall, with a sort of larger break at the top between consumer, ads, and center-of-house things like Finance, HR, et cetera.
How do you decide on roadmap and prioritization?
I mean, that’s a hard question normally, but I’d say for us, it’s been even trickier.
When I came into Reddit, I came in in November of 2019 – 15, 16 months ago. And at the time there had never been a real data organization at Reddit. And there had never been a senior data leader at Reddit, which is why I became an advisor there. They wanted to make investments in data, but they weren’t sure where to make them. And so originally I was just going to make recommendations on how they might scale that organization out. Then the opportunity became so compelling that I ended up joining. But as a result, in terms of prioritization, not only are we doing the prioritization that you would think of on data science – which I think of as maybe like a doctor or physical trainer model, where it’s like, “So you come to us with a problem, and we can give you guidance, and you have to actually participate in that.” I could do pushups in front of you, but you’re not going to get any stronger. How do we work with you to help you get to the right place that you need to get to? There’s of course that relationship, but there’s also just building basic capabilities. When I got to Reddit, there was one person working on data engineering.
How big was the organization at the time?
The data team at the time was 20-something people… The thing is, all of Reddit is hilariously small, right? So when I got to Reddit, I’d say Reddit overall was maybe 600 people, 14 months ago. Right now, maybe 800.
By the end of the year, the data organization I just described will be probably the third largest organization at Reddit with 120 or so people. And at 120 or 130 people, we are still going to feel really thin for the volume that we do and for the things that we do.
So it’s a fascinating environment in terms of, we have sort of enterprise size and needs with the startup part of the journey in terms of frankly revenue, funding, things like that. We feel the tension in a lot of ways there and a lot of the partners we work with and a lot of the problems we’re trying to solve, and that keeps it exciting.
That’s fascinating. I had no idea this was the case. I knew you guys raised a big round, and I guess there are rumors that you’re trying to get on the journey to an IPO...
Yeah, for context, I came in, we were doing probably about 55 or 60 billion events a day into a data warehouse that *one person* was managing. It was definitely exciting, deadly exciting.
Since we started talking about the data warehouse, let’s talk about the data engineering part of the house. What’s the stack? What does the data organization at Reddit run on?
As you mentioned, Reddit has been around for 15 years, so we have some stuff with some cobwebs on it and some stuff that’s pretty state-of-the-art. It really depends on what part of the organization you’re looking at.
From the top view, I’d say we’re AWS prod with, originally, a lot of Postgres for a variety of prod systems.
We actually transit almost all that stuff over to GCP, and we use TensorFlow and BigQuery for our analytics and our model concerns for a variety of reasons that we could spend all of our time just talking about.
We use a lot of open source related tools outside of that.
Like most people, we have a love-hate relationship with Airflow and things of this nature.
Pretty heavy Kafka shop, pretty heavy Kubernetes shop.
What about BI? What does that run on?
That’s a bit of a work in progress. I’d say that I walked into that being a bit fragmented. We were using Mode at that time, but Mode’s explorer functionality left something to be desired. The executive team, like many executive teams, likes Tableau. You know that can be a little pricey. Then, there’s an in-house tool as well that the sales team uses that suits their needs a bit better.
I would say that’s something we probably need to do some cleanup on, but that environment is always an interesting one.
For the machine learning and data science part, is it a bunch of Jupyter notebooks, that type of thing?
Yeah, and Colab certainly, because we’re on Google, so the Colab notebooks and Jupyter notebooks work well for us there. Other things we didn’t mention: we’ve built a bunch of in-house stuff for classic plumbing issues. When we’re talking about schematization or sanity testing, a lot of this stuff has to do with data quality, and we’ve had to build it.
Meanwhile, we just started working with a small company a couple months ago called Anomalo. It does anomaly detection on data quality and measurement stuff. I’ve built that stuff before at Jet. We built that stuff by hand, and so I know the investment it takes. But they’ve been a really good partner.
Very good. On open-source software, you mentioned Airflow, the open-source scheduler. Any others?
Yeah, we’re using Amundsen, the Lyft data dictionary. We try to contribute to that project as well. I think there’s an opportunity for that project to be pretty exciting, where you could use something like Great Expectations to do the data unit tests and output the rules you write into Amundsen, so that Amundsen has a collection not only of the data that’s there and who’s using it, but also the rules around how it should be generated and the success or failure of tests.
I think, with a couple other pieces around it, you could start to build something that ends up being pretty beefy in terms of helping your team be successful in what is still, in my mind, one of the hardest things to solve for at a company when you reach this crazy scale. With the scale of the data, the diversity of the data, everything else, when we get into: what do we have, what’s working, what’s breaking, where do you find things – these problems get really, really difficult. I can’t overstate how difficult at the scale we’re talking about.
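The combination Jack sketches here – data unit tests whose rules and results flow into a catalog entry – might look roughly like the following. This is a hand-rolled illustration with hypothetical names and structures, not the actual Great Expectations or Amundsen APIs:

```python
# Hypothetical sketch: attach data-quality rules and their latest pass/fail
# results to a table's catalog record, so a data dictionary (like Amundsen)
# shows not just what data exists, but how it is validated.
# All names here are illustrative, not real Amundsen/Great Expectations APIs.

def expect_no_nulls(rows, column):
    """Rule: every row must have a non-null value in `column`."""
    failures = [r for r in rows if r.get(column) is None]
    return {"rule": f"no_nulls({column})", "passed": not failures,
            "failing_rows": len(failures)}

def expect_values_between(rows, column, lo, hi):
    """Rule: values in `column` must fall within [lo, hi] (missing = pass)."""
    failures = [r for r in rows if not (lo <= r.get(column, lo) <= hi)]
    return {"rule": f"between({column},{lo},{hi})", "passed": not failures,
            "failing_rows": len(failures)}

def build_catalog_entry(table_name, rows, rules):
    """Run each rule against the rows and bundle the results into a record
    that a catalog could display alongside schema and usage metadata."""
    results = [rule(rows) for rule in rules]
    return {"table": table_name,
            "all_passed": all(r["passed"] for r in results),
            "results": results}

# A toy batch with one bad row (null user_id).
events = [{"user_id": "u1", "score": 10},
          {"user_id": None, "score": 5}]

entry = build_catalog_entry(
    "post_events", events,
    [lambda rows: expect_no_nulls(rows, "user_id"),
     lambda rows: expect_values_between(rows, "score", 0, 100)])
print(entry["all_passed"], [r["rule"] for r in entry["results"]])
# prints: False ['no_nulls(user_id)', 'between(score,0,100)']
```

In a real setup, the rules would presumably be Great Expectations expectation suites and the resulting record would be pushed into Amundsen’s metadata store rather than printed.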
Do you have something with data lineage to precisely track where those issues appear?
Yeah, I think that’s an area where we’re more building into it. It’s interesting. You would know more about this than me, Matt. But the companies I see, it feels like a number of the startups I see in this space get pressure to expand, to be more horizontal solutions. Then frankly, we want them for one thing, but not for the three other things that they want to do. So, it’s like, are there players out there who could do this for us plug-and-play? Yes, but what’s the lock-in look like for these other things? If their solution’s a little off on something else, what does that mean for us? It’s been a little mixed as you get further into the quality lineage infrastructure area in terms of how much we can lean on other people versus build.
Actually, I love this – Chip, the speaker after you is asking questions in the chat, so I’m going to relay some of those in real time. How many rules do you have for Great Expectations? Also, do you want to, maybe, explain what Great Expectations is?
We’re just getting started with that, so of the things I mentioned, Chip has of course identified the one I know least about. What it is, effectively, is unit tests for things that are going to output data. What should the data output look like? There are different feelings about how you might achieve this. Developers are constantly shipping code that’s going to change data. But a lot of times, we’re looking at the events that we’re sending through off of new features and saying: are these events firing the right way? Is the data output what we’re expecting?
You can write the tests there to identify that. You can do event validation to say, “Events that don’t match certain things get dropped” and put in the penalty box, which we’ve done, and tell developers, “You can’t actually join these to anything. You can’t actually use them. If they start to land in the penalty box, you just have to go fix this.” Then, we’ll send the stuff back through.
Then, you can also do anomaly detection, which we’re doing – that’s for stuff that has made it through and is wrong. With data quality, there’s simply no single area where you’re going to solve it all. I think you have to look at the areas where it breaks down, and how you mitigate the risk in each one of those areas. In terms of the number of rules if we scale Great Expectations out to all teams, it would be hard for me to imagine how big that’s going to get.
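The validation-plus-penalty-box pattern described above can be sketched in a few lines. The field names and rules here are hypothetical, not Reddit’s actual event schema or pipeline:

```python
# Hypothetical sketch of event validation with a "penalty box": events that
# fail schema checks are quarantined rather than loaded into joinable tables,
# so the emitting team has to fix them before they re-enter the warehouse.

REQUIRED_FIELDS = {"event_type": str, "user_id": str, "timestamp": int}

def validate(event):
    """Return a list of schema violations for one event (empty = valid)."""
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], ftype):
            errors.append(f"bad type for {field}: {type(event[field]).__name__}")
    return errors

def ingest(events):
    """Split a batch into loadable events and penalty-boxed events."""
    accepted, penalty_box = [], []
    for event in events:
        errors = validate(event)
        if errors:
            penalty_box.append({"event": event, "errors": errors})
        else:
            accepted.append(event)
    return accepted, penalty_box

batch = [
    {"event_type": "vote", "user_id": "u1", "timestamp": 1613000000},
    {"event_type": "view", "user_id": "u2"},               # missing timestamp
    {"event_type": "post", "user_id": 42, "timestamp": 1}, # wrong user_id type
]
accepted, boxed = ingest(batch)
print(len(accepted), len(boxed))  # prints: 1 2
```

In production this check would presumably sit in the streaming path (e.g. on the Kafka consumer side), with the penalty box being a quarantine topic or table rather than an in-memory list.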
Well, for anyone who’s interested, we’ll have the CEO of the company behind Great Expectations on sometime soon.
Cool. Count me in the audience for that, that’ll be fun.
What do you guys use for ETL or ELT?
This is one of those ones where we’re also thinking about what we want it to look like. I’d say, in general, because there’s also streaming, we’re thinking a lot about where streaming needs to be the path versus where batch needs to be the path. I think we’re ending up leaning heavily into being a Kafka shop, building more of this stuff into Baseplate and into more of these core services. We could talk another day about what the path forward is on this. I think this one is in motion.
That was data engineering. Let’s talk about machine learning or data science use cases.
I’d say they end up being quite different for us, the two teams. So, if I start at machine learning, I’ll say Reddit has the greatest conversational text corpus in the world. I can say that with confidence, both because you can see the type of research that comes out of it, but also Google, Facebook, Microsoft, and OpenAI all use Reddit’s data to train their conversational AI models.
We’re talking about some of the companies in the world with access to the greatest data, who chose to use Reddit’s data to train those models, because there’s just a general understanding that if you get into conversation – depth of conversation, breadth of conversation – Reddit’s data is the gold standard there. With that in mind, what’s amazing is that when I got to Reddit, I think we had about three people working on machine learning in the whole company. All of a sudden, all of the other people in the space were actually getting more leverage out of Reddit’s data than we were.
Part of our objective has just been how do we build up these core foundational capabilities to do things. But that could be a long journey. If I’m just saying, we’re going to go on this five-year journey to get to core capabilities, that’s probably not going to get us where we need to be. So, we’re simultaneously tackling, how do we get this foundational work in? Then, how do we do some really exciting things?
Last year was the first year that every Reddit surface saw personalization. It doesn’t sound super novel, but personalization at that scale is challenging and can be really tricky. It’s no secret that with misinformation, filter bubbles, et cetera, people are becoming a lot more cognizant of how problematic these systems have become in other places. I think Reddit has been very fortunate that we had not gone into that much earlier, so now we can go in much more eyes open.
The kind of progress we saw just last year has been amazing. You’ll see this year that people’s home feeds will be breaking the subscription wall and having recommendations in there. We have a lot of work around how we’re making sure that we are explicitly breaking filter bubbles and things like this. I think we’ll have some really exciting features there to make the experience not the same, and not dangerous in the same way.
The main use cases are around user experience effectively?
Yeah, so much around user experience, but also around ads experience. Ultimately, we have this conversational text, that’s amazing data. The other thing about it, though, is we don’t have first names, last names, addresses, and we don’t want that. We don’t need that. We don’t need to know that you’re Matt and where you live. For us, those things are not super relevant. We want to see the things that you’re passionate about.
As a result, you can think of it this way: in some places, your identity is performative. It’s the identity that you choose to showcase. One of the things I’ve loved about r/wallstreetbets and other places is how much people are willing to talk about not just their wins, but their losses. What other social network experience are you having where someone’s like, “I lost 10k today on a bad idea”? It’s just not very common.
That’s not the self I want to put out into the world, potentially. But under pseudonyms, people are very comfortable being very honest. With that honesty, and with the depth and breadth of content they’re willing to engage with, you actually get a really interesting sense of what people are passionate about. And when you have a strong sense of what people are passionate about across a broad set, I actually don’t want to know personal details about this person. I want that privacy protected for somebody. I don’t want to handle those details and have that be at risk. All I want to do is create an amazing experience for people, connect them to the communities they care about, and be able to connect them with advertisers that they also could care about.
On the advertising front, I read somewhere that you also had to redo a lot of the infrastructure. Going back a little bit to the infrastructure, there’s a switch to Imply…
To Druid and Imply? Yeah, that’s part of why I was kicking the can down the road on some of the data engineering stuff. I think we’re trying a bunch of different things in different places to see what’s going to best suit us. There’ve been some really exciting uses of Druid and Imply on the ad side that seemed to speak to their needs, and that probably would speak well to the needs on video as well. We’re thinking about how unified that stack needs to be, and what things are working best in what places. It’s exciting for us. We have so many different things going on with the different media formats and the different things that we engage in that we can test a lot of things and figure out what’s best for us.
Very cool. All right, as we are getting close to the time, a few more questions from folks. Muhammad asks, what motivated the move to GCP from AWS, to the extent you can talk about it?
Redshift isn’t very good for the things that we wanted to do. I think with SageMaker there’s a lot of promise and a lot of capability. But in terms of having an analytics tool for the things people are requiring, Redshift feels way behind. Whereas, a number of years ago, it was fantastic. The core architecture there has not kept up. Snowflake, on the other hand – you could put Snowflake on top of AWS. I think that’s a very viable solution that we certainly would consider, but the switching costs from one solution there to another are quite time-intensive and quite expensive, so we’d really have to love them to make that change. GCP, with the other tools we’ve built on top of it and things we’ve built in it, including so much of our ML infrastructure, is serving us well. But there are significant transit costs there. There’s a definite trade-off. Redshift really wasn’t ready for prime time.
Great. Actually, there are several very good questions. Let me pick maybe one, possibly two, before we wrap. TJ says that he or she has been tasked with the responsibility of ensuring that their AI/ML efforts are not biased. How would you recommend another data leader think about that?
I love that question. We could spend the whole time talking about that. I spend a ton of my time working on that topic. One thing I can say is there’s an amazing book to expose you to an early part of this process. The writer Caroline Criado Perez wrote a book called Invisible Women: Data Bias in a World Designed for Men.
What it talks about is, let’s say you’re in an environment where you are using data to create policy. That’s fantastic – people want to be data-driven. But what happens when the data that you’re using to drive that policy has bias in the way it was collected? This is a very real reality, and many times that bias directly impacts a variety of marginalized communities, whether communities of people of color or communities of women, or both. The book is fantastic in that regard.
But I would say, look at all of the steps. The way that data is collected, the way that it is aggregated, the way it’s reported on, the way it’s generated, what’s happening in data quality. I would say breaking out all of the steps from collection all the way through will help you identify the risk factors in each area. One other thing I could recommend is LinkedIn has done some really interesting publishing about how, in their experimentation tool, they are reporting on the impact to marginalized communities. I love this idea.
It’s one thing to say you’re A/B testing these two ideas and that version B is outperforming version A, but that’s usually looking at aggregate groups, or maybe some large cohorts beneath them. Rarely is it looking at very small subgroups that could be negatively impacted. LinkedIn had an experience where one feature was net positive, but, as it turns out, it was directly harming women’s ability to get exposure to certain types of senior roles. That’s what exposed them to the idea that even a feature that looks good in aggregate could be impacting communities in ways that can really be harmful.
I also love the idea of providing explicit optics into communities of interest to really understand the impact there. But I’d say breaking it apart into chunks and identifying the risks in each of those areas has got to be step one. It’s a huge problem, but it’s an incredibly exciting one. It’s everybody’s problem in here. Whether or not you are a leader working on it today, it’s coming down the road for you, for sure.
Great. Thank you. Just to close – and actually, you’re going to be our guinea pig, because we haven’t done this before – let’s try just rapid-fire three questions. Actually, you’ve just answered one, which is: what’s your favorite book, newsletter or podcast recommendation?
I think that book is fantastic. Definitely, read it.
Then, what new data trend or product in the overall data ecosystem are you most excited about or curious to learn more about? What has come up on your radar that seems super interesting?
Related to what I was just talking about. Consumers are more and more conscious of how systems are taking their data and creating these outcomes. They’re more and more conscious of how they’re being led down certain roads in certain cases. I love the questions that are coming out about ethical use here. The way I describe this sometimes is, look, we don’t just let anyone build a bridge. You get certifications, you get training to build a bridge.
A lot of times, you see data-powered products being built, where it’s someone like, “Building a bridge? I’ve driven across a couple of bridges before. I bet I could draw one. Let’s build one.” The impact of building social networks, the impact of building products that are data-driven and personalized has real-world impact on people and can have real-world harm. The lack of cognizance and thoughtfulness about the design there is something that consumers should be upset about and certainly should be well-informed about. I love that those conversations are happening more now. That’s only good.
Great. On that note, thank you so much, Jack. This was terrific, really interesting. We appreciate your coming by.
Thank you. I’ve been a long-time attendee. This was a delight. Thank you, Matt.